amity-voice-segmentation-01

This model is a fine-tuned version of pyannote/segmentation-3.0, trained on the amityco/sample-voice-12-records dataset and on private datasets drawn from real contact-center data. It achieves the following results on the evaluation set:

Loss: 0.1038
Model Preparation Time: 0.0025
Der: 0.0468
False Alarm: 0.0280
Missed Detection: 0.0188
Confusion: 0.0
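The three error components add up to the reported DER, which matches the usual definition of diarization error rate as the sum of false alarm, missed detection, and speaker confusion. A minimal check using only the values listed above (the variable names are illustrative):

```python
# Error components reported on the evaluation set.
false_alarm = 0.0280
missed_detection = 0.0188
confusion = 0.0

# DER is conventionally the sum of the three components.
der = false_alarm + missed_detection + confusion
print(f"DER = {der:.4f}")  # prints "DER = 0.0468"
```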

Usage

from diarizers import SegmentationModel
from pyannote.audio import Pipeline
import torch

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# load the pre-trained pyannote pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
pipeline.to(device)

# replace the pipeline's segmentation model with the fine-tuned one
model = SegmentationModel().from_pretrained("amityco/amity-voice-segmentation-01")
model = model.to_pyannote_model()
pipeline._segmentation.model = model.to(device)

# run diarization on an audio file
diarization = pipeline("audio.wav")
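Once the pipeline has run, a common next step in a contact-center setting is to total the speaking time per speaker. The sketch below is one possible way to do that, not part of the model's API: `speaking_time` is a hypothetical helper, and the segment values are made up so the example runs without audio or a model. With pyannote, the `(start, end, speaker)` tuples could be collected from the output via `itertracks(yield_label=True)`, as noted in the comment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def speaking_time(segments: Iterable[Tuple[float, float, str]]) -> Dict[str, float]:
    """Sum the duration (in seconds) of all segments for each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)


# With a real pipeline output, the segments could be built like this:
# segments = [(turn.start, turn.end, speaker)
#             for turn, _, speaker in diarization.itertracks(yield_label=True)]
# Hypothetical segments are used here instead.
segments = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.5, 4.0, "SPEAKER_01"),
    (4.0, 6.0, "SPEAKER_00"),
]

for speaker, seconds in sorted(speaking_time(segments).items()):
    print(f"{speaker}: {seconds:.1f} s")
```

Keeping the helper independent of pyannote types means it can be reused with segments loaded from any other source, such as a saved annotation file.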

Evaluation Result

Our initial baseline for speaker diarization was the widely used pyannote.audio pipeline, which relies on pretrained speaker embeddings and spectral clustering. On the Insurance dataset, this vanilla approach, with no Thai-specific adaptation, produced a Diarization Error Rate (DER) of 31.3%; most errors came from failing to distinguish overlapping and code-switched speakers.

Building on this baseline, we paired our fine-tuned Thonburian Whisper-Large, used as the acoustic front-end for segmentation, with the Diarizer clustering library. This yielded a marked improvement, reducing DER to 8.13% on the same evaluation set.

Model	Diarization Error Rate (lower is better)
Pyannote Segmentation-3.0	31.3%
Amity Voice-segmentation-01	8.13%
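To put the two figures in the table in proportion, the improvement can be expressed both as an absolute drop in percentage points and as a relative error reduction. A short calculation using only the table values (variable names are illustrative):

```python
baseline_der = 31.3   # Pyannote Segmentation-3.0, in percent
finetuned_der = 8.13  # Amity Voice-segmentation-01, in percent

# Absolute drop in percentage points and reduction relative to the baseline.
absolute_drop = baseline_der - finetuned_der
relative_reduction = absolute_drop / baseline_der * 100

print(f"absolute drop: {absolute_drop:.2f} points")       # 23.17 points
print(f"relative reduction: {relative_reduction:.1f}%")   # 74.0%
```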

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 8.5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
num_epochs: 20.0
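For intuition about the cosine scheduler, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (none is listed above) and a decay to zero over the full run; the total of 1080 steps is taken from the final row of the training results table. The function name `cosine_lr` is hypothetical and this is not the training code itself.

```python
import math

BASE_LR = 8.5e-05    # learning_rate from the hyperparameters
TOTAL_STEPS = 1080   # final step in the training results table (20 epochs x 54 steps)


def cosine_lr(step: int, base_lr: float = BASE_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Half-cosine decay from base_lr to 0, assuming no warmup phase."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of training.
for step in (0, 540, 1080):
    print(f"step {step:4d}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the rate falls to half its initial value at step 540 and approaches zero by the last step, which is consistent with the validation metrics flattening out in the final epochs.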

Training results

Training Loss	Epoch	Step	Validation Loss	Model Preparation Time	Der	False Alarm	Missed Detection	Confusion
0.3728	1.0	54	0.1171	0.0025	0.0584	0.0207	0.0374	0.0003
0.3746	2.0	108	0.1112	0.0025	0.0522	0.0248	0.0272	0.0003
0.3479	3.0	162	0.1093	0.0025	0.0520	0.0258	0.0261	0.0
0.3252	4.0	216	0.1075	0.0025	0.0541	0.0285	0.0256	0.0
0.3675	5.0	270	0.1072	0.0025	0.0530	0.0291	0.0240	0.0
0.3123	6.0	324	0.1063	0.0025	0.0495	0.0288	0.0207	0.0
0.3033	7.0	378	0.1056	0.0025	0.0495	0.0285	0.0210	0.0
0.3105	8.0	432	0.1052	0.0025	0.0490	0.0280	0.0210	0.0
0.322	9.0	486	0.1045	0.0025	0.0482	0.0283	0.0199	0.0
0.3368	10.0	540	0.1040	0.0025	0.0485	0.0280	0.0205	0.0
0.2956	11.0	594	0.1039	0.0025	0.0476	0.0280	0.0197	0.0
0.2914	12.0	648	0.1041	0.0025	0.0476	0.0280	0.0197	0.0
0.2812	13.0	702	0.1039	0.0025	0.0476	0.0280	0.0197	0.0
0.3048	14.0	756	0.1040	0.0025	0.0474	0.0280	0.0194	0.0
0.283	15.0	810	0.1039	0.0025	0.0471	0.0280	0.0191	0.0
0.2952	16.0	864	0.1038	0.0025	0.0468	0.0280	0.0188	0.0
0.3061	17.0	918	0.1038	0.0025	0.0468	0.0280	0.0188	0.0
0.3122	18.0	972	0.1038	0.0025	0.0468	0.0280	0.0188	0.0
0.2941	19.0	1026	0.1038	0.0025	0.0468	0.0280	0.0188	0.0
0.2782	20.0	1080	0.1038	0.0025	0.0468	0.0280	0.0188	0.0

Framework versions

Transformers 4.51.3
Pytorch 2.6.0+cu124
Datasets 3.5.0
Tokenizers 0.21.1
