Community-1 speaker diarization

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.

  • stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
  • audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
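
For reference, this preprocessing is roughly equivalent to the following torchaudio sketch (the pipeline already does it internally; it is shown here only for illustration):

import torchaudio

# load audio as a (channel, time) tensor
waveform, sample_rate = torchaudio.load("audio.wav")

# downmix to mono by averaging channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# resample to 16kHz if needed
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000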

The main improvements brought by Community-1 are:

  • improved speaker assignment and counting
  • simpler reconciliation with transcription timestamps, thanks to the new exclusive speaker diarization
  • easy offline use (i.e. without internet connection)
  • optional hosting on the pyannoteAI cloud

Setup

  1. Install the library: pip install pyannote.audio
  2. Accept the pyannote/speaker-diarization-community-1 user conditions on Hugging Face
  3. Create an access token at hf.co/settings/tokens

Quick start

# download the pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1", 
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization 
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")

Benchmark

Out of the box, Community-1 outperforms the legacy speaker-diarization-3.1 pipeline on almost every benchmark below.

We report diarization error rates (in %) on a large collection of academic benchmarks (fully automatic processing, no forgiveness collar, and no skipping of overlapping speech).

Benchmark (last updated in 2025-09)   legacy (3.1)   community-1   precision-2
AISHELL-4                                     12.2          11.7          11.4
AliMeeting (channel 1)                        24.5          20.3          15.2
AMI (IHM)                                     18.8          17.0          12.9
AMI (SDM)                                     22.7          19.9          15.6
AVA-AVD                                       49.7          44.6          37.1
CALLHOME (part 2)                             28.5          26.7          16.6
DIHARD 3 (full)                               21.4          20.2          14.7
Ego4D (dev.)                                  51.2          46.8          39.0
MSDWild                                       25.4          22.8          17.3
RAMC                                          22.2          20.8          10.5
REPERE (phase 2)                               7.9           8.9           7.4
VoxConverse (v0.3)                            11.2          11.2           8.5

The Precision-2 model is even more accurate (see the last column of the table above) and can be tested as follows:

  1. Create an API key on the pyannoteAI dashboard (free credits included)
  2. Change one line of code:
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-    "pyannote/speaker-diarization-community-1", token="{huggingface-token}")
+    "pyannote/speaker-diarization-precision-2", token="{pyannoteAI-api-key}")
diarization = pipeline("audio.wav")  # runs on pyannoteAI servers

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))
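
If the same code may run on machines without a GPU, a minimal sketch that falls back to CPU when CUDA is unavailable:

import torch

# use the GPU when available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)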

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

output = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

output = pipeline("audio.wav", min_speakers=2, max_speakers=5)

Exclusive speaker diarization

On top of the regular speaker diarization, the Community-1 pretrained pipeline returns a new exclusive speaker diarization, available as output.exclusive_speaker_diarization.

This feature is backported from our latest commercial model. It simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes less precise) transcription timestamps.
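
Since speech turns do not overlap in the exclusive output (hence the name), each transcribed word can be assigned to exactly one speaker. A minimal sketch, assuming the exclusive output iterates like the regular one in the quick start above:

# print the exclusive speaker diarization
for turn, speaker in output.exclusive_speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")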

Offline use

  1. In the terminal, download the pipeline to disk:
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# create a directory on disk
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
  2. In Python, use the pipeline without an internet connection:
# load pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')

# run the pipeline locally on your computer
output = pipeline("audio.wav")

Citations

  1. Speaker segmentation model
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
}
  2. Speaker embedding model
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
  3. Speaker clustering
@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  year={2022},
  journal={Computer Speech \& Language},
}

Acknowledgment

Training and tuning were made possible thanks to GENCI, on the Jean Zay supercomputer.
