Smart Turn v2
Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model that tells you whether a speaker has finished their turn by analysing the raw waveform, not the transcript.
Compared with v1 it is:
- Multilingual – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
- 6 × smaller – ≈ 360 MB vs. 2.3 GB.
- 3 × faster – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.
Links
- Blog post: Smart Turn v2
- GitHub repo with training and inference code
Intended use & task
Use‑case | Why this model helps |
---|---|
Voice agents / chatbots | Wait to reply until the user has actually finished speaking. |
Real‑time transcription + TTS | Avoid “double‑talk” by triggering TTS only when the user turn ends. |
Call‑centre assist & analytics | Accurate segmentation for diarisation and sentiment pipelines. |
Any project needing semantic VAD | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD. |
The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
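In a voice agent, that probability is typically compared against the 0.5 threshold before handing the turn to the assistant. The snippet below is a minimal sketch of that decision; `turn_probability` is a hypothetical stand-in for whatever inference entry point you use (see the GitHub repo for the real code).

```python
COMPLETE_THRESHOLD = 0.5  # probabilities >= 0.5 mean the user has finished speaking

def should_respond(audio_window) -> bool:
    """Gate the agent's reply on the semantic end-of-turn decision."""
    prob = turn_probability(audio_window)  # hypothetical: returns a float in [0, 1]
    return prob >= COMPLETE_THRESHOLD
```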
Model architecture
- Backbone: wav2vec2 encoder
- Head: shallow linear classifier
- Params: 94.8 M (float32)
- Checkpoint: 360 MB Safetensors (compressed)

The wav2vec2 + linear configuration outperformed LSTM and deeper transformer variants during ablation studies.
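For illustration, here is a minimal PyTorch sketch of that architecture, assuming a standard Hugging Face wav2vec2 backbone, mean pooling over frames, and a sigmoid output. The released checkpoint's exact pooling and head layout may differ, so treat this as a sketch rather than the official implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnClassifier(nn.Module):
    """Sketch: wav2vec2 encoder + shallow linear head producing an end-of-turn probability."""

    def __init__(self, backbone_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone_name)   # wav2vec2 backbone
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)     # shallow linear classifier

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.encoder(waveform).last_hidden_state             # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                                   # mean pooling (assumption)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)           # probability turn is complete
```

Keeping the head shallow leaves nearly all capacity in the pretrained acoustic encoder, which is consistent with the ablation result quoted above.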
Training data
Source | Type | Languages |
---|---|---|
human_5_all | Human‑recorded | EN |
human_convcollector_1 | Human‑recorded | EN |
rime_2 | Synthetic (Rime) | EN |
orpheus_midfiller_1 | Synthetic (Orpheus) | EN |
orpheus_grammar_1 | Synthetic (Orpheus) | EN |
orpheus_endfiller_1 | Synthetic (Orpheus) | EN |
chirp3_1 | Synthetic (Google Chirp3 TTS) | 14 langs |
- Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
- Filler‑word lists per language (e.g., “um”, “えーと”) were built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech (see the sketch below).
All audio/text pairs are released on the pipecat‑ai/datasets hub.
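As a rough illustration of that injection step, the sketch below inserts a language-specific filler word near the end of a text sample. The real pipeline operates on audio/text pairs with much larger, model-generated filler lists; the `FILLERS` dictionary and `inject_filler` helper here are purely hypothetical.

```python
import random

# Hypothetical per-language filler lists (the real lists were built with Claude & GPT-o3).
FILLERS = {"en": ["um", "uh", "you know"], "ja": ["えーと", "あの"]}

def inject_filler(sentence: str, lang: str = "en") -> str:
    """Insert a filler word near the end of a sentence to mimic an interrupted thought."""
    words = sentence.split()
    pos = max(0, len(words) - random.randint(1, 2))  # a position close to the sentence end
    return " ".join(words[:pos] + [random.choice(FILLERS[lang])] + words[pos:])

# inject_filler("I think we should move the meeting to Friday")
# -> "I think we should move the meeting to um Friday"
```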
Evaluation & performance
Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)
Lang | Acc % | Lang | Acc % |
---|---|---|---|
EN | 94.3 | IT | 94.4 |
FR | 95.5 | KO | 95.5 |
ES | 92.1 | PT | 95.5 |
DE | 95.8 | TR | 96.8 |
NL | 96.7 | PL | 94.6 |
RU | 93.0 | HI | 91.2 |
ZH | 87.2 | – | – |
Human English benchmark (human_5_all): 99 % accuracy.
Inference latency for 8 s audio
Device | Time |
---|---|
NVIDIA L40S | 12 ms |
NVIDIA A100 | 19 ms |
NVIDIA T4 (AWS g4dn.xlarge) | 75 ms |
16‑core x86_64 CPU (Modal) | 410 ms |
How to use
Please see the blog post and GitHub repo for more information on using the model, either standalone or with Pipecat.
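As a rough starting point, the sketch below shows one way to prepare an audio clip for the model: downmix to mono, resample to 16 kHz (the rate wav2vec2 expects), and keep at most the most recent 8 s. `TurnClassifier` refers to the illustrative class in the architecture sketch above, not the official API; the repo's inference code is the authoritative reference.

```python
import torch
import torchaudio

MAX_SECONDS = 8          # the model analyses up to 8 s of audio
SAMPLE_RATE = 16_000     # wav2vec2 expects 16 kHz mono input

def load_clip(path: str) -> torch.Tensor:
    """Load an audio file and normalise it to a mono 16 kHz clip of at most 8 s."""
    waveform, sr = torchaudio.load(path)                       # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)               # downmix to mono
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    return waveform[:, -MAX_SECONDS * SAMPLE_RATE:]             # keep the most recent 8 s

# model = TurnClassifier()                  # sketch from the architecture section above
# prob = model(load_clip("user_turn.wav"))  # single probability per clip
# print("turn complete" if prob.item() >= 0.5 else "still speaking")
```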