# Smart Turn v2
Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model that tells you whether a speaker has finished their turn by analysing the raw waveform, not the transcript.
Compared with v1, it is:
- Multilingual – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
- 6 × smaller – ≈ 360 MB vs. 2.3 GB.
- 3 × faster – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.
## Intended use & task
| Use case | Why this model helps |
|---|---|
| Voice agents / chatbots | Wait to reply until the user has actually finished speaking. |
| Real-time transcription + TTS | Avoid “double-talk” by triggering TTS only when the user's turn ends. |
| Call-centre assist & analytics | Accurate segmentation for diarisation and sentiment pipelines. |
| Any project needing semantic VAD | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy-based VAD. |
The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
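As a trivial sketch of that decision rule (`end_of_turn_prob` is a hypothetical variable name for illustration, not part of the model API):

```python
# Turn the model's completion probability into a binary decision.
def is_turn_complete(end_of_turn_prob: float, threshold: float = 0.5) -> bool:
    return end_of_turn_prob >= threshold

print(is_turn_complete(0.87))  # True  -> safe to start replying
print(is_turn_complete(0.31))  # False -> keep listening
```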
## Model architecture
- Backbone: `wav2vec2` encoder
- Head: shallow linear classifier
- Params: 94.8 M (float32)
- Checkpoint: 360 MB Safetensors (compressed)

The `wav2vec2` + linear configuration outperformed LSTM and deeper transformer variants during ablation studies.
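A minimal PyTorch sketch of this shape (an assumption for illustration: the head here is a single linear layer over mean-pooled encoder states, which may not match the released checkpoint's exact pooling or head layout):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnClassifier(nn.Module):
    """wav2vec2 encoder + shallow linear head (illustrative, not the released weights)."""

    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.encoder(waveform).last_hidden_state    # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                          # mean-pool over time
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # P(turn complete)

model = TurnClassifier()
prob = model(torch.randn(1, 16_000 * 8))  # 8 s of (random) audio -> one probability
```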
## Training data
| Source | Type | Split | Languages |
|---|---|---|---|
| `human_5_all` | Human-recorded | Train / Dev / Test | EN |
| `chirp3_1` | Synthetic (Google Chirp3 TTS) | Train / Dev / Test | 14 langs |
- Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
- Filler‑word lists per language (e.g., “um”, “えーと”) built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech.
All audio/text pairs are released on the pipecat‑ai/datasets hub.
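To make the filler-injection step concrete, here is a hedged sketch (the filler lists and truncation heuristic below are invented for illustration; the project's actual per-language lists were LLM-built):

```python
import random

# Hypothetical per-language filler lists (stand-ins for the LLM-built ones).
FILLERS = {"en": ["um", "uh", "you know"], "ja": ["えーと", "あの"]}

def inject_filler(sentence: str, lang: str, rng: random.Random) -> str:
    """Truncate near the sentence end and append a filler to simulate an unfinished turn."""
    words = sentence.rstrip(".!?").split()
    cut = rng.randint(max(1, len(words) - 3), len(words))  # keep most of the sentence
    return " ".join(words[:cut] + [rng.choice(FILLERS[lang])])

rng = random.Random(0)
print(inject_filler("I was thinking we could meet tomorrow afternoon.", "en", rng))
```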
## Evaluation & performance
Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete):

| Lang | Acc % | Lang | Acc % |
|---|---|---|---|
| EN | 94.3 | IT | 94.4 |
| FR | 95.5 | KO | 95.5 |
| ES | 92.1 | PT | 95.5 |
| DE | 95.8 | TR | 96.8 |
| NL | 96.7 | PL | 94.6 |
| RU | 93.0 | HI | 91.2 |
| ZH | 87.2 | – | – |
Human English benchmark (`human_5_all`): 99 % accuracy.
Inference latency for 8 s of audio:

| Device | Time |
|---|---|
| NVIDIA L40S | 12 ms |
| NVIDIA A100 | 19 ms |
| NVIDIA T4 (AWS g4dn.xlarge) | 75 ms |
| 16-core x86 CPU (Modal) | 410 ms |
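The exact benchmark harness is not published; a simple wall-clock sketch like the following (a warm-up pass plus averaged timing; `device=0` assumes a GPU is present) gives comparable per-call numbers:

```python
import time
import numpy as np
from transformers import pipeline

pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
    device=0,  # first GPU; omit for CPU
)

audio = np.random.randn(16_000 * 8).astype(np.float32)  # 8 s of dummy 16 kHz audio
pipe(audio)  # warm-up: model load, CUDA context, first-call overhead

t0 = time.perf_counter()
runs = 20
for _ in range(runs):
    pipe(audio)
print(f"mean latency: {(time.perf_counter() - t0) / runs * 1000:.1f} ms")
```

Note that this times the whole pipeline (feature extraction plus pre/post-processing), so the numbers may run higher than raw forward-pass figures.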
## How to use – quick start
```python
from transformers import pipeline
import soundfile as sf

pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
)

# The model expects mono 16 kHz audio as a raw waveform.
speech, sr = sf.read("user_utterance.wav")
if sr != 16_000:
    raise ValueError("Resample to 16 kHz")

result = pipe(speech, top_k=None)[0]  # top_k=None returns all labels; [0] is the best
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → the user has finished speaking
```
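If the input is not already 16 kHz, you can resample instead of raising. A minimal sketch using `librosa` (the choice of library is an assumption; `torchaudio` works equally well), reusing the `pipe` defined above:

```python
import librosa

# librosa.load resamples on read and returns mono float32 at the target rate.
speech, sr = librosa.load("user_utterance.wav", sr=16_000)
result = pipe(speech, top_k=None)[0]
```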