Smart Turn v2

Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model that tells you whether a speaker has finished their turn by analysing the raw waveform, not the transcript.
Compared with v1 it is:

  • Multilingual – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
  • 6 × smaller – ≈ 360 MB vs. 2.3 GB.
  • 3 × faster – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.

Intended use & task

Typical use‑cases and why the model helps:

  • Voice agents / chatbots – wait to reply until the user has actually finished speaking.
  • Real‑time transcription + TTS – avoid “double‑talk” by triggering TTS only when the user's turn ends.
  • Call‑centre assist & analytics – accurate segmentation for diarisation and sentiment pipelines.
  • Any project needing semantic VAD – detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues that classic energy‑based VAD ignores.

The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
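
As a minimal sketch of the decision rule (the function name and constant are illustrative, not part of the model API):

# Illustrative decision rule: the model emits P(turn complete).
COMPLETION_THRESHOLD = 0.5  # value from the model card; tune per application

def is_turn_complete(prob: float) -> bool:
    """Map the model's output probability to a turn-taking decision."""
    return prob >= COMPLETION_THRESHOLD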

Model architecture

  • Backbone : wav2vec2 encoder
  • Head     : shallow linear classifier
  • Params   : 94.8 M (float32)
  • Checkpoint: 360 MB Safetensors (compressed)
    The wav2vec2 + linear configuration outperformed LSTM and deeper transformer variants during ablation studies.
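
As a rough illustration, here is a minimal PyTorch sketch of this shape, assuming the Hugging Face Wav2Vec2Model as the backbone (the pooling strategy and head details are assumptions, not the released training code):

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnClassifier(nn.Module):
    """wav2vec2 encoder + shallow linear head (illustrative sketch)."""

    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, samples) of raw 16 kHz waveform
        frames = self.encoder(input_values).last_hidden_state
        pooled = frames.mean(dim=1)                           # mean-pool over time
        return torch.sigmoid(self.head(pooled)).squeeze(-1)   # P(turn complete)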

Training data

  Source       Type                           Split               Languages
  human_5_all  Human‑recorded                 Train / Dev / Test  EN
  chirp3_1     Synthetic (Google Chirp3 TTS)  Train / Dev / Test  14 langs
  • Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
  • Filler‑word lists per language (e.g., “um”, “えーと”) built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech.
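
A hypothetical sketch of that injection step (the filler lists and truncation heuristic below are made up for illustration; the real per-language lists were LLM‑generated):

import random

# Illustrative per-language filler lists; the released data used LLM-built lists.
FILLERS = {
    "en": ["um", "uh", "you know"],
    "ja": ["えーと", "あの"],
}

def inject_filler(sentence: str, lang: str, rng: random.Random) -> str:
    """Truncate near the sentence end and append a filler, simulating an interrupted turn."""
    words = sentence.rstrip(".!?").split()
    cut = rng.randint(max(1, len(words) - 3), len(words))  # cut point near the end
    return " ".join(words[:cut]) + " " + rng.choice(FILLERS[lang])

rng = random.Random(0)
print(inject_filler("I was thinking we could go to the park.", "en", rng))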

All audio/text pairs are released on the pipecat‑ai/datasets hub.

Evaluation & performance

Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)

  Lang  Acc %    Lang  Acc %
  EN    94.3     IT    94.4
  FR    95.5     KO    95.5
  ES    92.1     PT    95.5
  DE    95.8     TR    96.8
  NL    96.7     PL    94.6
  RU    93.0     HI    91.2
  ZH    87.2

Human English benchmark (human_5_all): 99 % accuracy.

Inference latency for 8 s audio

  Device                        Time
  NVIDIA L40S                    12 ms
  NVIDIA A100                    19 ms
  NVIDIA T4 (AWS g4dn.xlarge)    75 ms
  16‑core x86 CPU (Modal)       410 ms
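
These figures come from the model authors; a rough way to reproduce the measurement yourself is to time the pipeline call from the quick start below (the warm-up and timing choices here are illustrative):

import time
import numpy as np
from transformers import pipeline

pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
)

audio = np.zeros(8 * 16_000, dtype=np.float32)  # 8 s of 16 kHz audio

pipe(audio)  # warm-up: excludes model load and first-run overhead

start = time.perf_counter()
pipe(audio)
print(f"{(time.perf_counter() - start) * 1e3:.1f} ms")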

How to use – quick start

from transformers import pipeline
import soundfile as sf

# smart-turn-v2 reuses the wav2vec2-base feature extractor for preprocessing.
pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
)

# The model expects mono 16 kHz float32 audio.
speech, sr = sf.read("user_utterance.wav", dtype="float32")
if sr != 16_000:
    raise ValueError("Resample to 16 kHz first")

# top_k=None returns all labels; [0] is the highest-scoring one.
result = pipe(speech, top_k=None)[0]
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → the user has finished speaking
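
If your file isn't already 16 kHz, resample it before calling the pipeline; one option is librosa (any resampler works, librosa is just an example):

import librosa

# librosa resamples on load when a target sr is given
speech, _ = librosa.load("user_utterance.wav", sr=16_000)
result = pipe(speech, top_k=None)[0]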