Model Overview
Description:
Parakeet-Realtime-EOU-120m-v1 is a streaming speech recognition model that also performs end-of-utterance (EOU) detection. It achieves low latency (80–160 ms) and signals EOU by emitting an <EOU> token at the end of each utterance. The model supports only English and does not output punctuation or capitalization.
This model is designed for use in voice AI agent pipelines (e.g., NeMo Voice Agent).
This model is ready for commercial/non-commercial use.
License/Terms of Use
Model Architecture:
Architecture Type: FastConformer-RNNT [1]
Network Architecture: cache-aware streaming FastConformer [2] with 17 encoder layers (attention context = [70,1]; see the latency sketch below) and an RNNT decoder.
Number of model parameters: 120M
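For intuition, a back-of-envelope sketch of how this attention context maps to latency, assuming the standard FastConformer 8x subsampling of 10 ms feature frames (an assumption; the frame stride is not stated in this card):

```python
# Hedged back-of-envelope sketch; assumes an 80 ms encoder frame stride
# (8x subsampling of 10 ms features), which is typical for FastConformer.
frame_ms = 10 * 8          # assumed encoder frame stride in ms
left, right = 70, 1        # attention context from the architecture line above

print(f"left (history) context: ~{left * frame_ms} ms")      # ~5600 ms
print(f"right (lookahead) context: ~{right * frame_ms} ms")  # ~80 ms, consistent with the 80-160 ms latency
```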
Input:
Input Type(s): Audio
Input Format: Audio waveform
Input Parameters: 1-Dimensional
Other Properties Related to Input: Single-channel audio at a 16 kHz sampling rate; a minimum duration of 160 ms is required.
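For illustration, a minimal sketch of preparing conforming input, assuming the librosa package is installed (any resampling library works; the file name is hypothetical):

```python
# Hedged sketch: load arbitrary audio as mono 16 kHz, assuming librosa.
import librosa

audio, sr = librosa.load("input.wav", sr=16000, mono=True)  # hypothetical file

# The model requires at least 160 ms of audio (2560 samples at 16 kHz).
assert len(audio) >= int(0.160 * sr), "audio is shorter than the 160 ms minimum"
```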
Output:
Output Type(s): Text with optional <EOU> token (e.g., "what is your name<EOU>")
Output Format: String
Output Parameters: 1-Dimensional
Other Properties Related to Output: The output text may be empty if the input audio does not contain any speech.
Reference(s):
[1] Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition
[2] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
[3] NVIDIA NeMo Toolkit
How to use this model
Streaming usage with NeMo Voice Agent
This model is primarily designed for use in voice AI agents under streaming settings. Please refer to NeMo Voice Agent for examples of how to set up a voice agent with 80 ms ASR latency.
To use this model in NeMo Voice Agent, set this in the server config YAML:

```yaml
stt:
  type: nemo
  model: "nvidia/parakeet_realtime_eou_120m-v1"
```
Offline usage
You will need to install NVIDIA NeMo [3]. We recommend you install it after you've installed the latest PyTorch version.
```bash
pip install -U nemo_toolkit['asr']
```
The model can then be used in the offline setting as shown below.
Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet_realtime_eou_120m-v1")
```
Transcribing using Python
First, let's get a sample:

```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
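Since the transcript may end with the <EOU> token described above, here is a small sketch of separating it from the text (plain string handling; not an official API):

```python
# Split the end-of-utterance marker off the transcript.
text = output[0].text
ended = text.endswith("<EOU>")            # True if the model signaled end of utterance
clean_text = text.removesuffix("<EOU>").strip()
print(clean_text, "| utterance complete:", ended)
```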
Software Integration:
Runtime Engine(s):
- NeMo 2.5.3+
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta
Preferred/Supported Operating System(s):
- Linux
Model Version(s):
- parakeet_realtime_eou_120m-v1
Training, Testing, and Evaluation Datasets:
Training Dataset:
- AMI
- DialogStudio (subset from the task-oriented domain, with a commercial license)
- Granary
- Google Speech Commands
- LibriTTS
- 10,000 hours from human-transcribed NeMo ASR Set 3.0, including:
  - LibriSpeech (960 hours)
  - Fisher Corpus
  - National Speech Corpus Part 1
  - VCTK
  - Europarl-ASR
  - Multilingual LibriSpeech
  - Mozilla Common Voice (v7.0)
**Data Collection Method**
- [Hybrid: Human, Synthetic] - Most audio is human-recorded, but some is generated by TTS models with a commercial license.

**Labeling Method**
- [Hybrid: Human, Synthetic] - Some transcripts are automatically generated by automatic speech recognition (ASR) models, while others are manually labeled.
Evaluation Dataset:
- HuggingFace ASR Leaderboard
  - AMI
  - Earnings22
  - Gigaspeech
  - LS-test-clean
  - LS-test-other
  - SPGI
  - Tedlium
  - Voxpopuli
- DialogStudio (subset from the task-oriented domain, with a commercial license)
**Data Collection Method**
- [Hybrid: Human, Synthetic] - Most audio is human-recorded, but some is generated by TTS models with a commercial license.

**Labeling Method**
- [Hybrid: Human, Synthetic] - Some transcripts are automatically generated by ASR models, while others are manually labeled.
Benchmark Score
Speech Recognition (Word Error Rate)
Word error rate (WER) on the HuggingFace OpenASR leaderboard, measured in the 160 ms streaming setting. Text is normalized by this normalizer before calculating the metrics.
| Metric | Average | AMI | Earnings22 | Gigaspeech | LS-test-clean | LS-test-other | SPGI | Tedlium | Voxpopuli |
|---|---|---|---|---|---|---|---|---|---|
| WER (%) | 9.30 | 15.62 | 15.76 | 13.31 | 3.61 | 7.79 | 3.79 | 5.48 | 9.07 |
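As an illustration of how such a score is computed, a hedged sketch using the jiwer package (jiwer, the file name, and the reference text are assumptions; simple lowercasing stands in for the leaderboard's normalizer):

```python
# Illustrative WER computation; assumes `pip install jiwer` and the
# asr_model instantiated earlier. File name and reference are hypothetical.
import jiwer

reference = ["what is your name"]  # ground-truth transcript
hypothesis = [
    asr_model.transcribe(["sample.wav"])[0].text.removesuffix("<EOU>").strip()
]

# Lowercase both sides as a stand-in for the leaderboard's text normalizer.
wer = jiwer.wer([r.lower() for r in reference], [h.lower() for h in hypothesis])
print(f"WER: {wer * 100:.2f}%")
```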
End-of-Utterance Detection (Latency)
The latency metrics are evaluated on TTS-generated audio from DialogStudio, with a 3-second silence appended to each sample. Actual performance in real-world scenarios will vary with the acoustic environment, accents, etc.
| Percentile | Latency |
|---|---|
| 50% | 160ms |
| 90% | 280ms |
| 95% | 320ms |
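For reference, here is a sketch of deriving such percentiles from per-utterance measurements, where each latency is the time from the true end of speech to the emission of <EOU> (the values below are placeholders, not the evaluation data):

```python
import numpy as np

# Placeholder per-utterance EOU latencies in seconds (emission time of <EOU>
# minus ground-truth end of speech); replace with real measurements.
latencies = np.array([0.16, 0.24, 0.08, 0.32, 0.16])

for p in (50, 90, 95):
    print(f"p{p}: {np.percentile(latencies, p) * 1000:.0f} ms")
```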
Inference:
Acceleration Engine: CUDA
Test Hardware:
- NVIDIA V100
- NVIDIA A100
- NVIDIA A6000
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.