Model Overview

Description:

Parakeet-Realtime-EOU-120m-v1 is a streaming speech recognition model that also performs end-of-utterance (EOU) detection. It achieves low latency (80–160 ms) and signals EOU by emitting an <EOU> token at the end of each utterance. The model supports English only and does not output punctuation or capitalization.

This model is designed for use in voice AI agent pipelines (e.g., NeMo Voice Agent).

This model is ready for commercial/non-commercial use.

License/Terms of Use

NVIDIA Open Model License

Model Architecture:

Architecture Type: FastConformer-RNNT [1]

Network Architecture: cache-aware streaming FastConformer [2] with 17 encoder layers (attention context = [70,1]) and RNNT decoder.

Number of model parameters: 120M

Input:

Input Type(s): Audio
Input Format: Audio waveform
Input Parameters: 1-Dimensional
Other Properties Related to Input: Single-channel audio at a 16 kHz sampling rate; a minimum duration of 160 ms is required. A preprocessing sketch is shown below.
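
A minimal preprocessing sketch (assuming the librosa and soundfile packages, which are not requirements of this model) to convert arbitrary audio to the expected format:

import librosa
import soundfile as sf

# Downmix to mono and resample to 16 kHz, as required by the model.
audio, sr = librosa.load("input.wav", sr=16000, mono=True)

# The model requires at least 160 ms of audio.
assert len(audio) >= int(0.160 * sr), "audio must be at least 160 ms long"

sf.write("input_16k_mono.wav", audio, sr)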

Output:

Output Type(s): Text with optional <EOU> token (e.g., "what is your name<EOU>")
Output Format: String
Output Parameters: 1-Dimensional
Other Properties Related to Output: The output text may be empty if the input audio does not contain any speech.
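
For illustration, a minimal sketch of how downstream code might use the marker (the strings here are examples, not actual model output):

text = "what is your name<EOU>"
if text.endswith("<EOU>"):
    utterance = text[: -len("<EOU>")].strip()
    print(f"utterance complete: {utterance!r}")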

Reference(s):

[1] Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition
[2] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
[3] NVIDIA NeMo Toolkit

How to use this model

Streaming usage with NeMo Voice Agent

This model is primarily designed for use in voice AI agents under streaming settings. Please refer to NeMo Voice Agent for examples of how to set up a voice agent with 80 ms ASR latency.

To use this model in NeMo Voice Agent, set this in the server config yaml:

stt:
  type: nemo
  model: "nvidia/parakeet_realtime_eou_120m-v1"

Offline usage

You will need to install NVIDIA NeMo [3]. We recommend you install it after you've installed the latest PyTorch version.

pip install -U nemo_toolkit['asr']

The model can then be used in the offline setting shown below.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet_realtime_eou_120m-v1")
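
Optionally (a generic PyTorch step, not specific to this model), move the model to a GPU when one is available:

import torch
asr_model = asr_model.to("cuda" if torch.cuda.is_available() else "cpu")
asr_model.eval()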

Transcribing using Python

First, let's get a sample:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
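
The returned transcript may contain one or more <EOU> markers; a minimal post-processing sketch to split it back into individual utterances:

utterances = [u.strip() for u in output[0].text.split("<EOU>") if u.strip()]
print(utterances)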

Software Integration:

Runtime Engine(s):

  • NeMo 2.5.3+

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Volta

Preferred/Supported Operating System(s):

  • Linux

Model Version(s):

  • parakeet_realtime_eou_120m-v1

Training, Testing, and Evaluation Datasets:

Training Dataset:

  • AMI
  • DialogStudio (subset from task-oriented domain with commercial license)
  • Granary
  • Google Speech Commands
  • LibriTTS
  • 10,000 hours from human-transcribed NeMo ASR Set 3.0, including:
    • LibriSpeech (960 hours)
    • Fisher Corpus
    • National Speech Corpus Part 1
    • VCTK
    • Europarl-ASR
    • Multilingual LibriSpeech
    • Mozilla Common Voice (v7.0)

Data Collection Method:

  • [Hybrid: Human, Synthetic] - Most audio is human-recorded; some is generated by TTS models with a commercial license.

Labeling Method:

  • [Hybrid: Human, Synthetic] - Some transcripts are automatically generated by automatic speech recognition (ASR) models, while others are manually labeled.

Evaluation Dataset:

  • HuggingFace ASR Leaderboard
    • AMI
    • Earnings22
    • Gigaspeech
    • LS-test-clean
    • LS-test-other
    • SPGI
    • Tedlium
    • Voxpopuli
  • DialogStudio (subset from task-oriented domain with commercial license)

Data Collection Method:

  • [Hybrid: Human, Synthetic] - Most audio is human-recorded; some is generated by TTS models with a commercial license.

Labeling Method:

  • [Hybrid: Human, Synthetic] - Some transcripts are generated by ASR models, while others are manually labeled.

Benchmark Score

Speech Recognition (Word Error Rate)

Word error rate (WER) on the HuggingFace OpenASR leaderboard, measured in a 160 ms streaming setting. Text is normalized by this normalizer before calculating the metrics.

Metric    Average  AMI    Earnings22  Gigaspeech  LS-test-clean  LS-test-other  SPGI  Tedlium  Voxpopuli
WER (%)   9.30     15.62  15.76       13.31       3.61           7.79           3.79  5.48     9.07

End-of-Utterance Detection (Latency)

The latency metrics are evaluated on TTS-generated audio from DialogStudio, with a 3-second silence appended to each sample. Actual performance in real-world scenarios will vary with acoustic environment, accent, and other factors.

Percentile  Latency
50%         160 ms
90%         280 ms
95%         320 ms
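
For reference, a minimal sketch (with placeholder values, not the actual measurements) of how such percentiles can be computed with NumPy:

import numpy as np

# EOU latency per test sample, in milliseconds: the time the <EOU> token is
# emitted minus the reference end-of-speech time. Placeholder values only.
latencies_ms = [120.0, 160.0, 210.0, 280.0, 330.0]

for p in (50, 90, 95):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f} ms")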

Inference:

Acceleration Engine: CUDA

Test Hardware:

  • NVIDIA V100
  • NVIDIA A100
  • NVIDIA A6000

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.
