Model Card for Kyutai STT
See also the project page and the GitHub repository.
This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR). Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript, our model starts to output the transcript as soon as a few seconds of audio become available.
Model Details
The model architecture is a Transformer that consumes audio tokenized by Mimi (see the Moshi paper) and outputs text tokens. The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
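As a quick sanity check of these numbers, the sketch below computes how many audio frames and Mimi tokens correspond to a clip of a given length. The frame rate (12.5 Hz) and tokens per frame (32) come from the paragraph above; the function itself is only illustrative.

```python
# Back-of-the-envelope token counts for Mimi-tokenized audio,
# based on the 12.5 Hz frame rate and 32 tokens per frame stated above.
FRAME_RATE_HZ = 12.5      # audio frames per second
TOKENS_PER_FRAME = 32     # Mimi tokens per audio frame

def audio_token_budget(duration_seconds: float) -> tuple[int, int]:
    """Return (num_frames, num_audio_tokens) for a clip of the given length."""
    num_frames = int(duration_seconds * FRAME_RATE_HZ)
    return num_frames, num_frames * TOKENS_PER_FRAME

# Example: a 10-second clip -> 125 frames, 4000 audio tokens.
print(audio_token_budget(10.0))
```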
We release two models:
- kyutai/stt-1b-en_fr, an English and French model with ~1B parameters, a 0.5 second delay, and a semantic VAD.
- kyutai/stt-2.6b-en, an English-only model with ~2.6B parameters and a 2.5 second delay.
Model Description
Kyutai STT is a decoder-only model for streaming speech-to-text. It leverages the multistream architecture of Moshi to model the text stream conditioned on the speech stream. The text stream is shifted with respect to the audio stream, which lets the model predict each text token from audio it has already received.
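A minimal conceptual sketch of this shift is shown below. The pad token and the rounding of the delay to a whole number of frames are assumptions made for illustration; this is not Kyutai's training code.

```python
# Conceptual sketch of the delayed text stream (not the actual training code).
# The text stream is shifted by `delay_s` so that, at frame t, the model has
# already seen the audio needed to predict the corresponding text token.
FRAME_RATE_HZ = 12.5
PAD = "<pad>"  # placeholder token, assumed for illustration

def delay_text_stream(text_tokens: list[str], delay_s: float) -> list[str]:
    delay_frames = round(delay_s * FRAME_RATE_HZ)
    # Prepend padding so text token i lines up with audio frame i + delay_frames.
    return [PAD] * delay_frames + text_tokens

print(delay_text_stream(["hello", "world"], delay_s=0.5))
# ['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'hello', 'world']
```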
- Developed by: Kyutai
- Model type: Streaming Speech-to-Text transcription.
- Language(s) (NLP): English and French for kyutai/stt-1b-en_fr, English for kyutai/stt-2.6b-en
- License: Model weights are licensed under CC-BY 4.0
- Repository: GitHub
Uses
Direct Use
The model can be used for streaming speech-to-text. It is robust to noisy conditions and was found to perform well on audio up to 2 hours long with no additional changes. The model produces transcripts with capitalization and punctuation. The timestamp of each predicted text token can be recovered by subtracting the model's text stream delay (0.5 or 2.5 seconds) from the frame's offset.
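For example, a hedged sketch of recovering a token's approximate timestamp from the frame at which it was emitted, assuming the 12.5 Hz frame rate stated above; the function name is illustrative, not part of the repository's API:

```python
# Recover an approximate token start time from the frame at which the model
# emitted it, by undoing the text-stream delay (0.5 s or 2.5 s).
FRAME_RATE_HZ = 12.5

def token_timestamp(frame_index: int, delay_s: float) -> float:
    """Timestamp (seconds) in the original audio for a token emitted at `frame_index`."""
    return max(0.0, frame_index / FRAME_RATE_HZ - delay_s)

# A token emitted at frame 40 by kyutai/stt-1b-en_fr (0.5 s delay)
# corresponds to roughly 2.7 s into the audio.
print(token_timestamp(40, delay_s=0.5))  # 2.7
```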
How to Get Started with the Model
See the GitHub repository.
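The repository contains the streaming inference scripts. As a minimal starting point, the weights can be fetched from the Hugging Face Hub with huggingface_hub; the repo id below is the 1B model released with this card.

```python
# Download the model weights locally; see the GitHub repository for the
# actual streaming inference examples.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/stt-1b-en_fr")
print(f"Model files downloaded to {local_dir}")
```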
Training Details
Training Data
- Pretraining stage: For both kyutai/stt-2.6b-en and kyutai/stt-1b-en_fr, we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running whisper-timestamped.

For kyutai/stt-2.6b-en:

- Finetuning stage: We then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24000 hours of audio.
- Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio. The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1000 hours), (b) synthesizing dialogs (22000 hours).

For kyutai/stt-1b-en_fr:
- Finetuning stage: We finetune on the Fisher dataset of 2000 hours of English audio, plus proprietary data (1000 hours in English, 600 hours in French).
Compute Infrastructure
Pretraining and finetuning were done with 48 and 16 Nvidia H100 GPUs, respectively.
Model Card Authors
Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez