# Parakeet TDT 0.6B – SALT Multilingual ASR
Fine-tuned nvidia/parakeet-tdt-0.6b-v3 on 10 East African languages from the SALT dataset.
## Model
- Architecture: FastConformer encoder (24 layers, 1024 hidden, 600M params) with hybrid TDT+CTC decoding
- Tokenizer: Merged SentencePiece Unigram (8192 pretrained + 1319 new East African tokens = 9511 total; see the inspection sketch after this list)
- Training: 61 epochs, fp32, single A100, CosineAnnealing (lr=1e-4, 5k warmup steps)
- Base model: nvidia/parakeet-tdt-0.6b-v3 (25 European languages, 660k hours)
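A minimal sketch for inspecting the merged tokenizer with the SentencePiece Python API. The filename `tokenizer/tokenizer.model` is an assumed SentencePiece convention, not confirmed by this card; adjust to the actual file inside the `tokenizer/` directory.

```python
# Sketch: verify the merged vocabulary size (8192 pretrained + 1319 new = 9511)
# and see how a phrase is split into subword pieces.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer/tokenizer.model")  # assumed path
print(sp.get_piece_size())                       # expected: 9511 merged tokens
print(sp.encode("webale nnyo", out_type=str))    # subword pieces for a Luganda phrase
```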
## Languages
LUG (Luganda), ENG (English), ACH (Acholi), LGG (Lugbara), TEO (Ateso), NYN (Runyankole), SWA (Swahili), KIN (Kinyarwanda), MYX (Masaba), XOG (Lusoga)
## Results (TDT decoding, normalized text)
| Language | WER | CER | Samples |
|---|---|---|---|
| ENG | 2.47% | 0.87% | 101 |
| LUG | 16.37% | 3.06% | 103 |
| TEO | 17.50% | 4.85% | 101 |
| ACH | 20.96% | 4.49% | 101 |
| NYN | 28.98% | 5.13% | 103 |
| LGG | 31.62% | 5.73% | 101 |
| XOG | 56.39% | 13.90% | 100 |
| MYX | 59.15% | 14.04% | 98 |
| KIN | 86.91% | 33.18% | 25 |
| SWA | 89.63% | 30.41% | 25 |
| Overall | 46.00% | 12.75% | 858 |
Note: NeMo's internal training validation reports val_wer=0.2230, while the standalone evaluation above shows a higher overall WER; the discrepancy is under investigation (likely differences in the NeMo validation pipeline). The SALT-6 core languages (ENG, LUG, ACH, LGG, TEO, NYN) perform well individually; KIN, SWA, MYX, and XOG had limited training data.
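Since the discrepancy hinges on how WER is computed, here is a minimal sketch of a standalone WER/CER check with `jiwer`. The normalization step (lowercase, strip punctuation) is an assumption, not necessarily the exact normalization behind the Results table.

```python
# Sketch: standalone WER/CER computation with jiwer over normalized text.
import string
import jiwer

def normalize(text: str) -> str:
    # Assumed normalization: lowercase and strip punctuation/whitespace.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

refs = [normalize(r) for r in ["Webale nnyo."]]  # ground-truth transcripts
hyps = [normalize(h) for h in ["webale nyo"]]    # model outputs

print(f"WER: {jiwer.wer(refs, hyps):.4f}")
print(f"CER: {jiwer.cer(refs, hyps):.4f}")
```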
## Training WandB Metrics
- `val_wer` (TDT): 0.2230 at epoch 60 (best)
- `val_wer_ctc`: ~0.35 at convergence
- `train_rnnt_loss`: ~0.8 at convergence
- `train_ctc_loss`: ~2.0 at convergence
## Usage

```python
import nemo.collections.asr as nemo_asr

# Restore the full checkpoint (model + tokenizer + config) from the .nemo archive
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from("parakeet-tdt-salt.nemo")

# Transcribe one or more 16 kHz mono WAV files
transcription = model.transcribe(["audio.wav"])
```
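Since the checkpoint is a hybrid TDT+CTC model, the auxiliary CTC decoder (`val_wer_ctc` above) can also be used. A minimal sketch, assuming NeMo's `change_decoding_strategy` API for hybrid models; verify the signature against your NeMo version.

```python
# Sketch: switch the hybrid model to its auxiliary CTC decoder,
# then transcribe as before.
model.change_decoding_strategy(decoder_type="ctc")
ctc_transcription = model.transcribe(["audio.wav"])
```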
## Files
- `parakeet-tdt-salt.nemo` – Full NeMo checkpoint (model + tokenizer + config)
- `best-epoch60.ckpt` – Best epoch weights (val_wer=0.2230)
- `tokenizer/` – Merged SentencePiece tokenizer files
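The `.ckpt` file is a raw PyTorch Lightning checkpoint. A hedged sketch for loading it directly; `load_from_checkpoint` comes from Lightning and assumes hyperparameters were saved with the checkpoint, so the `.nemo` restore shown above remains the supported path.

```python
# Sketch: load the best-epoch weights from the Lightning checkpoint.
# Prefer restore_from("parakeet-tdt-salt.nemo") when possible.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.load_from_checkpoint("best-epoch60.ckpt")
```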
## Comparison
| Model | Params | Overall WER | Notes |
|---|---|---|---|
| Whisper-large-v3 seq2seq | 1.5B | 20.70% | Baseline |
| Parakeet TDT v3 (this) | 600M | 22.30%* | *NeMo val_wer; standalone eval above: 46.00% |
| MMS-1B CTC + KenLM | 963M | 22.09% | Best CTC |
| MMS-300M CTC + KenLM | 300M | 23.30% | |
| W2V-BERT 2.0 CTC + KenLM | 580M | 24.79% | |
## Citation
Part of the Sunbird AI speech project.