Tags: nemo, asr, speech, tdt, ctc, fastconformer, east-african, salt

Parakeet TDT 0.6B – SALT Multilingual ASR

Fine-tuned nvidia/parakeet-tdt-0.6b-v3 on 10 East African languages from the SALT dataset.

Model

  • Architecture: FastConformer encoder (24 layers, 1024 hidden, 600M params) with hybrid TDT+CTC decoding
  • Tokenizer: Merged SentencePiece Unigram (8192 pretrained + 1319 new East African tokens = 9511 total; see the sketch after this list)
  • Training: 61 epochs, fp32, single A100, CosineAnnealing (lr=1e-4, 5k warmup steps)
  • Base model: nvidia/parakeet-tdt-0.6b-v3 (25 European languages, 660k hours)
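
The merged tokenizer can be sanity-checked directly with sentencepiece. This is a minimal sketch, assuming the merged model is packaged as tokenizer.model inside the tokenizer/ directory listed under Files (the file name is an assumption):

import sentencepiece as spm

# Load the merged SentencePiece model (path is an assumption; see the Files section).
sp = spm.SentencePieceProcessor(model_file="tokenizer/tokenizer.model")

# The merged vocabulary should contain the 8192 pretrained pieces plus the
# 1319 newly added East African pieces, i.e. 9511 in total.
print(sp.get_piece_size())  # expected: 9511

# Tokenize a Luganda phrase to check that the new pieces are actually used.
print(sp.encode("webale nnyo", out_type=str))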

Languages

LUG (Luganda), ENG (English), ACH (Acholi), LGG (Lugbara), TEO (Ateso), NYN (Runyankole), SWA (Swahili), KIN (Kinyarwanda), MYX (Masaba), XOG (Lusoga)

Results (TDT decoding, normalized text)

Language   WER      CER      Samples
ENG         2.47%    0.87%   101
LUG        16.37%    3.06%   103
TEO        17.50%    4.85%   101
ACH        20.96%    4.49%   101
NYN        28.98%    5.13%   103
LGG        31.62%    5.73%   101
MYX        59.15%   14.04%    98
XOG        56.39%   13.90%   100
KIN        86.91%   33.18%    25
SWA        89.63%   30.41%    25
Overall    46.00%   12.75%   858

Note: NeMo's internal validation during training reports val_wer=0.2230, while the standalone evaluation above shows higher WER. The discrepancy is under investigation and most likely stems from differences in the NeMo validation pipeline. The SALT-6 core languages (ENG, LUG, ACH, LGG, TEO, NYN) perform well individually; KIN, SWA, MYX, and XOG had limited training data.
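
For reproducing the standalone numbers, WER/CER can be computed with jiwer after text normalization. A minimal sketch, assuming `references` and `hypotheses` are parallel lists of ground-truth and transcribed text for one language (the exact normalization used for the table above is an assumption):

import string
import jiwer

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace; the exact
    # normalization used for the results table is an assumption.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

# Placeholder data: replace with SALT test references and model transcriptions.
references = ["replace with ground-truth transcripts"]
hypotheses = ["replace with model transcriptions"]

refs = [normalize(r) for r in references]
hyps = [normalize(h) for h in hypotheses]

print(f"WER={jiwer.wer(refs, hyps):.2%}  CER={jiwer.cer(refs, hyps):.2%}")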

Training WandB Metrics

  • val_wer (TDT): 0.2230 at epoch 60 (best)
  • val_wer_ctc: ~0.35 at convergence
  • train_rnnt_loss: ~0.8 at convergence
  • train_ctc_loss: ~2.0 at convergence
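
For reference, the scheduler settings listed under Model (CosineAnnealing, lr=1e-4, 5k warmup steps) map onto a NeMo optimizer config roughly like the sketch below. This is not the exact training config; the optimizer name and min_lr are assumptions, and `model` is the loaded checkpoint from the Usage section:

from omegaconf import OmegaConf

# Hypothetical optimizer/scheduler override mirroring the hyperparameters above.
optim_cfg = OmegaConf.create({
    "name": "adamw",              # assumption: optimizer not stated in this card
    "lr": 1e-4,
    "sched": {
        "name": "CosineAnnealing",
        "warmup_steps": 5000,
        "min_lr": 1e-6,           # assumption
    },
})

# `model` is the EncDecHybridRNNTCTCBPEModel loaded as in the Usage section.
model.setup_optimization(optim_config=optim_cfg)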

Usage

import nemo.collections.asr as nemo_asr

# Load the fine-tuned hybrid TDT+CTC checkpoint (weights, tokenizer, and config).
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from("parakeet-tdt-salt.nemo")

# Transcribe one or more 16 kHz mono audio files.
transcription = model.transcribe(["audio.wav"])
print(transcription)
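
Because the checkpoint is a hybrid TDT+CTC model, the auxiliary CTC decoder can also be used at inference time. A minimal sketch; keyword behavior may differ slightly across NeMo versions:

# Switch the hybrid model to its auxiliary CTC decoder (TDT/RNNT is the default branch).
model.change_decoding_strategy(decoder_type="ctc")
ctc_transcription = model.transcribe(["audio.wav"])

# Switch back to the TDT decoder used for the results table above.
model.change_decoding_strategy(decoder_type="rnnt")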

Files

  • parakeet-tdt-salt.nemo – Full NeMo checkpoint (model + tokenizer + config)
  • best-epoch60.ckpt – Best epoch weights (val_wer=0.2230); see the loading sketch after this list
  • tokenizer/ – Merged SentencePiece tokenizer files
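
The .nemo file is the supported way to load the model (see Usage). If you need to work from the raw Lightning checkpoint instead, something like the following may work, assuming the hyperparameters were saved into the .ckpt:

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

# Assumption: the Lightning checkpoint stores the model hyperparameters, so
# load_from_checkpoint can rebuild the model without a separate config file.
model = EncDecHybridRNNTCTCBPEModel.load_from_checkpoint("best-epoch60.ckpt")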

Comparison

Model                          Params   Overall WER   Notes
Whisper-large-v3 seq2seq       1.5B     20.70%        Baseline
Parakeet TDT v3 (this model)   600M     22.30%*       *NeMo val_wer; standalone eval TBD
MMS-1B CTC + KenLM             963M     22.09%        Best CTC
MMS-300M CTC + KenLM           300M     23.30%
W2V-BERT 2.0 CTC + KenLM       580M     24.79%

Citation

Part of the Sunbird AI speech project.
