Automatic Speech Recognition for Kinyarwanda

Model Description

This model is a fine-tuned version of Wav2Vec2-BERT 2.0 for automatic speech recognition (ASR) in Kinyarwanda. It was trained on the Kinyarwanda ASR Track A dataset covering Health, Government, Finance, Education, and Agriculture domains.

  • Developed by: badrex
  • Model type: Speech Recognition (ASR)
  • Language: Kinyarwanda (rw)
  • License: MIT
  • Finetuned from: facebook/w2v-bert-2.0

Model Sources

Direct Use

The model can be used directly for automatic speech recognition of Kinyarwanda audio:

from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch
import torchaudio

# load model and processor
processor = Wav2Vec2BertProcessor.from_pretrained("badrex/w2v-bert-2.0-kinyarwanda-asr")
model = Wav2Vec2BertForCTC.from_pretrained("badrex/w2v-bert-2.0-kinyarwanda-asr")

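# load audio; torchaudio returns a tensor of shape (channels, samples)
waveform, sample_rate = torchaudio.load("path/to/audio.wav")

# downmix to mono and resample to the rate the processor expects
target_rate = processor.feature_extractor.sampling_rate
waveform = waveform.mean(dim=0)
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# preprocess
inputs = processor(waveform.numpy(), sampling_rate=target_rate, return_tensors="pt")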

# inference
with torch.no_grad():
    logits = model(**inputs).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)

Downstream Use

This model can be used as a foundation for:

  • voice assistants for Kinyarwanda speakers
  • transcription services for Kinyarwanda content
  • accessibility tools for Kinyarwanda-speaking communities
  • research in low-resource speech recognition

Out-of-Scope Use

  • transcribing languages other than Kinyarwanda
  • real-time applications without proper latency testing
  • high-stakes applications without domain-specific validation
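
Latency testing for a real-time use case can start with a real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean the system keeps up with incoming audio. The sketch below is a minimal, hypothetical helper and not part of this repository. `real_time_factor` and `fake_transcribe` are illustrative names, and the stand-in function should be replaced with your own inference call.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    transcribe: Callable[[Sequence[float]], str],
    samples: Sequence[float],
    sample_rate: int = 16000,
    repeats: int = 3,
) -> float:
    """Return the middle observed processing time divided by audio duration."""
    if sample_rate <= 0 or len(samples) == 0 or repeats < 1:
        raise ValueError("need non-empty audio, a positive sample rate, and repeats >= 1")
    audio_seconds = len(samples) / sample_rate
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe(samples)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Use the middle timing so a single slow warm-up run does not dominate.
    return timings[len(timings) // 2] / audio_seconds


if __name__ == "__main__":
    # Stand-in for a real model call (assumption); swap in your own function.
    def fake_transcribe(samples: Sequence[float]) -> str:
        time.sleep(0.01)
        return ""

    one_second_of_silence = [0.0] * 16000
    print(f"RTF: {real_time_factor(fake_transcribe, one_second_of_silence):.3f}")
```

Measure on the target hardware and with audio lengths typical of the application, since RTF can vary with both.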

Bias, Risks, and Limitations

  • Domain bias: primarily trained on formal speech from specific domains (Health, Government, Finance, Education, Agriculture)
  • Accent variation: may not perform well on dialects or accents not represented in training data
  • Audio quality: performance may degrade on noisy or low-quality audio
  • Technical terms: may struggle with specialized vocabulary outside training domains
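
One way to check how these limitations affect a particular use case is to measure word error rate (WER) on a small, manually transcribed sample from the target domain. The following is a minimal sketch of WER as word-level edit distance divided by reference length. The function name and the placeholder strings are illustrative, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real Kinyarwanda text.
    print(word_error_rate("w1 w2 w3 w4", "w1 wX w3 w4"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.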

Training Data

The model was fine-tuned on the Kinyarwanda ASR Track A dataset:

  • Size: ~500 hours of transcribed Kinyarwanda speech
  • Domains: Health, Government, Finance, Education, Agriculture
  • Source: Digital Umuganda (funded by the Gates Foundation)
  • License: CC BY 4.0

Model Architecture

  • Base model: Wav2Vec2-BERT 2.0
  • Architecture: Conformer-based encoder operating on log-mel filterbank features
  • Parameters: ~580M (largely inherited from the base model)
  • Objective: connectionist temporal classification (CTC)
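
With a CTC objective, the model emits one token prediction per audio frame, and greedy decoding turns these into text by merging consecutive repeats and then dropping the blank symbol. The sketch below illustrates that collapse step on a toy vocabulary. The vocabulary, the blank id of 0, and the function name are assumptions made for illustration; in practice the processor's `batch_decode` performs this step using the model's own tokenizer.

```python
from itertools import groupby
from typing import Dict, Sequence


def ctc_greedy_collapse(
    frame_ids: Sequence[int],
    id_to_token: Dict[int, str],
    blank_id: int = 0,
) -> str:
    """Merge consecutive duplicate ids, then remove blanks and join tokens."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[t] for t in merged if t != blank_id)


if __name__ == "__main__":
    # Toy vocabulary (hypothetical); id 0 plays the role of the CTC blank.
    vocab = {0: "<blank>", 1: "a", 2: "b"}
    # The blank between the two runs of 1 keeps them as two separate "a" tokens.
    print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0], vocab))
```

The blank is what lets CTC represent doubled characters: without a blank between them, repeated frame predictions collapse into a single token.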

Compute Infrastructure

Citation

@misc{w2v_bert_kinyarwanda_asr,
  author = {Badr M. Abdullah},
  title = {Adapting Wav2Vec2-BERT 2.0 for Kinyarwanda ASR},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/badrex/w2v-bert-2.0-kinyarwanda-asr}
}

@misc{kinyarwanda_asr_track_a,
  title={Kinyarwanda Automatic Speech Recognition Track A},
  author={Digital Umuganda},
  year={2025},
  url={https://www.kaggle.com/competitions/kinyarwanda-automatic-speech-recognition-track-a}
}

Model Card Contact

For questions or issues, please reach out through the Hugging Face model repository.
