# 🇰🇪 Model Card for RareElf/swahili-wav2vec2-asr

This model is a fine-tuned version of eddiegulay/wav2vec2-large-xlsr-mvc-swahili for automatic speech recognition (ASR) in Swahili, trained on the Swahili subset of the Common Voice 11.0 dataset.
## Model Details

### Model Description
This model leverages the wav2vec2 architecture from Facebook AI, fine-tuned for Swahili ASR. It maps raw speech waveforms sampled at 16 kHz to transcriptions using a CTC (Connectionist Temporal Classification) loss, and it supports real-time transcription for voice-based Swahili applications.
- Developed by: Kevin Obote / RareElf
- Funded by: Internal research at Guild Code
- Shared by: RareElf
- Model type: Automatic Speech Recognition (ASR)
- Language(s) (NLP): Swahili (`sw`)
- License: Apache-2.0
- Finetuned from model: eddiegulay/wav2vec2-large-xlsr-mvc-swahili
### Model Sources
- Repository: https://huggingface.co/RareElf/swahili-wav2vec2-asr
- Paper: Coming soon
- Demo: Coming soon on semasasa.ai
## Uses

### Direct Use
This model can be used for:
- Transcribing Swahili audio for accessibility, journalism, documentation, education, etc.
- Integration into chatbots or voice agents in Swahili.
### Downstream Use
- Can be integrated with translation and sentiment analysis pipelines.
- Useful for fine-tuning on domain-specific Swahili data (e.g., education, healthcare, government).
### Out-of-Scope Use
- Not suitable for noisy, far-field, or multi-speaker environments without preprocessing.
- Not recommended for use in legal, medical, or high-stakes domains without additional validation.
## Bias, Risks, and Limitations
- The model may underperform on underrepresented dialects of Swahili.
- Accents, noisy recordings, and overlapping speech may impact accuracy.
- Reflects the linguistic distribution of Common Voice contributors, which may not be representative of all Swahili speakers.
### Recommendations
- Preprocess noisy audio for best results (see the sketch after this list).
- Fine-tune further on targeted domain data for production use.
- Provide user disclaimers about ASR limitations in live deployments.
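For the first recommendation, a minimal cleanup pass might look like the sketch below. It uses only librosa and NumPy; the file name and the `top_db` silence threshold are illustrative assumptions, not values used for this model.

```python
import librosa
import numpy as np

# Load and resample to the 16 kHz rate the model expects
audio, sr = librosa.load("noisy_sample.wav", sr=16000)

# Trim leading/trailing silence (threshold is an illustrative choice)
audio, _ = librosa.effects.trim(audio, top_db=30)

# Peak-normalize so quiet recordings are not underweighted
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak
```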
## How to Get Started with the Model

Use the code below to get started with the model:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa

model_id = "RareElf/swahili-wav2vec2-asr"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load audio at the 16 kHz sampling rate the model expects
audio, _ = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, return_tensors="pt", sampling_rate=16000).input_values

with torch.no_grad():
    logits = model(inputs).logits

# Greedy CTC decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
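Alternatively, the high-level `pipeline` API wraps the same steps in a single call (decoding an audio file this way assumes `ffmpeg` is installed):

```python
from transformers import pipeline

# The pipeline handles loading, resampling, inference, and decoding
asr = pipeline("automatic-speech-recognition", model="RareElf/swahili-wav2vec2-asr")
print(asr("sample.wav")["text"])
```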
## Training Details

### Training Data

- Dataset: Common Voice 11.0 (Swahili subset)
### Training Procedure

- Fine-tuned using the `Trainer` API from 🤗 Transformers
- Loss: CTC (Connectionist Temporal Classification)
- Optimizer: AdamW
- Precision: `fp16` (mixed precision)
#### Preprocessing

- Resampled audio to 16 kHz
- Normalized transcript text
- Removed empty, corrupted, or misaligned samples (a sketch of these steps follows below)
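A sketch of how these steps could be implemented with 🤗 Datasets; the dataset identifier and `sentence` column follow Common Voice 11.0 conventions, and the punctuation rule is illustrative rather than the exact normalization used for this model.

```python
import re
from datasets import load_dataset, Audio

# Swahili subset of Common Voice 11.0 (gated; requires accepting the terms)
ds = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="train")

# Resample every clip to 16 kHz on the fly
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Normalize transcripts: lowercase and strip punctuation (illustrative rule)
punct = re.compile(r"[,?.!;:\"“”%‘’-]")

def normalize(batch):
    batch["sentence"] = punct.sub("", batch["sentence"]).lower().strip()
    return batch

ds = ds.map(normalize)

# Drop empty or whitespace-only transcripts
ds = ds.filter(lambda ex: len(ex["sentence"]) > 0)
```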
#### Training Hyperparameters

- Training regime (mapped onto `TrainingArguments` in the sketch below):
  - Epochs: 10
  - Batch size: 16
  - Learning rate: 3e-4
  - Warmup steps: 500
  - Weight decay: 0.01
  - Gradient accumulation steps: 2
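As a sketch, these settings map onto 🤗 `TrainingArguments` roughly as follows; `output_dir` is an assumption, and the model, data collator, and datasets would be supplied separately.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; output_dir is an assumption
training_args = TrainingArguments(
    output_dir="swahili-wav2vec2-asr",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,  # mixed precision, as noted above
)
# These arguments would be passed to Trainer together with the model,
# a CTC data collator, and the preprocessed train/eval datasets.
```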
#### Speeds, Sizes, Times
[More Information Needed]
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- Dataset: Held-out subset of Common Voice Swahili
#### Factors
[More Information Needed]
#### Metrics

- WER (Word Error Rate); a computation sketch follows this list
- BLEU (for the translation use case)
- ROUGE (for paraphrase quality)
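For reference, WER can be computed with the 🤗 Evaluate library (which wraps `jiwer` for this metric); the example strings below are illustrative and not taken from the actual test set.

```python
import evaluate

wer_metric = evaluate.load("wer")

# One substitution out of three reference words gives WER ≈ 0.33
wer = wer_metric.compute(
    predictions=["habari za asubuhi"],
    references=["habari ya asubuhi"],
)
print(f"WER: {wer:.2f}")
```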
### Results

| Metric | Score |
|--------|-------|
| WER    | 0.33  |
| BLEU   | 0.44  |
| ROUGE  | 0.66  |
Note: Evaluation scores are being finalized with the full test set.
#### Summary

[More Information Needed]
## Model Examination

- Visualized attention maps suggest the model captures phonetic and acoustic patterns relevant to Swahili.
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in Lacoste et al. (2019).
- Hardware Type: ***
- Hours used: ~10
- Cloud Provider: Google Cloud
- Compute Region: ****
- Carbon Emitted: ~X gCO2eq (estimated via the mlco2 calculator)
## Technical Specifications

### Model Architecture and Objective

- Architecture: Wav2Vec2 large (XLSR-53) with a CTC head
- Objective: Predict character-level transcriptions from 16 kHz audio
### Compute Infrastructure

[More Information Needed]

#### Hardware

- Personal computer: Lenovo ThinkPad T14 Gen 1, 32 GB RAM, 1 TB SSD
#### Software
- Python 3.10
- PyTorch 2.x
- Transformers 4.39.x
- Datasets 2.x
## Citation

BibTeX:
[More Information Needed]
APA:
[More Information Needed]
## Glossary

- ASR: Automatic Speech Recognition
- WER: Word Error Rate
- CTC: Connectionist Temporal Classification
## Model Card Authors
- Kevin Obote / RareElf / Guild Code Team
## Model Card Contact
- Email: [email protected]
- GitHub: Kevin Obote