🇰🇪 Model Card for RareElf/swahili-wav2vec2-asr

This model is a fine-tuned version of eddiegulay/wav2vec2-large-xlsr-mvc-swahili for automatic speech recognition (ASR) in Swahili. It has been trained using the Common Voice 11.0 Swahili dataset.


📋 Model Details

Model Description

This model uses the wav2vec2 architecture from Facebook AI, fine-tuned for Swahili ASR. It maps raw speech waveforms sampled at 16 kHz to transcriptions and was trained with a CTC (Connectionist Temporal Classification) loss. It is intended to support real-time transcription in voice-based Swahili applications.
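As a rough illustration of how a CTC model's per-frame predictions become text, the sketch below applies greedy CTC decoding: consecutive duplicate tokens are merged, then the blank token is dropped. The toy vocabulary and the blank id of 0 are assumptions made for the example; the real vocabulary and blank (pad) id come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into label ids (greedy CTC decoding)."""
    # groupby merges runs of identical ids; blanks are then filtered out.
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Hypothetical toy vocabulary, for illustration only.
vocab = {0: "<pad>", 1: "h", 2: "a", 3: "b", 4: "r", 5: "i"}

# One id per audio frame, as produced by argmax over the logits.
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 2, 4, 4, 5, 0, 0]
decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frames))
print(decoded)
```

A blank between two identical ids is what allows a doubled letter to survive the merge: `[2, 0, 2]` collapses to two labels, while `[2, 2]` collapses to one.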

  • Developed by: Kevin Obote / RareElf
  • Funded by: Internal research at Guild Code
  • Shared by: RareElf
  • Model type: Automatic Speech Recognition (ASR)
  • Language(s) (NLP): Swahili (sw)
  • License: Apache-2.0
  • Finetuned from model: eddiegulay/wav2vec2-large-xlsr-mvc-swahili

Uses

Direct Use

This model can be used for:

  • Transcribing Swahili audio for accessibility, journalism, documentation, education, etc.
  • Integration into chatbots or voice agents in Swahili.

Downstream Use

  • Can be integrated with translation and sentiment analysis pipelines.
  • Useful for fine-tuning on domain-specific Swahili data (e.g., education, healthcare, government).

Out-of-Scope Use

  • Not suitable for noisy, far-field, or multi-speaker environments without preprocessing.
  • Not recommended for use in legal, medical, or high-stakes domains without additional validation.

Bias, Risks, and Limitations

  • The model may underperform on underrepresented dialects of Swahili.
  • Accents, noisy recordings, and overlapping speech may impact accuracy.
  • Reflects the linguistic distribution of Common Voice contributors, which may not be representative of all Swahili speakers.

Recommendations

  • Preprocess noisy audio for best results.
  • Fine-tune further on targeted domain data for production use.
  • Provide user disclaimers about ASR limitations in live deployments.
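One very simple form of preprocessing is sketched below: trimming low-amplitude leading and trailing samples, then peak-normalizing what remains. This is not denoising, and the function name and both thresholds are assumptions chosen for illustration; real deployments would more likely rely on a dedicated audio library.

```python
def trim_and_normalize(samples, threshold=0.02, peak=0.95):
    """Trim quiet edges and rescale so the loudest sample reaches `peak`."""
    # Indices of samples loud enough to count as signal (assumed threshold).
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # nothing above the threshold: treat the clip as silence
    trimmed = samples[loud[0]: loud[-1] + 1]
    scale = peak / max(abs(s) for s in trimmed)
    return [s * scale for s in trimmed]


# Toy waveform with near-silent edges, values in [-1.0, 1.0].
clip = [0.0, 0.001, -0.005, 0.2, -0.4, 0.1, 0.003, 0.0]
cleaned = trim_and_normalize(clip)
print(len(cleaned), max(abs(s) for s in cleaned))
```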

How to Get Started with the Model

Use the code below to get started with the model.

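from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa

model_id = "RareElf/swahili-wav2vec2-asr"

# Load the processor (feature extractor + tokenizer) and the fine-tuned CTC model.
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load the audio as mono and resample it to the 16 kHz rate the model expects.
audio, _ = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_values

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs).logits

# Greedy CTC decoding: pick the most likely token per frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)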

Training Details

Training Data

  • Dataset: Common Voice 11.0 – Swahili subset

Training Procedure

  • Fine-tuned using the Trainer API from 🤗 Transformers.
  • Loss: CTC (Connectionist Temporal Classification)
  • Optimizer: AdamW
  • Precision: fp16 (mixed precision)

Preprocessing

  • Resampled to 16 kHz
  • Normalized text
  • Removed empty, corrupted, or misaligned samples
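The card does not spell out the normalization or filtering rules, so the following is only a sketch of what such steps could look like. The function names, the choice to keep apostrophes (as in "ng'ombe"), and the duration limits are all assumptions.

```python
import re
import unicodedata


def normalize_transcript(text):
    """Lowercase, strip punctuation, and collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep word characters, whitespace, and apostrophes; replace the rest.
    text = re.sub(r"[^\w\s']", " ", text)
    text = text.replace("_", " ")
    return re.sub(r"\s+", " ", text).strip()


def keep_sample(transcript, duration_s, min_s=0.5, max_s=20.0):
    """Reject samples with empty text or implausible duration (assumed limits)."""
    return bool(normalize_transcript(transcript)) and min_s <= duration_s <= max_s


print(normalize_transcript("Habari  za asubuhi, rafiki!"))
print(keep_sample("...", 3.0))
```

Applying the same normalization to both training transcripts and evaluation references keeps punctuation or casing differences from being counted as recognition errors.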

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Epochs: 10
  • Batch Size: 16
  • Learning Rate: 3e-4
  • Warmup Steps: 500
  • Weight Decay: 0.01
  • Gradient Accumulation: 2
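To make the interaction between these values concrete, the sketch below computes the effective batch size and the learning rate at a given step. It assumes that the batch size of 16 is per device, that training ran on a single device, and that the schedule is linear warmup followed by linear decay; the total step count is a hypothetical placeholder, since the card does not state it.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def lr_at_step(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(effective_batch_size(16, 2))  # 16 samples x 2 accumulation steps
print(lr_at_step(250))              # halfway through warmup
print(lr_at_step(500))              # end of warmup, at the peak rate
```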

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Dataset: Held-out subset of Common Voice Swahili

Factors

[More Information Needed]

Metrics

  • WER (Word Error Rate)
  • BLEU (for translation use-case)
  • ROUGE (for paraphrase quality)
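For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version; it is not necessarily the implementation used to produce the scores in this card, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("habari za asubuhi", "habari ya asubuhi"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.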

Results

Metric    Score
WER       0.33
BLEU      0.44
ROUGE     0.66

Note: Evaluation scores are being finalized with the full test set.

Summary

Model Examination

  • Visualized attention maps suggest that the model learns phonetic and acoustic patterns relevant to Swahili.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: ***
  • Hours used: ~10
  • Cloud Provider: Google Cloud
  • Compute Region: ****
  • Carbon Emitted: ~X gCO2eq (estimation via mlco2)
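An estimate of this kind is usually energy consumed multiplied by the carbon intensity of the local grid. The sketch below shows that arithmetic only. The power draw and grid intensity are placeholder values, not measurements from this training run; only the figure of roughly 10 hours comes from this card.

```python
def estimate_emissions_g(power_watts, hours, grid_g_per_kwh, pue=1.0):
    """Estimate gCO2eq as energy (kWh) times grid carbon intensity."""
    # pue (power usage effectiveness) scales for data-center overhead.
    energy_kwh = power_watts / 1000 * hours * pue
    return energy_kwh * grid_g_per_kwh


# Placeholder inputs: 100 W and 400 gCO2eq/kWh are illustrative assumptions.
example = estimate_emissions_g(power_watts=100, hours=10, grid_g_per_kwh=400)
print(f"{example:.0f} gCO2eq (illustrative only)")
```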

Technical Specifications

Model Architecture and Objective

  • Architecture: Wav2Vec2 (large, XLSR; about 315M parameters) + CTC head
  • Objective: Predict character-level transcription from 16 kHz audio

Compute Infrastructure

[More Information Needed]

Hardware

  • Personal computer: Lenovo ThinkPad T14 Gen 1 (32 GB RAM, 1 TB SSD)

Software

  • Python 3.10
  • PyTorch 2.x
  • Transformers 4.39.x
  • Datasets 2.x

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary

  • ASR: Automatic Speech Recognition

  • WER: Word Error Rate

  • CTC: Connectionist Temporal Classification

Model Card Authors

  • Kevin Obote / RareElf / Guild Code Team

Model Card Contact
