High-Accuracy Swahili ASR Model for Clean Speech
This automatic speech recognition (ASR) model is trained using open multilingual datasets to provide high-accuracy transcription for clean, read-aloud Swahili speech.
It is ideal for tasks involving clean, well-structured speech input, such as reading assistants or general-purpose multilingual transcription.
This model is part of a broader ASR ablation study that examines how robust models are to different modes and variations of data collection. View all models on GitHub
We are particularly interested in validating the conclusions we've observed through our ablation studies:

- Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
- A combination of diverse datasets yields balanced results, but depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We invite the community to try out these models and help assess:

- How well the models perform on natural, conversational, or noisy audio
- Whether the improvements we've seen from combining diverse datasets generalize to your use case
- Gaps between benchmark results and real-world usability
Model
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
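Under the hood, Whisper steers these tasks through special decoder prompt tokens. As a brief, illustrative sketch of how the processor exposes this (shown with the upstream openai/whisper-large-v2 checkpoint, not a claim about this fine-tune's internals):

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

# Prompt ids telling the decoder: Swahili audio, transcribe in Swahili
transcribe_ids = processor.get_decoder_prompt_ids(language="swahili", task="transcribe")
# Prompt ids telling the decoder: Swahili audio, translate into English
translate_ids = processor.get_decoder_prompt_ids(language="swahili", task="translate")
```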
How to Use
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda"  # this example assumes a CUDA-capable GPU

# Load the processor from the base checkpoint and the fine-tuned weights
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Swahili-Decodis_Base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
).to(device)

# Language and task are passed via generate_kwargs below, so clear the
# legacy forced_decoder_ids to avoid a conflict during generation
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
```
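The pipeline accepts file paths directly, but raw arrays must be sampled at 16 kHz. A small optional sketch using librosa for resampling (librosa is an assumption here; any resampling library works):

```python
import librosa

# Load and resample the audio to the model's expected 16 kHz rate
audio, sr = librosa.load("audio.wav", sr=16000)
text_output = pipe({"raw": audio, "sampling_rate": 16000})["text"]
```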
Training Data
- Total Duration: ~400 hours
- Languages: Swahili (sw)
Training Setup
- Architecture: whisper-large-v2
- Framework: Whisper with Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition, prosodic augmentation, silence trimming (see the sketch after this list)
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch)
- Training steps: 3000
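The exact preprocessing pipeline is not published; as a rough illustration of the volume-normalization and silence-trimming steps listed above, one might do something like the following (librosa and the specific thresholds are assumptions, not the pipeline used in training):

```python
import librosa
import numpy as np

def preprocess(path: str, sr: int = 16000) -> np.ndarray:
    audio, _ = librosa.load(path, sr=sr)                # resample to 16 kHz
    audio = audio / (np.max(np.abs(audio)) + 1e-9)      # peak volume normalization
    audio, _ = librosa.effects.trim(audio, top_db=30)   # trim leading/trailing silence
    return audio
```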
Evaluation Data
- FLEURS
- Decodis Test Set (collected by DECODIS)
Evaluation Metric (WER, %)

| Dataset | This Model | Whisper Large V2 |
|---|---|---|
| FLEURS (benchmark) | 13.31 | 39.40 |
| Our test set (Decodis) | 69.86 | 99.98 |
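For reference, WER figures like those above can be reproduced with standard tooling. A minimal sketch using the Hugging Face `evaluate` library (the example strings are illustrative, not drawn from our test sets):

```python
import evaluate

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=["habari ya asubuhi"],
    references=["habari za asubuhi"],
)
print(f"WER: {100 * wer:.2f}%")  # word error rate as a percentage
```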
Intended Use

This model performs best on:
- Read or dictated speech
- Clean environments with minimal background noise
- Evaluation benchmarks like FLEURS

It is not recommended for real-world noisy conditions without domain adaptation.
Limitations
- Poor generalization to conversational or spontaneous speech
- Sensitive to background noise and overlapping speakers
- Accents outside the training data may reduce accuracy
Please try the models and share your feedback, issues, or results via:
- GitHub Issues: Submit an issue
- Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.