
🧠 High-Accuracy ASR Model for Clean Yoruba Speech


This automatic speech recognition (ASR) model is trained using three open multilingual datasets to provide high-accuracy transcription for clean, read-aloud Yoruba speech.

It is ideal for tasks involving clean, well-structured speech input, such as reading assistants or general-purpose multilingual transcription.

This model is part of a full ASR ablation study that analyzes how model robustness varies across different modes and variations of data collection. 👉 View all models on GitHub

We are particularly interested in validating the conclusions we've observed through our ablation studies:

  • Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
  • Combining diverse datasets yields balanced results, but this depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We are inviting the community to try out these models and help assess:

  1. How well the models perform on natural, conversational, or noisy audio
  2. Whether the improvements we've seen from combining diverse datasets generalize to your use case
  3. Gaps between benchmark results and real-world usability

Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.


🚀 How to Use

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda:0"  # float16 weights and flash attention assume a CUDA GPU

# Use flash attention when available, otherwise fall back to PyTorch SDPA.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Yoruba-Decodis_Base",
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
).to(device)

# Clear the legacy forced decoder IDs; language and task are passed
# through generate_kwargs instead.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "language": "yoruba",  # this is a Yoruba model, not English
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
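
The pipeline expects 16 kHz audio (see Training Setup below). If your recordings use a different sampling rate, resample before transcribing. A minimal sketch using librosa (any resampler works; "audio.wav" is a placeholder path):

import librosa

# Load and resample to 16 kHz mono; "audio.wav" is a placeholder path.
speech, sr = librosa.load("audio.wav", sr=16000, mono=True)

# The ASR pipeline also accepts a dict with the raw waveform and its sampling rate.
text_output = pipe({"raw": speech, "sampling_rate": sr})["text"]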

📊 Total Training Data Duration: ~40 hours


πŸ“ Languages: Yoruba (yo)

πŸ‹οΈβ€β™‚οΈ Training Setup

  • Architecture: whisper-large-v2
  • Framework: Whisper and Huggingface Transformers
  • Sampling rate: 16 kHz
  • Preprocessing: Volume normalization, High-Grade noise addition, Prosodic Augmentation, silence trimming
  • Learning Rate: 1e-5
  • Optimizer: Adamw_pytorch
  • Steps: 3000
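
For reference, here is a minimal sketch of how these hyperparameters might map onto Hugging Face Seq2SeqTrainingArguments. Only the learning rate, optimizer, and step count come from the list above; the batch size and warmup are illustrative assumptions, not reported values:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-yoruba",  # hypothetical output path
    learning_rate=1e-5,                      # listed learning rate
    max_steps=3000,                          # listed training steps
    optim="adamw_torch",                     # listed optimizer (AdamW, PyTorch)
    per_device_train_batch_size=8,           # assumption, not reported
    warmup_steps=300,                        # assumption, not reported
    fp16=True,                               # matches the float16 inference setup
    predict_with_generate=True,              # needed to compute WER during eval
)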

📦 Evaluation Data

  • FLEURS
  • DPP Test Set (Collected by DECODIS)

📈 Evaluation Metric (WER)

  Dataset              This Model   Whisper Large V2
  FLEURS (benchmark)   26.18        not reported
  Our test set (DPP)   81.29        not reported
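
To compare these numbers against your own data, WER can be computed with the evaluate library (backed by jiwer). A small sketch; the reference transcript is a placeholder:

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder ground-truth transcript and a prediction from the pipeline above.
references = ["placeholder reference transcript"]
predictions = [pipe("audio.wav")["text"]]

# compute() returns a fraction; multiply by 100 to match the WER values above.
print(100 * wer_metric.compute(references=references, predictions=predictions))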

🎯 Intended Use

This model performs best in:

  • Read or dictated speech
  • Clean environments with minimal noise
  • Evaluation benchmarks like FLEURS

Not recommended for real-world noisy conditions without domain adaptation.


⚠️ Limitations

  • Poor generalization to conversational or spontaneous speech
  • Sensitive to background noise and overlapping speakers
  • Accents outside training data may reduce accuracy

πŸ“ Please try the models and share your feedback, issues, or results via:

GitHub Issues: Submit an issue

Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

