
🧠 High-Accuracy ASR Model for Clean Yoruba Speech


This automatic speech recognition (ASR) model is trained using three open multilingual datasets to provide high-accuracy transcription for clean, read-aloud Yoruba speech.

It is ideal for tasks involving clean, well-structured speech input, such as reading assistants or general-purpose multilingual transcription.

This model is part of a full ASR ablation study that analyzes how model robustness varies across different modes and variations of data collection. 👉 View all models on GitHub

We are particularly interested in validating the conclusions we've observed through our ablation studies:

  • Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
  • Combining diverse datasets yields balanced results, but this depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We are inviting the community to try out these models and help assess:

  1. How well the models perform on natural, conversational, or noisy audio
  2. Whether the improvements we've seen from combining diverse datasets generalize to your use case
  3. Gaps between benchmark results and real-world usability

Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.


🚀 How to Use

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda:0"  # float16 weights and flash attention assume a CUDA GPU

# Use flash attention when available, otherwise fall back to PyTorch SDPA.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Yoruba-Decodis_Base",
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
).to(device)

# Clear the legacy forced decoder IDs; language and task are passed
# through generate_kwargs instead.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "language": "yoruba",  # this is a Yoruba model, not English
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
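
The pipeline expects 16 kHz audio (see Training Setup below). If your recordings use a different sampling rate, resample before transcribing. A minimal sketch using librosa (any resampler works; "audio.wav" is a placeholder path):

import librosa

# Load and resample to 16 kHz mono; "audio.wav" is a placeholder path.
speech, sr = librosa.load("audio.wav", sr=16000, mono=True)

# The ASR pipeline also accepts a dict with the raw waveform and its sampling rate.
text_output = pipe({"raw": speech, "sampling_rate": sr})["text"]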

📊 Total Training Data Duration: ~40 hours


πŸ“ Languages: Yoruba (yo)

πŸ‹οΈβ€β™‚οΈ Training Setup

  • Architecture: whisper-large-v2
  • Framework: Whisper and Huggingface Transformers
  • Sampling rate: 16 kHz
  • Preprocessing: Volume normalization, High-Grade noise addition, Prosodic Augmentation, silence trimming
  • Learning Rate: 1e-5
  • Optimizer: Adamw_pytorch
  • Steps: 3000
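
For reference, here is a minimal sketch of how these hyperparameters might map onto Hugging Face Seq2SeqTrainingArguments. Only the learning rate, optimizer, and step count come from the list above; the batch size and warmup are illustrative assumptions, not reported values:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-yoruba",  # hypothetical output path
    learning_rate=1e-5,                      # listed learning rate
    max_steps=3000,                          # listed training steps
    optim="adamw_torch",                     # listed optimizer (AdamW, PyTorch)
    per_device_train_batch_size=8,           # assumption, not reported
    warmup_steps=300,                        # assumption, not reported
    fp16=True,                               # matches the float16 inference setup
    predict_with_generate=True,              # needed to compute WER during eval
)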

📦 Evaluation Data

  • FLEURS
  • DPP Test Set (Collected by DECODIS)

📈 Evaluation Metric (WER)

  Dataset              This Model   Whisper Large V2
  FLEURS (benchmark)   26.18        not reported
  Our test set (DPP)   81.29        not reported
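
To compare these numbers against your own data, WER can be computed with the evaluate library (backed by jiwer). A small sketch; the reference transcript is a placeholder:

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder ground-truth transcript and a prediction from the pipeline above.
references = ["placeholder reference transcript"]
predictions = [pipe("audio.wav")["text"]]

# compute() returns a fraction; multiply by 100 to match the WER values above.
print(100 * wer_metric.compute(references=references, predictions=predictions))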

🎯 Intended Use

This model performs best in:

  • Read or dictated speech
  • Clean environments with minimal noise
  • Evaluation benchmarks like FLEURS

Not recommended for real-world noisy conditions without domain adaptation.


⚠️ Limitations

  • Poor generalization to conversational or spontaneous speech
  • Sensitive to background noise and overlapping speakers
  • Accents outside training data may reduce accuracy

πŸ“ Please try the models and share your feedback, issues, or results via:

GitHub Issues: Submit an issue

Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

