Model Card for Kinongono-Whisper-Large-V3

Model Details

Model Description

Kinongono-Whisper-Large-V3 is a fine-tuned version of OpenAI's Whisper Large-V3 model, adapted for speech recognition in Swahili and English using the SALT framework (Swahili and Associated Languages Techniques). It transcribes spoken audio in both languages; measured error rates are reported in the Evaluation section.

  • Developed by: Sartify LLC
  • Funded by: Sartify LLC
  • Shared by: Sartify LLC
  • Model type: Speech-to-Text Automatic Speech Recognition (ASR)
  • Language(s): English (eng) and Swahili (swa)
  • License: MIT License
  • Finetuned from model: OpenAI's Whisper Large-V3

Model Sources

Uses

Direct Use

This model is designed for transcribing spoken audio in Swahili and English. It performs particularly well for:

  • Transcription of natural conversations
  • Speech recognition for media content
  • Creating subtitles and captions
  • Documentation of spoken recordings
  • Language preservation and archiving

Downstream Use

The model can be integrated into:

  • Voice assistants with Swahili language support
  • Educational tools for language learning
  • Content creation workflows
  • Accessibility solutions for hearing-impaired users
  • Research applications for linguistics

Out-of-Scope Use

This model is not designed for:

  • Speaker identification or voice biometrics
  • Emotion detection from speech
  • Real-time transcription with strict low-latency requirements
  • Languages other than English and Swahili

Bias, Risks, and Limitations

  • The model may exhibit varying levels of accuracy depending on dialects, accents, and regional variations of Swahili.
  • Background noise, poor audio quality, or overlapping speakers may reduce transcription accuracy.
  • The model may not accurately transcribe specialized terminology or uncommon proper nouns.
  • Speech recognition models inherently carry risks of misrepresenting spoken content, which could have consequences in critical applications like legal or medical documentation.

Training Details

Training Data

The model was fine-tuned on a diverse dataset of Swahili and English audio recordings, including:

  • Common Voice datasets
  • Specially curated Swahili speech recordings
  • Additional proprietary data collected with speaker consent

Training Procedure

  • Fine-tuning Framework: Transformers by Hugging Face
  • Base Model: OpenAI's Whisper Large-V3
  • Training Hardware: NVIDIA A100 GPUs
  • Training Approach: Fine-tuning with specialized attention to Swahili phonetics and linguistic structures

Evaluation

Testing Data, Factors & Metrics

The model was evaluated on a separate test set of diverse Swahili and English recordings, measuring:

  • Word Error Rate (WER)
  • Character Error Rate (CER)
  • Transcription latency
  • Performance across different accents and dialects
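
For readers who want to reproduce the first two metrics on their own recordings, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. It is not the evaluation script used for this model. The function names are illustrative, and it assumes both strings have already been normalized the same way (casing, punctuation), which the card does not specify.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of three reference words
print(wer("habari ya asubuhi", "habari za asubuhi"))
```

Because both rates divide by the reference length, a hypothesis with many insertions can score above 1.0.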

Results

  • The model achieves a WER of approximately 15% on Swahili content and 11% on English content.
  • Performance varies based on audio quality, with optimal results on clear recordings without background noise.
  • The model shows robustness to mild accents but may struggle with heavy regional variations.

Environmental Impact

  • Estimated Carbon Emissions: The fine-tuning process required approximately [X] GPU hours, resulting in an estimated [Y] kg of CO₂ emissions.
  • Hardware Type: NVIDIA A100 GPUs
  • Location: East Africa data centers
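
The GPU-hour and emission figures above are not filled in. As a rough illustration of how such an estimate is usually derived, the sketch below applies a common back-of-the-envelope formula: energy is GPU hours times average GPU power times data-center overhead (PUE), and emissions are energy times the grid's carbon intensity. The function name and every input value in the example call are hypothetical and are not measurements for this model.

```python
def estimate_emissions_kg(gpu_hours, gpu_power_kw, pue, grid_kg_co2_per_kwh):
    """Estimate CO2 emissions in kg from GPU usage.

    gpu_hours: total GPU hours (number of GPUs times wall-clock hours)
    gpu_power_kw: assumed average power draw per GPU, in kilowatts
    pue: assumed power usage effectiveness of the data center (>= 1.0)
    grid_kg_co2_per_kwh: assumed carbon intensity of the local grid
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    energy_kwh = gpu_hours * gpu_power_kw * pue
    return energy_kwh * grid_kg_co2_per_kwh


# Hypothetical inputs for illustration only, NOT this model's training figures
print(estimate_emissions_kg(gpu_hours=100, gpu_power_kw=0.4, pue=1.5,
                            grid_kg_co2_per_kwh=0.5))
```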

Technical Specifications

Model Architecture and Objective

This model uses the Whisper architecture, an encoder-decoder Transformer. Its objective is accurate transcription of spoken language to text.

Input Format

  • Audio in WAV or MP3 format
  • 16kHz sample rate recommended
  • Various durations supported; best performance on 5-30 second segments
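
One simple way to keep longer recordings within the recommended segment length is to cut the waveform into consecutive pieces of at most 30 seconds before transcription. The sketch below assumes a mono waveform already resampled to 16 kHz; the helper name and constants are illustrative. Fixed boundaries can cut through a word, so splitting on silence may give better results in practice.

```python
import numpy as np

TARGET_SR = 16_000        # sample rate recommended above
MAX_SEGMENT_SECONDS = 30  # upper end of the 5-30 second range


def split_into_segments(audio, sample_rate=TARGET_SR, max_seconds=MAX_SEGMENT_SECONDS):
    """Split a 1-D waveform into consecutive segments of at most max_seconds each."""
    audio = np.asarray(audio)
    if audio.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    step = int(sample_rate * max_seconds)
    if step <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    return [audio[start:start + step] for start in range(0, len(audio), step)]


# A 70-second silent clip yields two full 30 s segments and one 10 s remainder
segments = split_into_segments(np.zeros(70 * TARGET_SR, dtype=np.float32))
print([len(s) / TARGET_SR for s in segments])
```

Each segment can then be passed to the processor and model in place of the full recording.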

Output Format

  • Plain text transcription
  • Option to include timestamps for longer content
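
As a sketch of how timestamped output could feed a captioning workflow, the code below uses the Transformers automatic-speech-recognition pipeline with return_timestamps=True, which returns a list of chunks carrying (start, end) times, and renders them as SRT text. The helper names are illustrative, and how well this fine-tuned checkpoint predicts timestamps has not been verified here.

```python
def format_timestamp(seconds):
    """Render a time in seconds as an SRT-style HH:MM:SS,mmm string."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert pipeline chunks ({'timestamp': (start, end), 'text': ...}) to SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk can lack an end time
            end = start
        entries.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


def transcribe_with_timestamps(audio_path):
    """Run the ASR pipeline on a file and return its timestamped chunks."""
    from transformers import pipeline  # imported lazily; loading the model is slow

    asr = pipeline(
        "automatic-speech-recognition",
        model="sartifyllc/kinongono-whisper-large-v3",
        chunk_length_s=30,  # process long audio in 30-second windows
    )
    return asr(audio_path, return_timestamps=True)["chunks"]
```

A typical call would be chunks_to_srt(transcribe_with_timestamps("your_audio_file.wav")), with the result written to a .srt file.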

How to Use

import transformers
import torch
import librosa

# Load model and processor
processor = transformers.WhisperProcessor.from_pretrained("sartifyllc/kinongono-whisper-large-v3")
model = transformers.WhisperForConditionalGeneration.from_pretrained("sartifyllc/kinongono-whisper-large-v3")

# Language tokens for Whisper
SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259,  # English
    'swa': 50318,  # Swahili
}

# Load audio as mono and resample to the 16 kHz rate Whisper expects
sample_rate = 16000
speech_array, _ = librosa.load("your_audio_file.wav", sr=sample_rate)

# Specify language
lang = 'swa'  # or 'eng' for English

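# Transcribe on GPU when available, otherwise on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Convert the waveform to log-mel input features
input_features = processor(
    speech_array, sampling_rate=sample_rate, return_tensors="pt"
).input_features.to(device)

predicted_ids = model.generate(
    input_features,
    # generate() takes the language as a string such as "<|sw|>",
    # so decode the language token ID back to its text form
    language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
    forced_decoder_ids=None,
)

# batch_decode returns a list with one transcription per input clip
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)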

Citation

If you use this model in research, please cite:

@misc{sartify2025kinongono,
  author = {Sartify LLC},
  title = {Kinongono-Whisper-Large-V3: Enhanced ASR for Swahili and English},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sartifyllc/kinongono-whisper-large-v3}}
}

Contact

For questions, support, or to report issues with this model, please contact:
