Model Card for Kinongono-Whisper-Large-V3
Model Details
Model Description
Kinongono-Whisper-Large-V3 is a fine-tuned version of OpenAI's Whisper Large-V3 model, adapted for speech recognition in Swahili and English using the SALT framework (Swahili and Associated Languages Techniques). It transcribes spoken content in both languages with high accuracy.
- Developed by: Sartify LLC
- Funded by: Sartify LLC
- Shared by: Sartify LLC
- Model type: Speech-to-Text Automatic Speech Recognition (ASR)
- Language(s): English (eng) and Swahili (swa)
- License: MIT License
- Finetuned from model: OpenAI's Whisper Large-V3
Model Sources
- Repository: https://github.com/sartifyllc/kinongono-asr-interface
- Model on Hugging Face: sartifyllc/kinongono-whisper-large-v3
Uses
Direct Use
This model is designed for transcribing spoken audio in Swahili and English. It performs particularly well for:
- Transcription of natural conversations
- Speech recognition for media content
- Creating subtitles and captions
- Documentation of spoken recordings
- Language preservation and archiving
Downstream Use
The model can be integrated into:
- Voice assistants with Swahili language support
- Educational tools for language learning
- Content creation workflows
- Accessibility solutions for hearing-impaired users
- Research applications for linguistics
Out-of-Scope Use
This model is not designed for:
- Speaker identification or voice biometrics
- Emotion detection from speech
- Real-time transcription under strict latency constraints
- Languages other than English and Swahili
Bias, Risks, and Limitations
- The model may exhibit varying levels of accuracy depending on dialects, accents, and regional variations of Swahili.
- Background noise, poor audio quality, or overlapping speakers may reduce transcription accuracy.
- The model may not accurately transcribe specialized terminology or uncommon proper nouns.
- Speech recognition models inherently carry risks of misrepresenting spoken content, which could have consequences in critical applications like legal or medical documentation.
Training Details
Training Data
The model was fine-tuned on a diverse dataset of Swahili and English audio recordings, including:
- Common Voice datasets
- Specially curated Swahili speech recordings
- Additional proprietary data collected with speaker consent
Training Procedure
- Fine-tuning Framework: Transformers by Hugging Face
- Base Model: OpenAI's Whisper Large-V3
- Training Hardware: NVIDIA A100 GPUs
- Training Approach: Fine-tuning with specialized attention to Swahili phonetics and linguistic structures (an illustrative training setup is sketched below)
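The exact training script is not published. As a rough orientation only, Whisper fine-tunes with the Transformers library are typically driven by Seq2SeqTrainer; in the sketch below, the hyperparameters and the train_dataset, eval_dataset, and data_collator objects are illustrative placeholders, not the configuration actually used for this model.

# Illustrative only: hyperparameters and dataset objects are placeholders.
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

args = Seq2SeqTrainingArguments(
    output_dir="kinongono-whisper-large-v3",
    per_device_train_batch_size=16,   # assumed; sized for an A100
    gradient_accumulation_steps=2,    # assumed
    learning_rate=1e-5,               # a common choice for Whisper fine-tuning
    warmup_steps=500,                 # assumed
    max_steps=5000,                   # assumed
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # hypothetical: log-mel features + label ids
    eval_dataset=eval_dataset,        # hypothetical held-out split
    data_collator=data_collator,      # hypothetical: pads features and labels
)
trainer.train()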
Evaluation
Testing Data, Factors & Metrics
The model was evaluated on a held-out test set of diverse Swahili and English recordings, measuring the following (the error metrics can be reproduced as sketched after this list):
- Word Error Rate (WER)
- Character Error Rate (CER)
- Transcription latency
- Performance across different accents and dialects
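As a reference point, WER and CER for any set of transcript pairs can be computed with the Hugging Face evaluate library; the reference and prediction strings below are placeholders.

# Computing WER and CER with the Hugging Face `evaluate` library.
# The reference and prediction strings are placeholders.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["habari ya asubuhi"]   # ground-truth transcripts
predictions = ["habari ya subuhi"]   # model outputs

print(f"WER: {wer_metric.compute(references=references, predictions=predictions):.2%}")
print(f"CER: {cer_metric.compute(references=references, predictions=predictions):.2%}")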
Results
- The model achieves a WER of approximately 15% on Swahili content and 11% on English content.
- Performance varies based on audio quality, with optimal results on clear recordings without background noise.
- The model shows robustness to mild accents but may struggle with heavy regional variations.
Environmental Impact
- Estimated Carbon Emissions: The fine-tuning process required approximately [X] GPU hours, resulting in an estimated [Y] kg of CO₂ emissions.
- Hardware Type: NVIDIA A100 GPUs
- Location: East Africa data centers
Technical Specifications
Model Architecture and Objective
This model utilizes the Whisper architecture, which is based on an encoder-decoder Transformer. The objective is accurate transcription of spoken language to text.
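For orientation, the encoder-decoder dimensions inherited from Whisper Large-V3 can be read off the model config without downloading the weights; the commented values are what the stock large-v3 architecture reports.

# Inspecting the encoder-decoder configuration inherited from Whisper Large-V3.
from transformers import WhisperConfig

cfg = WhisperConfig.from_pretrained("sartifyllc/kinongono-whisper-large-v3")
print(cfg.encoder_layers, cfg.decoder_layers, cfg.d_model)  # 32 32 1280 for large-v3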
Input Format
- Audio in WAV or MP3 format
- 16kHz sample rate recommended
- Various durations supported; optimal performance on 5-30 second segments (see the chunking sketch below)
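Whisper's feature window is 30 seconds, so longer recordings are usually split before transcription. A minimal sketch (the file name is a placeholder):

# Splitting a long recording into 30-second chunks at 16 kHz.
# The file name is a placeholder.
import librosa

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

audio, _ = librosa.load("long_recording.wav", sr=SAMPLE_RATE)  # mono, resampled
chunk_size = CHUNK_SECONDS * SAMPLE_RATE
chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]
# Each chunk can then be transcribed as in the "How to Use" section below.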
Output Format
- Plain text transcription
- Option to include timestamps for longer content (see the sketch below)
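With the Transformers API, segment-level timestamps can be requested at generation time. A minimal sketch, assuming model, processor, and input_features are set up as in the How to Use section below:

# Requesting segment timestamps; assumes `model`, `processor`, and
# `input_features` are prepared as in the "How to Use" example below.
predicted_ids = model.generate(input_features, return_timestamps=True)
text_with_times = processor.batch_decode(
    predicted_ids, skip_special_tokens=True, decode_with_timestamps=True
)
print(text_with_times[0])  # e.g. "<|0.00|> Habari ya asubuhi.<|3.20|>"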
How to Use
The snippet below loads the model with the Transformers library, resamples an audio file to 16 kHz, and transcribes it in the requested language:
import transformers
import torch
import librosa
# Load model and processor
processor = transformers.WhisperProcessor.from_pretrained("sartifyllc/kinongono-whisper-large-v3")
model = transformers.WhisperForConditionalGeneration.from_pretrained("sartifyllc/kinongono-whisper-large-v3")
# Language tokens for Whisper
SALT_LANGUAGE_TOKENS_WHISPER = {
'eng': 50259, # English
'swa': 50318, # Swahili
}
# Load audio and resample to 16 kHz (Whisper's expected sample rate)
speech_array, sample_rate = librosa.load("your_audio_file.wav", sr=16000)
# Specify language
lang = 'swa' # or 'eng' for English
# Select device and move the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
input_features = processor(
speech_array, sampling_rate=sample_rate, return_tensors="pt"
).input_features
input_features = input_features.to(device)
# Transcribe; the language argument decodes the token id to its string form, e.g. "<|sw|>"
predicted_ids = model.generate(
    input_features,
    language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
    forced_decoder_ids=None,
)
# batch_decode returns a list of strings, one per input
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
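For long-form audio, the high-level pipeline API is an alternative that handles chunking automatically; a minimal sketch (the file name is a placeholder, and it assumes the model's Swahili support is reachable through the standard language argument):

# Alternative: the high-level ASR pipeline, which chunks long recordings
# automatically. The file name is a placeholder.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="sartifyllc/kinongono-whisper-large-v3",
    chunk_length_s=30,                        # matches Whisper's feature window
    device=0 if torch.cuda.is_available() else -1,
)
result = asr(
    "long_recording.wav",
    generate_kwargs={"language": "swahili"},  # or "english"
    return_timestamps=True,                   # useful for subtitles/captions
)
print(result["text"])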
Citation
If you use this model in research, please cite:
@misc{sartify2025kinongono,
  author       = {Sartify LLC},
  title        = {Kinongono-Whisper-Large-V3: Enhanced ASR for Swahili and English},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sartifyllc/kinongono-whisper-large-v3}}
}
Contact
For questions, support, or to report issues with this model, please contact:
- Email: [email protected]
- GitHub Issues: https://github.com/sartifyllc/kinongono-asr-interface/issues