Model Card for Kinongono-Whisper-Large-V3

Model Details

Model Description

Kinongono-Whisper-Large-V3 is a fine-tuned version of OpenAI's Whisper Large-V3 model, adapted for speech recognition in Swahili and English using the SALT framework (Swahili and Associated Languages Techniques). It transcribes spoken audio in both languages; measured error rates are reported under Results.

Developed by: Sartify LLC
Funded by: Sartify LLC
Shared by: Sartify LLC
Model type: Speech-to-Text Automatic Speech Recognition (ASR)
Language(s): English (eng) and Swahili (swa)
License: MIT License
Finetuned from model: OpenAI's Whisper Large-V3

Model Sources

Repository: GitHub repository
Model on Hugging Face: sartifyllc/kinongono-whisper-large-v3

Uses

Direct Use

This model is designed for transcribing spoken audio in Swahili and English. It performs particularly well for:

Transcription of natural conversations
Speech recognition for media content
Creating subtitles and captions
Documentation of spoken recordings
Language preservation and archiving

Downstream Use

The model can be integrated into:

Voice assistants with Swahili language support
Educational tools for language learning
Content creation workflows
Accessibility solutions for hearing-impaired users
Research applications for linguistics

Out-of-Scope Use

This model is not designed for:

Speaker identification or voice biometrics
Emotion detection from speech
Real-time transcription with strict low-latency requirements
Languages other than English and Swahili

Bias, Risks, and Limitations

The model may exhibit varying levels of accuracy depending on dialects, accents, and regional variations of Swahili.
Background noise, poor audio quality, or overlapping speakers may reduce transcription accuracy.
The model may not accurately transcribe specialized terminology or uncommon proper nouns.
Speech recognition models inherently carry risks of misrepresenting spoken content, which could have consequences in critical applications like legal or medical documentation.

Training Details

Training Data

The model was fine-tuned on a diverse dataset of Swahili and English audio recordings, including:

Common Voice datasets
Specially curated Swahili speech recordings
Additional proprietary data collected with speaker consent

Training Procedure

Fine-tuning Framework: Transformers by Hugging Face
Base Model: OpenAI's Whisper Large-V3
Training Hardware: NVIDIA A100 GPUs
Training Approach: Fine-tuning with specialized attention to Swahili phonetics and linguistic structures

Evaluation

Testing Data, Factors & Metrics

The model was evaluated on a separate test set of diverse Swahili and English recordings, measuring:

Word Error Rate (WER)
Character Error Rate (CER)
Transcription latency
Performance across different accents and dialects
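
As an illustration of the first two metrics, the following is a minimal sketch of how WER and CER are conventionally computed from a reference transcript and a model hypothesis using edit distance. It is not the evaluation script used for this model; the function names and the absence of text normalization (casing, punctuation) are assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three reference words
print(wer("habari ya asubuhi", "habari za asubuhi"))
# One substituted character out of seventeen reference characters
print(cer("habari ya asubuhi", "habari za asubuhi"))
```

In practice, both strings are usually normalized the same way before scoring, since differences in casing or punctuation otherwise count as errors.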

Results

The model achieves a WER of approximately 15% on Swahili content and 11% on English content.
Performance varies based on audio quality, with optimal results on clear recordings without background noise.
The model shows robustness to mild accents but may struggle with heavy regional variations.

Environmental Impact

Estimated Carbon Emissions: The fine-tuning process required approximately [X] GPU hours, resulting in an estimated [Y] kg of CO₂ emissions.
Hardware Type: NVIDIA A100 GPUs
Location: East Africa data centers
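
For readers who want to see how such an estimate is typically derived, here is a hedged sketch of the common calculation: energy use (GPU hours × average power draw × data-center overhead) multiplied by the carbon intensity of the local grid. The function name and all four parameters are assumptions for illustration; this card does not state the power draw, overhead factor, or grid intensity that apply to this model.

```python
def estimate_co2_kg(gpu_hours, gpu_power_kw, pue, grid_kg_co2_per_kwh):
    """Rough CO2 estimate for a training run.

    gpu_hours            total GPU hours (the [X] value above)
    gpu_power_kw         assumed average power draw per GPU, in kilowatts
    pue                  assumed data-center power usage effectiveness (>= 1.0)
    grid_kg_co2_per_kwh  assumed carbon intensity of the local electricity grid
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    energy_kwh = gpu_hours * gpu_power_kw * pue
    return energy_kwh * grid_kg_co2_per_kwh
```

Substituting the actual GPU hours for [X], together with measured or published values for the other three inputs, would yield the [Y] figure.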

Technical Specifications

Model Architecture and Objective

This model utilizes the Whisper architecture, which is based on an encoder-decoder Transformer. The objective is accurate transcription of spoken language to text.

Input Format

Audio in WAV or MP3 format
16 kHz sample rate recommended
Various durations supported; best performance on segments of 5 to 30 seconds
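
One simple way to keep inputs within the recommended segment length is to cut a long recording into fixed-size windows before transcription. The sketch below computes the sample index ranges for such windows; `segment_bounds` and its defaults are hypothetical helpers, not part of this model's tooling. Fixed cuts can split a word across two segments, so cutting at silences is often preferable in practice.

```python
def segment_bounds(num_samples, sample_rate=16000, max_seconds=30.0):
    """Return (start, end) sample indices for segments of at most max_seconds."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    step = int(max_seconds * sample_rate)
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


# A 70-second recording at 16 kHz yields two 30-second segments and one 10-second segment
for start, end in segment_bounds(70 * 16000):
    print(start, end)
```

Each pair can then be used to slice the loaded waveform, for example `speech_array[start:end]`, before feature extraction. The final segment may be shorter than 5 seconds and can be merged with the previous one if needed.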

Output Format

Plain text transcription
Option to include timestamps for longer content
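
When timestamps are requested, the output can be turned into captions. The sketch below formats timestamped segments as SRT subtitle text. It assumes each segment is a dict with a `"timestamp"` pair in seconds and a `"text"` string, which is the shape returned in the `"chunks"` field by the Transformers automatic-speech-recognition pipeline when called with `return_timestamps=True`. The helper names are hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert timestamped chunks into the text of an SRT subtitle file."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: an end time can be missing for the last chunk,
            # so fall back to the start time rather than fail.
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Illustrative input only, not real model output
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Habari ya asubuhi."},
    {"timestamp": (2.5, 5.0), "text": " Good morning."},
]
print(chunks_to_srt(example_chunks))
```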

How to Use

import librosa
import torch
import transformers

MODEL_ID = "sartifyllc/kinongono-whisper-large-v3"

# Load the processor and model, moving the model to the GPU once if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = transformers.WhisperProcessor.from_pretrained(MODEL_ID)
model = transformers.WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Language token IDs for Whisper
SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259,  # English
    'swa': 50318,  # Swahili
}

# Load the audio as mono and resample it to the 16 kHz rate Whisper expects
sample_rate = 16000
speech_array, _ = librosa.load("your_audio_file.wav", sr=sample_rate)

# Specify language
lang = 'swa'  # or 'eng' for English

# Convert the waveform to model input features
input_features = processor(
    speech_array, sampling_rate=sample_rate, return_tensors="pt"
).input_features.to(device)

# Transcribe; the language token ID is decoded to its string form for generate()
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
        forced_decoder_ids=None,
    )

# batch_decode returns one string per input; print the single transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Citation

If you use this model in research, please cite:

@misc{sartify2025kinongono,
  author = {Sartify LLC},
  title = {Kinongono-Whisper-Large-V3: Enhanced ASR for Swahili and English},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sartifyllc/kinongono-whisper-large-v3}}
}

Contact

For questions, support, or to report issues with this model, please contact:

Email: [email protected]
GitHub Issues: https://github.com/sartifyllc/kinongono-asr-interface/issues
