TWB Voice Hausa Multi-Speaker TTS 1.0

Model Description

This is a multi-speaker text-to-speech (TTS) model for the Hausa language, fine-tuned from the CML-TTS multilingual checkpoint using the YourTTS architecture. It supports three speakers, one female (spk_f_1) and two male (spk_m_1, spk_m_2), and generates Hausa speech from text input.

Model Details

Model Architecture: YourTTS (VITS-based)
Base Model: CML-TTS Dataset multilingual checkpoint
Language: Hausa (ha, hau)
Sample Rate: 24 kHz
Speakers: 3 (spk_f_1: female, spk_m_1: male, spk_m_2: male)
Model Type: Multi-speaker neural TTS
Framework: Coqui TTS

Model Architecture Details

Text Encoder: 10-layer transformer with 2 attention heads
Hidden Channels: 192
FFN Hidden Channels: 768
Decoder: HiFi-GAN with ResBlock type 2
Flow Layers: 4 coupling layers
Posterior Encoder: 16-layer WaveNet
Speaker Embedding: 512-dimensional d-vectors
Language Embedding: 4-dimensional language embeddings

Training Data

The model was trained on approximately 30 hours of Hausa speech data from two high-quality sources:

Female Speaker (spk_f_1)

Source: TWB Voice project
Dialect: Kenanci
Duration: ~10 hours
Sample Dataset: TWB-voice-TTS-Hausa-1.0-sampleset
Description: High-quality female voice recordings collected within the TWB Voice 1.0 project.

Male Speakers (spk_m_1, spk_m_2)

Source: Biblica open.bible project
Duration: ~10 hours each (20 hours total)
Origin: open.bible
Description: Bible recordings featuring two consistent male speakers

Data Preprocessing

Original Sample Rate: 48 kHz → Target: 24 kHz (downsampled)
Audio Format: Mono WAV files
Text Processing: Lowercase conversion, diacritics preserved
Quality Filters:
- Duration: 0.5s - 20s
- Text length: minimum 10 characters
Train/Dev Split: 95% train, 5% validation
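
The filtering and split rules above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing script: the `Utterance` record, the function names, the inclusive bounds, and the fixed shuffle seed are all assumptions introduced here. Resampling from 48 kHz to 24 kHz is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record for one audio clip and its transcript."""
    text: str
    duration_s: float


# Thresholds taken from the quality filters listed above.
MIN_DURATION_S = 0.5
MAX_DURATION_S = 20.0
MIN_TEXT_CHARS = 10


def keep(utt: Utterance) -> bool:
    """Return True if the clip passes the duration and text-length filters.

    Treating the bounds as inclusive is an assumption.
    """
    return (MIN_DURATION_S <= utt.duration_s <= MAX_DURATION_S
            and len(utt.text) >= MIN_TEXT_CHARS)


def preprocess(utts, dev_fraction=0.05, seed=0):
    """Lowercase transcripts, apply the filters, and split into (train, dev)."""
    cleaned = [Utterance(u.text.lower(), u.duration_s) for u in utts]
    kept = [u for u in cleaned if keep(u)]
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(kept)
    n_dev = round(len(kept) * dev_fraction)
    return kept[n_dev:], kept[:n_dev]
```

Calling `preprocess(utterances)` returns the training list first and the validation list second, with roughly 5% of the surviving clips in the validation list.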

Character Set

The model supports standard Hausa orthography including diacritics:

Characters: abcdefghijklmnopqrstuvwxyzăāɓɗƙƴū

Punctuation: !'(),-.:;? and space
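
Text containing characters outside this inventory (digits, for example) may not be synthesized as intended, so it can help to check input before sending it to the model. The sketch below is one way to do that; the helper name `unsupported_chars` is hypothetical, and the NFC normalization step assumes the model's character set uses precomposed letters.

```python
import unicodedata

# Character inventory copied from the model card.
LETTERS = "abcdefghijklmnopqrstuvwxyzăāɓɗƙƴū"
PUNCTUATION = "!'(),-.:;? "
SUPPORTED = set(LETTERS) | set(PUNCTUATION)


def unsupported_chars(text: str) -> list[str]:
    """Return the distinct characters in `text` that are not in the inventory.

    The text is lowercased first, since the model expects lowercase input,
    and normalized to NFC (an assumption about how the inventory is encoded).
    """
    normalized = unicodedata.normalize("NFC", text.lower())
    missing = []
    for ch in normalized:
        if ch not in SUPPORTED and ch not in missing:
            missing.append(ch)
    return missing
```

An empty list means every character is covered; anything else is a candidate for rewriting, for instance spelling out numbers in words.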

Training

The model was fine-tuned for 1000 epochs with the following configuration:

Batch Size: 12
Learning Rate: 0.0001 (generator and discriminator)
Mixed Precision: FP16
Optimizer: AdamW
Loss Components: Mel loss (α=45.0), Speaker encoder loss (α=9.0)
GPU Setup: 2x NVIDIA GeForce RTX 2080

Evaluation Results

Human evaluators rated the model's output on a set of 30 Hausa sentences using two criteria: pronunciation accuracy and speech naturalness.

Sample Evaluation Sentences:

"lokacin damuna shuka kan koriya shar."
"lafiyarku tafi kuɗinku muhimmanci."
"a kiyayi inda ake samun labarun magani ko kariya da cututtuka."

Evaluation Metrics:

Pronunciation Accuracy (1-5 scale): 2.52 average
Speech Naturalness (1-5 scale): 2.93 average

Usage

Installation

pip install coqui-tts torch scipy numpy huggingface_hub

Quick Start

import json
import tempfile

import numpy as np
import scipy.io.wavfile as wavfile
from huggingface_hub import hf_hub_download
from TTS.api import TTS

# Download and setup model files
config_path = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "config.json")
with open(config_path, 'r') as f:
    config = json.load(f)

# Download required files
model_path = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "best_model_498283.pth")
speakers_file = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "speakers.pth")
language_ids_file = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "language_ids.json")
d_vector_file = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "d_vector.pth")
config_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "config_se.json")
model_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Hausa-TTS-1.0", "model_se.pth")

# Update config paths
config["speakers_file"] = speakers_file
config["language_ids_file"] = language_ids_file
config["d_vector_file"] = [d_vector_file]
config["model_args"]["speakers_file"] = speakers_file
config["model_args"]["language_ids_file"] = language_ids_file
config["model_args"]["d_vector_file"] = [d_vector_file]
config["model_args"]["speaker_encoder_config_path"] = config_se_file
config["model_args"]["speaker_encoder_model_path"] = model_se_file

# Save updated config
temp_config = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
json.dump(config, temp_config, indent=2)
temp_config.close()

# Load TTS model
tts = TTS(model_path=model_path, config_path=temp_config.name)

# Generate speech
text = "lokacin damuna shuka kan koriya shar."
speaker = "spk_f_1"  # Options: spk_f_1 (female), spk_m_1 (male), spk_m_2 (male)

wav = tts.synthesizer.tts(text=text.lower(), speaker_name=speaker)
wav_array = np.array(wav, dtype=np.float32)
wavfile.write("output.wav", tts.synthesizer.output_sample_rate, wav_array)
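
The snippet above writes 32-bit float samples. If a downstream tool expects 16-bit PCM instead, the conversion can be sketched with the standard library alone. The helper name `write_pcm16` is hypothetical, and the sketch assumes the synthesizer returns mono samples roughly in the range [-1, 1]; values outside that range are clipped.

```python
import struct
import wave


def write_pcm16(path, samples, sample_rate):
    """Write mono float samples as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))      # clip to avoid integer overflow
        frames += struct.pack("<h", int(s * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)                      # mono
        f.setsampwidth(2)                      # 2 bytes per sample = 16 bits
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
```

With the model loaded as above, this would be called as `write_pcm16("output_pcm16.wav", wav, tts.synthesizer.output_sample_rate)`.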

Batch Inference

For batch processing multiple sentences, use the provided inference script:

python3 inference.py /path/to/model.pth
# or
python3 inference.py /path/to/model/directory

The script will generate audio for all evaluation sentences with all available speakers.
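
The sentence-by-speaker loop this describes can be pictured as follows. This is only a sketch of the idea, not the contents of `inference.py`: `synthesize_all`, its `synthesize` callback, and the output naming scheme are hypothetical. The callback stands in for a call such as `tts.synthesizer.tts(text=..., speaker_name=...)` from the Quick Start.

```python
SPEAKERS = ["spk_f_1", "spk_m_1", "spk_m_2"]


def synthesize_all(synthesize, sentences, speakers=SPEAKERS):
    """Run `synthesize(text, speaker)` for every sentence/speaker pair.

    Returns a dict mapping an output file name to the generated waveform.
    """
    outputs = {}
    for i, sentence in enumerate(sentences, start=1):
        for speaker in speakers:
            name = f"{speaker}_{i:03d}.wav"    # hypothetical naming scheme
            outputs[name] = synthesize(sentence.lower(), speaker)
    return outputs
```

With the Quick Start model loaded, the callback could be `lambda text, speaker: tts.synthesizer.tts(text=text, speaker_name=speaker)`, and each returned waveform can then be written to disk under its file name.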

Model Limitations

Language: Supports Hausa only
Input Format: Requires lowercase text input
Speakers: Limited to 3 pre-trained speaker identities
Domain: Trained primarily on religious/educational content and general speech
Code-switching: Not optimized for mixed-language input, although the training data does include English names and places

Ethical Considerations

Consent: All training data was used with appropriate permissions
Bias: Model reflects the speech patterns and characteristics of the specific speakers in training data
Cultural Sensitivity: Trained on religious and educational content; may not capture all dialectal variations
Use Cases: Intended for educational, accessibility, and content creation purposes
Non-Commercial: This model is released for non-commercial use only

Licensing

This model is released under the CC-BY-NC license. For commercial licensing or other uses, please contact [email protected].

Citation

If you use this model in your research or applications, please cite:

@misc{yourtts-hausa-2025,
  title={YourTTS Hausa Multi-Speaker Text-to-Speech Model},
  author={Alp Öktem},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/CLEAR-Global/TWB-Voice-Hausa-TTS-1.0}
}

Acknowledgments

This model was created by CLEAR Global with support from the Patrick J. McGovern Foundation. We acknowledge the following open-source projects and resources that made this model possible:

Idiap Coqui TTS: For the YourTTS architecture and training framework
CML-TTS Dataset: For the multilingual base model
TWB Voice Project: For high-quality Hausa voice data
Biblica open.bible: For Bible recording contributions

Model Card Authors

Alp Öktem ([email protected])
