Multi-Speaker VITS Model for Hausa

This is a multi-speaker extension of the MMS-TTS Hausa model from Meta.

Model Details

  • Base model: facebook/mms-tts-hau
  • Number of speakers: 10
  • Model class: MultiSpeakerVITS
  • Language: Hausa (hau)
  • Task: Text-to-Speech (TTS)

Model Architecture

This model extends the original MMS-TTS Hausa model with multi-speaker capabilities by:

  1. Adding speaker embeddings for 10 different speakers
  2. Conditioning the text encoder output with speaker information (sketched below)
  3. Maintaining compatibility with the original VITS architecture
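
The exact class definition is not reproduced in this card, but the conditioning step can be sketched in isolation. The snippet below is a minimal illustration of the idea, not the original implementation; the broadcast-add and the hidden size of 192 (the default for MMS-TTS VITS checkpoints) are assumptions based on the description above.

import torch

n_speakers, hidden_size = 10, 192     # 192 is the MMS-TTS default hidden size
speaker_embedding = torch.nn.Embedding(n_speakers, hidden_size)

# Stand-in for the text encoder output: (batch, sequence_length, hidden_size)
encoder_out = torch.randn(2, 24, hidden_size)
speaker_ids = torch.tensor([0, 7])    # one speaker ID per batch item

# Broadcast-add one learned vector per speaker across all time steps
conditioned = encoder_out + speaker_embedding(speaker_ids).unsqueeze(1)
print(conditioned.shape)              # torch.Size([2, 24, 192])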

Usage

import torch
from transformers import VitsModel, VitsTokenizer

# Load the base model and tokenizer
base_model = VitsModel.from_pretrained("facebook/mms-tts-hau")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-hau")

# Load the multi-speaker checkpoint
checkpoint = torch.load("multispeaker_vits_template.pth", map_location="cpu")

# Define the MultiSpeakerVITS class. Its full definition ships with the
# original training code and is not reproduced in this card; the checkpoint's
# state dict must match it (the conditioning idea is sketched under
# Model Architecture above).
class MultiSpeakerVITS(torch.nn.Module):
    # ... (copy the class definition from the original code)
    pass

# Create and load the multi-speaker model
ms_model = MultiSpeakerVITS(base_model, n_speakers=10)
ms_model.load_state_dict(checkpoint["model_state"])
ms_model.eval()

# Example usage
text = "Sannu, ina kwana?"  # "Hello, how are you?" in Hausa
inputs = tokenizer(text, return_tensors="pt")
speaker_id = torch.tensor([0])  # Choose speaker 0-9

with torch.no_grad():
    output = ms_model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        speaker_ids=speaker_id
    )
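
If the wrapper returns the base model's output object, the synthesized audio is in its waveform field, sampled at base_model.config.sampling_rate (16 kHz for MMS-TTS). Under that assumption, saving the result to disk looks like this:

import scipy.io.wavfile

waveform = output.waveform[0].cpu().numpy()   # assumes a VitsModel-style output
scipy.io.wavfile.write(
    "hausa_tts.wav",
    rate=base_model.config.sampling_rate,     # 16000 for MMS-TTS
    data=waveform,
)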

Training

This is a template model: its speaker embeddings are newly initialized, not trained. To use it effectively, you'll need to:

  1. Prepare a speaker-labeled dataset: Each audio sample should be labeled with a speaker ID (0 to 9)
  2. Fine-tune on multi-speaker Hausa data: Train the speaker embeddings and, optionally, fine-tune the base model
  3. Implement a training loop: Feed both the text inputs and the speaker IDs to the model (a minimal sketch follows this list)
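
As a point of reference, here is a bare-bones loop under stated assumptions: dataloader is a hypothetical DataLoader yielding batches with "input_ids", "attention_mask", "speaker_id", and target "waveform" fields, and tts_loss is a hypothetical stand-in for the VITS training objectives (reconstruction, KL, duration, and adversarial terms), none of which ship with this repository.

import torch

optimizer = torch.optim.AdamW(ms_model.parameters(), lr=2e-4)
ms_model.train()

for batch in dataloader:                        # hypothetical DataLoader
    optimizer.zero_grad()
    output = ms_model(
        input_ids=batch["input_ids"],
        attention_mask=batch.get("attention_mask"),
        speaker_ids=batch["speaker_id"],
    )
    loss = tts_loss(output, batch["waveform"])  # hypothetical VITS loss helper
    loss.backward()
    optimizer.step()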

Files

  • multispeaker_vits_template.pth: PyTorch checkpoint containing model weights
  • config.json: Model configuration and metadata
  • README.md: This documentation
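
A quick sanity check after downloading the files; the "model_state" key is the one used in the usage snippet above, and any other keys in the checkpoint are not documented here.

import json
import torch

ckpt = torch.load("multispeaker_vits_template.pth", map_location="cpu")
print(list(ckpt.keys()))         # expect at least "model_state"

with open("config.json") as f:
    print(json.load(f))          # model configuration and metadata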

Citation

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Conneau, Alexis and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

License

This model is based on Meta's MMS-TTS checkpoints, which are released under the CC-BY-NC 4.0 license, and follows the same licensing terms.
