Multi-Speaker VITS Model for Hausa

This is a multi-speaker extension of the MMS-TTS Hausa model from Meta.

Model Details

  • Base model: facebook/mms-tts-hau
  • Number of speakers: 10
  • Model class: MultiSpeakerVITS
  • Language: Hausa (hau)
  • Task: Text-to-Speech (TTS)

Model Architecture

This model extends the original MMS-TTS Hausa model with multi-speaker capabilities by:

  1. Adding speaker embeddings for 10 different speakers
  2. Conditioning the text encoder output with speaker information (sketched below)
  3. Maintaining compatibility with the original VITS architecture
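
The exact class definition is not reproduced in this card, but the conditioning step can be sketched in isolation. The snippet below is a minimal illustration of the idea, not the original implementation; the broadcast-add and the hidden size of 192 (the default for MMS-TTS VITS checkpoints) are assumptions based on the description above.

import torch

n_speakers, hidden_size = 10, 192     # 192 is the MMS-TTS default hidden size
speaker_embedding = torch.nn.Embedding(n_speakers, hidden_size)

# Stand-in for the text encoder output: (batch, sequence_length, hidden_size)
encoder_out = torch.randn(2, 24, hidden_size)
speaker_ids = torch.tensor([0, 7])    # one speaker ID per batch item

# Broadcast-add one learned vector per speaker across all time steps
conditioned = encoder_out + speaker_embedding(speaker_ids).unsqueeze(1)
print(conditioned.shape)              # torch.Size([2, 24, 192])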

Usage

import torch
from transformers import VitsModel, VitsTokenizer

# Load the base model and tokenizer
base_model = VitsModel.from_pretrained("facebook/mms-tts-hau")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-hau")

# Load the multi-speaker checkpoint
checkpoint = torch.load("multispeaker_vits_template.pth", map_location="cpu")

# Define the MultiSpeakerVITS class. Its full definition ships with the
# original training code and is not reproduced in this card; the checkpoint's
# state dict must match it (the conditioning idea is sketched under
# Model Architecture above).
class MultiSpeakerVITS(torch.nn.Module):
    # ... (copy the class definition from the original code)
    pass

# Create and load the multi-speaker model
ms_model = MultiSpeakerVITS(base_model, n_speakers=10)
ms_model.load_state_dict(checkpoint["model_state"])
ms_model.eval()

# Example usage
text = "Sannu, ina kwana?"  # "Hello, how are you?" in Hausa
inputs = tokenizer(text, return_tensors="pt")
speaker_id = torch.tensor([0])  # Choose speaker 0-9

with torch.no_grad():
    output = ms_model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        speaker_ids=speaker_id
    )
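
If the wrapper returns the base model's output object, the synthesized audio is in its waveform field, sampled at base_model.config.sampling_rate (16 kHz for MMS-TTS). Under that assumption, saving the result to disk looks like this:

import scipy.io.wavfile

waveform = output.waveform[0].cpu().numpy()   # assumes a VitsModel-style output
scipy.io.wavfile.write(
    "hausa_tts.wav",
    rate=base_model.config.sampling_rate,     # 16000 for MMS-TTS
    data=waveform,
)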

Training

This is a template model: its speaker embeddings are newly initialized, not trained. To use it effectively, you'll need to:

  1. Prepare a speaker-labeled dataset: Each audio sample should be labeled with a speaker ID (0 to 9)
  2. Fine-tune on multi-speaker Hausa data: Train the speaker embeddings and, optionally, fine-tune the base model
  3. Implement a training loop: Feed both the text inputs and the speaker IDs to the model (a minimal sketch follows this list)
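
As a point of reference, here is a bare-bones loop under stated assumptions: dataloader is a hypothetical DataLoader yielding batches with "input_ids", "attention_mask", "speaker_id", and target "waveform" fields, and tts_loss is a hypothetical stand-in for the VITS training objectives (reconstruction, KL, duration, and adversarial terms), none of which ship with this repository.

import torch

optimizer = torch.optim.AdamW(ms_model.parameters(), lr=2e-4)
ms_model.train()

for batch in dataloader:                        # hypothetical DataLoader
    optimizer.zero_grad()
    output = ms_model(
        input_ids=batch["input_ids"],
        attention_mask=batch.get("attention_mask"),
        speaker_ids=batch["speaker_id"],
    )
    loss = tts_loss(output, batch["waveform"])  # hypothetical VITS loss helper
    loss.backward()
    optimizer.step()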

Files

  • multispeaker_vits_template.pth: PyTorch checkpoint containing model weights
  • config.json: Model configuration and metadata
  • README.md: This documentation
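
A quick sanity check after downloading the files; the "model_state" key is the one used in the usage snippet above, and any other keys in the checkpoint are not documented here.

import json
import torch

ckpt = torch.load("multispeaker_vits_template.pth", map_location="cpu")
print(list(ckpt.keys()))         # expect at least "model_state"

with open("config.json") as f:
    print(json.load(f))          # model configuration and metadata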

Citation

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Conneau, Alexis and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

License

This model is based on Meta's MMS-TTS checkpoints, which are released under the CC-BY-NC 4.0 license, and follows the same licensing terms.
