
Introduction

MeloTTS Vietnamese is a version of MeloTTS optimized for the Vietnamese language. This version inherits the high-quality characteristics of the original model but has been specially adjusted to work well with the Vietnamese language.

Technical Features

  • Uses underthesea for Vietnamese text segmentation
  • Integrates PhoBert (vinai/phobert-base-v2) to extract Vietnamese language features
  • Fully supports Vietnamese language characteristics:
    • 45 symbols (phonemes)
    • 8 tones (7 tonal marks and 1 unmarked tone)
    • All defined in melo/text/symbols.py
  • Text-to-phoneme conversion source:

Fine-tuning from Base Model

This model was fine-tuned from the base MeloTTS model by:

  • Replacing phonemes used by neither English nor Vietnamese with Vietnamese phonemes
  • Specifically, mapping Korean phonemes to corresponding Vietnamese phonemes
  • Adjusting parameters to match Vietnamese phonetic characteristics
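One way to picture this replacement: each unused (e.g. Korean) symbol keeps its position in the symbol list, so the base checkpoint's embedding weight at that slot becomes the starting point for the new Vietnamese phoneme. A hedged sketch with made-up symbols (the real mapping is in the fine-tuning code, not shown here):

```python
# Sketch: reassign symbol slots of phonemes absent from English/Vietnamese.
# All symbol names here are illustrative placeholders.
base_symbols = ["_", "a", "t", "ㅏ", "ㅓ", "ㄴ"]            # base model inventory
korean_to_vietnamese = {"ㅏ": "aw", "ㅓ": "ow", "ㄴ": "nh"}  # hypothetical mapping

new_symbols = [korean_to_vietnamese.get(s, s) for s in base_symbols]
# Each replaced symbol keeps its old index, so the base model's embedding
# weights at that slot are reused when fine-tuning on Vietnamese data.
```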

Training Data

  • The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
  • Note on data quality: This dataset has several limitations, including poor voice quality, missing punctuation, and inaccurate phonetic transcriptions. Training on higher-quality internal data produced much better results.

Downloading the Model

The pre-trained model can be downloaded from Hugging Face:

Usage Guide

Data Preparation

The data preparation process is detailed in docs/training.md. Basically, you need:

  • Audio files (recommended to use 44100Hz format)
  • Metadata file with the format:
    path/to/audio_001.wav |<speaker_name>|<language_code>|<text_001>
    path/to/audio_002.wav |<speaker_name>|<language_code>|<text_002>
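As a sketch, a metadata list in the format above can be generated from (audio path, transcript) pairs; the speaker name "speaker0" and language code "VI" are example values, not requirements:

```python
# Sketch: write a MeloTTS-style metadata list.
# "speaker0" and "VI" are example values; substitute your own speaker name
# and language code.
pairs = [
    ("path/to/audio_001.wav", "xin chào"),   # "hello"
    ("path/to/audio_002.wav", "tạm biệt"),   # "goodbye"
]

lines = [f"{wav} |speaker0|VI|{text}" for wav, text in pairs]
with open("text_training.list", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```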
    

Data Preprocessing

To process data, use the command:

python melo/preprocess_text.py --metadata /path/to/text_training.list --config_path /path/to/config.json --device cuda:0 --val-per-spk 10 --max-val-total 500

or use the script melo/preprocess_text.sh with appropriate parameters.
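The --val-per-spk and --max-val-total flags cap the validation split per speaker and overall. A simplified per-speaker split conveying the same idea (a sketch of the intent, not preprocess_text.py's actual code):

```python
# Simplified sketch of a per-speaker train/val split, mimicking the intent of
# --val-per-spk 10 --max-val-total 500 (not the actual preprocess_text.py logic).
def split_metadata(lines, val_per_spk=10, max_val_total=500):
    val, train, per_spk = [], [], {}
    for line in lines:
        speaker = line.split("|")[1]
        if per_spk.get(speaker, 0) < val_per_spk and len(val) < max_val_total:
            val.append(line)
            per_spk[speaker] = per_spk.get(speaker, 0) + 1
        else:
            train.append(line)
    return train, val
```

For example, 15 utterances from one speaker would yield 10 validation and 5 training entries with the defaults above.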

Using the Model

Refer to the notebook test_infer.ipynb to learn how to use the model:

# colab_infer.py
from melo.api import TTS

# Speed is adjustable
speed = 1.0

# Set the device manually: 'cpu', 'cuda', 'cuda:0', or 'mps'.
# CPU is sufficient for real-time inference.
device = "cuda:0"

# Vietnamese
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter your text here"
output_path = "output.wav"
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)

Audio Examples

Listen to sample outputs from the model:

Sample Audio

License

This project is released under the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.

Acknowledgements

This implementation is based on TTS, VITS, VITS2 and Bert-VITS2. We appreciate their awesome work.
