Introduction
MeloTTS Vietnamese is a version of MeloTTS optimized for the Vietnamese language. It inherits the high quality of the original model but has been specifically adapted to handle Vietnamese text and phonology.
Technical Features
- Uses underthesea for Vietnamese text segmentation
- Integrates PhoBERT (vinai/phobert-base-v2) to extract Vietnamese language features
- Fully supports Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in melo/text/symbols.py
- Text-to-phoneme conversion:
  - Based on the Text2PhonemeSequence library (a usage sketch follows this list)
  - An improved version with higher performance is available as Text2PhonemeFast
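Both steps can be tried outside the TTS pipeline. The snippet below is a minimal sketch, assuming the Text2PhonemeSequence class and infer_sentence method as published with XPhoneBERT; the language code "vie-n" (Northern Vietnamese) is also an assumption, so check the library's documentation for the exact API:

# Segment Vietnamese text with underthesea, then map it to phonemes
from underthesea import word_tokenize
from text2phonemesequence import Text2PhonemeSequence

# "vie-n" = Northern Vietnamese (assumed language code)
t2p = Text2PhonemeSequence(language="vie-n", is_cuda=False)

text = "Xin chào Việt Nam"  # "Hello Vietnam"
segmented = word_tokenize(text, format="text")  # multi-word tokens joined by "_"
print(t2p.infer_sentence(segmented))  # phoneme sequence string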
Fine-tuning from Base Model
This model was fine-tuned from the base MeloTTS model by:
- Replacing phonemes that belong to neither English nor Vietnamese with Vietnamese phonemes
- Specifically, replacing the Korean phonemes with corresponding Vietnamese phonemes (a sketch follows this list)
- Adjusting parameters to match Vietnamese phonetic characteristics
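To illustrate the phoneme replacement, the sketch below rebuilds the symbol table by substituting Vietnamese phonemes for the unused Korean ones while keeping the table size fixed, so the base checkpoint's phoneme embedding matrix still loads. All symbols and counts here are made up for the example; the real tables live in melo/text/symbols.py:

# Hypothetical sketch: swap unused (Korean) symbols for Vietnamese ones,
# preserving the table length so embedding shapes stay compatible
base_symbols = ["_", "a", "b", "ᄀ", "ᄁ", "ᄂ"]  # illustrative base table
korean = {"ᄀ", "ᄁ", "ᄂ"}                        # symbols used by neither EN nor VI
vietnamese_new = ["ə", "ɨ", "ɤ"]                 # illustrative VI phonemes

replacement = iter(vietnamese_new)
new_symbols = [next(replacement) if s in korean else s for s in base_symbols]
assert len(new_symbols) == len(base_symbols)     # embedding shape preserved
print(new_symbols)  # ['_', 'a', 'b', 'ə', 'ɨ', 'ɤ']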
Training Data
- The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
- Note on data quality: this dataset has several limitations, including poor voice quality, missing punctuation, and inaccurate phonetic transcriptions. Training on internal data yielded much better results.
Downloading the Model
The pre-trained model can be downloaded from Hugging Face:
Usage Guide
Data Preparation
The data preparation process is detailed in docs/training.md. Basically, you need:
- Audio files (a 44100 Hz sample rate is recommended)
- A metadata file with the format below (a preparation sketch follows this list):

path/to/audio_001.wav|<speaker_name>|<language_code>|<text_001>
path/to/audio_002.wav|<speaker_name>|<language_code>|<text_002>
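As a rough sketch of this step, the helper below resamples a clip to 44100 Hz with librosa/soundfile and appends a matching metadata line. The function name, paths, and speaker/language values are placeholders, not part of the repository:

# Resample one utterance to 44100 Hz and record its metadata line
import librosa
import soundfile as sf

def add_utterance(src_path, dst_path, speaker, lang, text, metadata_file):
    # Load and resample to 44100 Hz in one step
    audio, _ = librosa.load(src_path, sr=44100)
    sf.write(dst_path, audio, 44100)
    # Append one line in the pipe-separated metadata format
    with open(metadata_file, "a", encoding="utf-8") as f:
        f.write(f"{dst_path}|{speaker}|{lang}|{text}\n")

add_utterance("raw/audio_001.wav", "data/audio_001.wav",
              "speaker_name", "VI", "Xin chào", "text_training.list")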
Data Preprocessing
To process data, use the command:
python melo/preprocess_text.py --metadata /path/to/text_training.list --config_path /path/to/config.json --device cuda:0 --val-per-spk 10 --max-val-total 500
or use the script melo/preprocess_text.sh with appropriate parameters.
Using the Model
Refer to the notebook test_infer.ipynb to learn how to use the model:
# colab_infer.py
from melo.api import TTS

# Speed is adjustable
speed = 1.0

# CPU is sufficient for real-time inference.
# Set the device manually to "cpu", "cuda", "cuda:0", or "mps"
device = "cuda:0"

# Vietnamese
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)

# Map speaker names to speaker IDs
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter your text here"
output_path = "output.wav"
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
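Building on the same API, you can also synthesize the text once per speaker in the checkpoint (the output file names below are illustrative):

# Render the sample text with every speaker defined in the checkpoint
for name, sid in speaker_ids.items():
    model.tts_to_file(text, sid, f"output_{name}.wav", speed=speed, quiet=True)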
Audio Examples
Listen to sample outputs from the model:
Sample Audio
License
This project follows the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.
Acknowledgements
This implementation is based on TTS, VITS, VITS2 and Bert-VITS2. We appreciate their awesome work.