---
license: mit
datasets:
- doof-ferb/infore1_25hours
---
 

## Introduction

MeloTTS Vietnamese is a version of [MeloTTS](https://github.com/myshell-ai/MeloTTS) adapted for the Vietnamese language. It inherits the high-quality synthesis of the original model while being specifically adjusted for Vietnamese.

## Technical Features

- Uses [underthesea](https://github.com/undertheseanlp/underthesea) for Vietnamese text segmentation
- Integrates [PhoBERT](https://github.com/VinAIResearch/PhoBERT) (`vinai/phobert-base-v2`) to extract Vietnamese language features (a standalone sketch of this text front end appears at the end of this card)
- Fully supports Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in `melo/text/symbols.py`
- Text-to-phoneme conversion:
  - Based on the [Text2PhonemeSequence](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) library
  - An improved version with higher performance has been developed at [Text2PhonemeFast](https://github.com/manhcuong02/Text2PhonemeFast)

## Fine-tuning from the Base Model

This model was fine-tuned from the base MeloTTS model by:

- Replacing phonemes used by neither English nor Vietnamese with Vietnamese phonemes; specifically, Korean phonemes were replaced with corresponding Vietnamese phonemes
- Adjusting parameters to match Vietnamese phonetic characteristics

## Training Data

- The model was trained on the [Infore dataset](https://huggingface.co/datasets/doof-ferb/infore1_25hours), consisting of approximately 25 hours of speech
- Note on data quality: this dataset has several limitations, including poor voice quality, missing punctuation, and inaccurate transcriptions. Training on internal, higher-quality data produced much better results.

## Downloading the Model

The pre-trained model can be downloaded from Hugging Face: [MeloTTS Vietnamese on Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese)

## Usage Guide

### Data Preparation

The data preparation process is detailed in `docs/training.md`. In short, you need:

- Audio files (44100 Hz is recommended)
- A metadata file, one utterance per line, in the format:

```
path/to/audio_001.wav |||
path/to/audio_002.wav |||
```

The fields after each audio path are pipe-separated; see `docs/training.md` for the exact layout. A resampling and metadata-writing sketch appears at the end of this card.

### Data Preprocessing

To preprocess the data, run:

```bash
python melo/preprocess_text.py \
    --metadata /path/to/text_training.list \
    --config_path /path/to/config.json \
    --device cuda:0 \
    --val-per-spk 10 \
    --max-val-total 500
```

or use the script `melo/preprocess_text.sh` with the appropriate parameters.

### Using the Model

Refer to the notebook `test_infer.ipynb` to learn how to use the model:

```python
# colab_infer.py
from melo.api import TTS

# Speed is adjustable
speed = 1.0

# CPU is sufficient for real-time inference.
# Set the device manually to 'cpu', 'cuda', 'cuda:0', or 'mps',
# or use 'auto' to pick a GPU automatically when one is available.
device = "cuda:0"

# Vietnamese model
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter your text here"
output_path = "output.wav"
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
```

## Audio Examples

Listen to the sample outputs embedded on the model page.

## License

This project is released under the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.

## Acknowledgements

This implementation is based on [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.
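As referenced in the Technical Features section, the Vietnamese text front end can be exercised on its own. The sketch below (assuming `underthesea`, `transformers`, and `torch` are installed) segments raw text with underthesea and extracts contextual features with PhoBERT; it is an illustration of the components involved, not the model's internal code path.

```python
# Sketch: Vietnamese text front end (illustrative; not the model's internal pipeline).
# Requires: pip install underthesea transformers torch
import torch
from underthesea import word_tokenize
from transformers import AutoModel, AutoTokenizer

text = "Nhập văn bản tại đây"  # "Enter your text here"

# 1) Word segmentation: PhoBERT expects word-segmented input,
#    with multi-syllable words joined by underscores.
segmented = word_tokenize(text, format="text")
print(segmented)  # e.g. "Nhập văn_bản tại đây"

# 2) Contextual feature extraction with PhoBERT.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")

inputs = tokenizer(segmented, return_tensors="pt")
with torch.no_grad():
    features = phobert(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
print(features.shape)
```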
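Likewise, for the Data Preparation step, resampling to 44100 Hz and writing the metadata list can be scripted. The sketch below is a hypothetical helper using `librosa` and `soundfile`; the paths, speaker name, and transcripts are placeholders, and the pipe-separated field order follows the base MeloTTS metadata layout, so verify it against `docs/training.md` before use.

```python
# Sketch: resample audio to 44100 Hz and write a MeloTTS-style metadata list.
# All paths, the speaker name, and the transcripts are placeholders.
# Requires: pip install librosa soundfile
import os

import librosa
import soundfile as sf

TARGET_SR = 44100          # recommended sample rate from the Data Preparation section
SPEAKER = "speaker_name"   # placeholder speaker ID
LANG = "VI"

# (source wav, transcript) pairs; replace with your own data loading.
samples = [
    ("raw/audio_001.wav", "Nhập văn bản tại đây"),
]

os.makedirs("wavs", exist_ok=True)
with open("text_training.list", "w", encoding="utf-8") as meta:
    for src_path, transcript in samples:
        audio, _ = librosa.load(src_path, sr=TARGET_SR)  # load and resample
        out_path = os.path.join("wavs", os.path.basename(src_path))
        sf.write(out_path, audio, TARGET_SR)
        # One line per utterance: path|speaker|language|text (base MeloTTS layout).
        meta.write(f"{out_path}|{SPEAKER}|{LANG}|{transcript}\n")
```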