
In the name of God, the Compassionate, the Merciful - it is the key to the door of the treasure of the Wise.

Matcha-TTS for the Persian Language

This recipe is for training Persian/English TTS models (the middle part of the TTS pipeline: converting IPA phonemes to mel-spectrograms).

The main repository is shivammehta25/Matcha-TTS (cloned below).

To do this, you probably need a graphics card with 8 GB of VRAM for 12 hours or more, supporting PyTorch (or use Google Colab, a Kaggle notebook, etc.).

Remember that a classic TTS system consists of these parts:

  1. Vowelizer: converts text to IPA (International Phonetic Alphabet) phonemes. Most of the reading errors (WER) come from this part. The eSpeak library, via espeak-ng or piper_phonemize, is usually used here; a phonemization sketch follows this list.
  2. TTS model: converts IPA phonemes to mel-spectrograms.
  3. Vocoder: converts mel-spectrograms to audio. HiFi-GAN is usually used and gives natural results.
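
For orientation, here is stage 1 in isolation: a minimal sketch using piper_phonemize (installed later in this recipe); the sample sentence is only an illustration.

import piper_phonemize

# Persian text -> IPA phonemes; phonemize_espeak returns one phoneme list per sentence
ipa = "".join(piper_phonemize.phonemize_espeak(text="سلام دنیا", voice="fa")[0])
print(ipa)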

Setup environment

sudo apt-get install python3.10-venv
python3.10 -m venv matcha-tts-env
source matcha-tts-env/bin/activate

Install requirements

git clone git@github.com:shivammehta25/Matcha-TTS.git --depth 1
cd Matcha-TTS
pip install -e .

Prepare the dataset

The dataset structure should be like the LJ Speech layout: a wav folder plus a metadata.csv.

Split metadata.csv (the dataset transcript in LJ Speech format) into train.txt, val.txt and test.txt using split_metadata_csv.py:

import random

# File paths
input_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/metadata.csv"
wav_folder = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/wav"
train_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt"
validation_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt"
test_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/test.txt"

# Read the file as raw text
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Transform the format
transformed_lines = []
for line in lines:
    line = line.strip()
    if not line:  # skip blank lines (e.g., a trailing newline)
        continue
    file_id, text = line.split("|", 1)  # split on the first "|"
    transformed_line = f"{wav_folder}/{file_id}.wav|{text}"
    transformed_lines.append(transformed_line)

# Shuffle the data
random.shuffle(transformed_lines)

# Calculate split sizes
total_lines = len(transformed_lines)
train_size = int(0.95 * total_lines)
validation_size = int(0.045 * total_lines)
test_size = total_lines - train_size - validation_size

# Split the data
train_data = transformed_lines[:train_size]
validation_data = transformed_lines[train_size:train_size + validation_size]
test_data = transformed_lines[train_size + validation_size:]

# Save to files
with open(train_file, "w", encoding="utf-8") as f:
    f.write("\n".join(train_data))

with open(validation_file, "w", encoding="utf-8") as f:
    f.write("\n".join(validation_data))

with open(test_file, "w", encoding="utf-8") as f:
    f.write("\n".join(test_data))

print("Data split and saved successfully!")
print(f"Train: {len(train_data)} lines")
print(f"Validation: {len(validation_data)} lines")
print(f"Test: {len(test_data)} lines")
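
Run the script once to produce the three filelists (adjust the paths at the top first):

python split_metadata_csv.py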

Create the configuration files

Copy configs/data/ljspeech.yaml to configs/data/custom.yaml and edit it.

Copy configs/experiment/ljspeech.yaml to configs/experiment/custom.yaml and edit it.

Inside configs/data/custom.yaml, change:

train_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt

valid_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt

Generate normalisation statistics using the dataset configuration yaml file:

./matcha-tts-env/bin/matcha-data-stats -i custom.yaml -f

Output: {'mel_mean': -7.081411, 'mel_std': 3.500973}
Update these values in configs/data/custom.yaml under the data_statistics key.
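
For reference, after all the edits in this recipe (including the Persian cleaner configured later), the keys touched in configs/data/custom.yaml look roughly like this; every other key keeps the value copied from ljspeech.yaml:

train_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt
valid_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt
batch_size: 14                       # reduced below to fit 8 GB of VRAM
cleaners: [persian_cleaners_piper]   # defined later in cleaners.py
data_statistics:
  mel_mean: -7.081411
  mel_std: 3.500973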

Note: if the sample rate is 12 kHz, the tool warns you to reduce n_mels, but HiFi-GAN gives an error when trying to train such a vocoder, so don't touch n_mels.

Manage VRAM usage

For a minimum-memory (8 GB) card, reduce batch_size in configs/data/custom.yaml: batch_size: 14

(Not needed for me with 8 GB of VRAM.) For minimal memory usage, add the following to configs/experiment/custom.yaml:

model:
  out_size: 172

Set initial checkpoint

Set the initial checkpoint (or null) via ckpt_path in configs/train.yaml.
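
For example (a sketch of the relevant line; the warm-start checkpoint is illustrative):

ckpt_path: null  # or a checkpoint to fine-tune from, e.g. matcha_ljspeech.ckpt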

Changes needed in the code for Persian

  1. In matcha/text/cleaners.py, in the phonemizer.backend.EspeakBackend part, set language="fa".
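
After that change, the call looks roughly like this (a sketch; the other arguments follow the phonemizer API and may differ slightly in your checkout):

global_phonemizer = phonemizer.backend.EspeakBackend(
    language="fa",  # was "en-us"
    preserve_punctuation=True,
    with_stress=True,
)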

  2. Run:

pip install piper-phonemize
  3. In cleaners.py, add the following below english_cleaners_piper:

import piper_phonemize

def persian_cleaners_piper(text):
    """Pipeline for Persian text: abbreviation expansion + punctuation + stress."""
    # text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_abbreviations(text)
    phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="fa")[0])
    phonemes = collapse_whitespace(phonemes)

    # Remove unwanted symbols (e.g., '1')
    unwanted_symbols = {'1', '-'}  # add any other unwanted symbols here
    filtered_phonemes = "".join(char for char in phonemes if char not in unwanted_symbols)

    return filtered_phonemes
  4. Also set the cleaner in configs/data/custom.yaml:

cleaners: [persian_cleaners_piper]

  5. Replace the contents of matcha/text/symbols.py with:
def read_tokens():
    tokens = []
    with open("/home/oem/Basir/TTS/Matcha/Matcha-TTS/configs/tokens/tokens_sherpa_with_fa.txt", "r", encoding="utf-8") as f:
        for line in f:
            # Remove the newline character at the end
            line = line.rstrip("\n")
            # Split into token and number, preserving whitespace
            if " " in line:
                token = line[:line.index(" ")]  # Extract everything before the first space
                if len(token) == 0: # White-space
                    token = ' '
            else:
                token = line  # If there's no space, the entire line is the token
            tokens.append(token)
    return tokens

symbols = read_tokens()

Replace the tokens_sherpa_with_fa path with your own.
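
The tokens file is a kaldi-style symbol table: each line is a token, a space, and an integer id, and a line that begins with a space encodes the whitespace token itself. A hypothetical first few lines:

_ 0
a 1
b 2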

  6. In matcha/cli.py, change this line to:

    intersperse(text_to_sequence(text, ["persian_cleaners_piper"])[0], 0),

Other changes

  1. To guard against possible errors (due to Python updates), change save_figure_to_numpy in matcha/utils.py to:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import io

def save_figure_to_numpy(fig):
    buf = io.BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight', pad_inches=0)
    buf.seek(0)
    img = Image.open(buf)
    data = np.array(img)
    buf.close()
    
    return data
  2. To be able to test custom vocoders from the command line, and to test models trained at sample rates other than 22050 Hz, I made further changes to cli.py.

You can find them in the attached files.

Train!

Run the training script:

python matcha/train.py experiment=custom

Monitor using TensorBoard

Go to another bash window and run:

source matcha-tts-env/bin/activate
cd /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/tensorboard/version_0/
tensorboard --logdir=. --bind_all --port=6007

These commands might also be useful; run them from separate windows:

watch -n 1 nvidia-smi # to watch VRAM usage
xset dpms force off # to turn off the monitor

Test

matcha-tts --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/matcha_ljspeech.ckpt --vocoder hifigan_univ_v1 [or hifigan_T2_v1]
matcha-tts --cpu --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --sample_rate 24000 --vocoder hifigan_univ_v1
matcha-tts --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-24000-motahare.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000

matcha-tts --cpu --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt  --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000

Note: remember that the default cleaner used by the above commands is set in matcha/cli.py.

Note: even the default "--denoiser_strength 0.00025000" harms quality; use "--denoiser_strength 0.000001". If there is noise, don't try to suppress it; solve the underlying problem! The noise might come from a text-audio mismatch, a bad vocoder (for example, using hifigan_T2_v1 for a male voice, or an under-trained HiFi-GAN vocoder), noise in the dataset itself, or not training the Matcha model long enough.

Convert to ONNX

pip install onnx
python3 -m matcha.onnx.export matcha.ckpt model-5.onnx --n-timesteps 5
python3 -m matcha.onnx.export /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah.ckpt /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah-2.onnx --n-timesteps 2

Remember that the higher the number of timesteps, the higher the processing time. Even n-timesteps == 1 gives good results.

Add metadata for sherpa

pip install tokenizer

Edit and run add_sherpa_metadata_to_matcha.py:

#!/usr/bin/env python3

import json
import os
from typing import Any, Dict
import onnx


def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    """Add meta data to an ONNX model. It is changed in-place.

    Args:
      filename:
        Filename of the ONNX model to be changed.
      meta_data:
        Key-value pairs.
    """
    model = onnx.load(filename)
    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

def main():
    # Caution: Please change the filename
    filename = "/home/oem/Basir/TTS/Matcha/Trained/onnx/matcha-fa_en-musa-12000-5.onnx"

    print("add model metadata")
    meta_data = {
        "model_type": "matcha-tts",
        "language": "Persian+English",
        "voice": "fa",
        "has_espeak": 1,
        "jieba": 0,
        "n_speakers": 1,
        "sample_rate": 12000,
        "version": 1,
        "pad_id": 0,
        "use_icefall": 0,
        "model_author": "Ali Mahmoudi (@mah92)",
        "maintainer": "k2-fsa",
        "use_eos_bos": 0,
        "num_ode_steps": 5,
        "dataset": "Musa-FA_EN-Public-Phone-Audio-Dataset",
        "dataset_url": "https://huggingface.co/datasets/mah92/Musa-FA_EN-Public-Phone-Audio-Dataset",
        "see_also": "https://github.com/k2-fsa/sherpa-onnx/issues/1779",
    }
    print(meta_data)
    add_meta_data(filename, meta_data)


if __name__ == "__main__":
    main()

Note: num_ode_steps in the sherpa metadata corresponds to the --n-timesteps value used when converting to ONNX.
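
To verify that the metadata was written, read it back with the same onnx package (a quick sketch):

import onnx

model = onnx.load("/home/oem/Basir/TTS/Matcha/Trained/onnx/matcha-fa_en-musa-12000-5.onnx")
for prop in model.metadata_props:
    print(prop.key, "=", prop.value)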

Test ONNX

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx
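
If inference misbehaves, it helps to inspect the exported model's inputs and outputs first (a sketch, assuming onnxruntime is installed; the filename matches the export above):

import onnxruntime as ort

sess = ort.InferenceSession("model-5.onnx")
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape)
for out in sess.get_outputs():
    print("output:", out.name, out.shape)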

Contribute your model

Upload your model to Hugging Face and open an issue in the sherpa-onnx GitHub repo. They will add your model.

Attention: sherpa-onnx uses the T1 HiFi-GAN vocoder, which is trained on a single female voice. It gets noisy for male voices and high-pitched sounds. Use vocoders from here instead.

Credits

Special thanks to Masoud Azizi (@Mablue ), Amirreza Ramezani (@brightening-eyes ), and Dr. Hamid Jafari (Khaneh Noor Iranian Basir).

Special thanks to people from @ttsfarsi telegram channel.

I should also thank @csukuangfj from Xiaomi Corporation for his help and care in the icefall and sherpa-onnx repos.

And we are nothing except through the mercy of our Lord.
