projecte-aina/matxa-tts-cat-multispeaker

acoustic modelling


AlexK-PL committed on Apr 8, 2024

Commit 5c145f9 · verified · 1 Parent(s): d79d134

Update README.md

Files changed (1)

README.md +23 -5


@@ -32,9 +32,10 @@ datasets:
 ## Model description
-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and their averaged acoustic features.
-The decoder backbone is essentially a U-Net inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf) based on Transformers architecture. By replacing 2D CNNs by 1D CNNs,
-a large reduction in memory consumption and fast synthesis is achieved.
 **Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
 This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
@@ -52,10 +53,20 @@ This may be due to the sensitivity of the model in learning specific frequencies
 ### Installation
 ```bash
-pip install git+https://github.com/langtech-bsc/vocos.git@matcha
 ```
-You need to install the Catalan phonemizer version of espeak-ng:
 ```bash
 git clone https://github.com/projecte-aina/espeak-ng.git
@@ -72,6 +83,13 @@ pip install mecab-python3
 pip install unidic-lite
 ```
 ### Generate

 ## Model description
+**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
+The encoder consists of a text encoder and a phoneme duration predictor that together predict averaged acoustic features.
+The decoder is essentially a U-Net backbone, inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), that is based on the Transformer architecture.
+In the decoder, replacing 2D CNNs with 1D CNNs greatly reduces memory consumption and speeds up synthesis.
 **Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
 This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
 ### Installation
+This model has been trained using the espeak-ng open-source text-to-speech software.
+The espeak-ng version that includes the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng).
+Create a virtual environment:
+```bash
+python -m venv /path/to/venv
+```
 ```bash
+source /path/to/venv/bin/activate
 ```
+For training and inference with Catalan Matcha-TTS, you need to compile the provided espeak-ng with the Catalan phonemizer:
 ```bash
 git clone https://github.com/projecte-aina/espeak-ng.git
 pip install unidic-lite
 ```
+Install the repository:
+```bash
+pip install git+https://github.com/langtech-bsc/Matcha-TTS.git@dev-cat
+```
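The environment and package steps above can be collected into a few shell helpers. This is a minimal sketch, not part of the commit: the function names `setup_venv`, `install_matcha`, and `check_espeak` and the default `VENV_DIR` location are hypothetical, and `check_espeak` only confirms that some `espeak-ng` binary is on `PATH`, not that it is the Catalan build.

```bash
# Hypothetical default location; the README only gives /path/to/venv.
VENV_DIR="${VENV_DIR:-./matxa-venv}"

# Create the virtual environment if it is missing, then activate it
# in the current shell.
setup_venv() {
    if [ ! -f "$VENV_DIR/bin/activate" ]; then
        python3 -m venv "$VENV_DIR" || return 1
    fi
    . "$VENV_DIR/bin/activate"
}

# Install the Matcha-TTS branch named in the README into the active
# environment. Requires git and network access.
install_matcha() {
    python -m pip install "git+https://github.com/langtech-bsc/Matcha-TTS.git@dev-cat"
}

# Succeed if an espeak-ng binary is on PATH. This does not verify that
# the binary was built from the projecte-aina repository.
check_espeak() {
    command -v espeak-ng >/dev/null 2>&1
}
```

After sourcing these definitions, a typical call would be `setup_venv && install_matcha`, followed by `check_espeak || echo "espeak-ng not found"` once espeak-ng has been compiled.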
 ### Generate