Update README.md
README.md
CHANGED
@@ -32,21 +32,21 @@ datasets:
## Model description
-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and
-
-
-**Matcha-TTS** is non-autorregressive model
-This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.
## Intended uses and limitations
This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
-It has been finetuned using a Catalan phonemizer, therefore if the model is used
-into a speech waveform.
The quality of the samples can vary depending on the speaker.
-This may be due to the sensitivity of the model in learning specific frequencies and also due to the samples
## How to use
## Model description
+**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and their averaged acoustic features.
+The decoder backbone is essentially a U-Net inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf) that incorporates Transformer blocks. Replacing the 2D CNNs of Grad-TTS with 1D CNNs
+substantially reduces memory consumption and speeds up synthesis.
+**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
+This yields an ODE-based decoder capable of generating high-quality output in fewer synthesis steps than models trained using score matching.
## Intended uses and limitations
This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
+It has been finetuned using a Catalan phonemizer; therefore, if the model is used for other languages, it will probably not produce intelligible samples once
+its output is mapped into a speech waveform.
The quality of the samples can vary depending on the speaker.
+This may be due to the model's sensitivity when learning specific frequencies, and also to the quality of the samples available for each speaker.
## How to use
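The updated model description states that the encoder predicts phoneme durations together with averaged acoustic features for each phoneme. One way to turn these into a frame-level input for the decoder, as in Grad-TTS-style models, is to repeat each phoneme's feature vector for as many frames as it lasts. The sketch below shows only that upsampling step in NumPy; `expand_by_durations` is a hypothetical helper, not a Matcha-TTS function, and it assumes durations have already been converted to non-negative integer frame counts.

```python
import numpy as np


def expand_by_durations(mu: np.ndarray, durations) -> np.ndarray:
    """Upsample phoneme-level features to frame level (hypothetical helper).

    mu:        (num_phonemes, n_feats) averaged acoustic features per phoneme
    durations: (num_phonemes,) non-negative integer frame counts
    returns:   (sum(durations), n_feats) frame-level features
    """
    durations = np.asarray(durations)
    if mu.shape[0] != durations.shape[0]:
        raise ValueError("expected one duration per phoneme")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    # Repeat row i of mu durations[i] times along the time axis.
    return np.repeat(mu, durations.astype(int), axis=0)


# Toy input: 3 phonemes with 2-dimensional features lasting 2, 1 and 3 frames.
mu = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
durations = np.array([2, 1, 3])
frames = expand_by_durations(mu, durations)
print(frames.shape)  # (6, 2)
```

A phoneme with a duration of 0 disappears from the output entirely, so rounding predicted durations up rather than down is a sensible choice before this step.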
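The description also says the decoder is trained with OT-CFM, which yields an ODE that can be solved in few synthesis steps. Below is a minimal NumPy sketch of the standard OT-CFM ingredients (straight conditional path, regression target, squared-error loss) and of fixed-step Euler sampling of dx/dt = v(x, t) from noise at t = 0 to data at t = 1. Assumptions: `x1` is a random array standing in for a target mel-spectrogram, `SIGMA_MIN = 1e-4` is an illustrative value, and the trained decoder (which in Matcha-TTS is conditioned on the encoder output) is replaced by the exact conditional vector field for a single known target. All function names are hypothetical.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumption: illustrative minimum noise level


def ot_path(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the straight OT path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1


def ot_target(x0, x1, sigma_min=SIGMA_MIN):
    """OT-CFM regression target: the path's velocity, constant in t."""
    return x1 - (1.0 - sigma_min) * x0


def cfm_loss(v_pred, x0, x1, sigma_min=SIGMA_MIN):
    """Mean squared error between a predicted vector field and the OT-CFM target."""
    return float(np.mean((v_pred - ot_target(x0, x1, sigma_min)) ** 2))


def exact_field(x, t, x1, sigma_min=SIGMA_MIN):
    """Conditional field u_t(x | x1); stands in for a trained decoder in this toy."""
    return (x1 - (1.0 - sigma_min) * x) / (1.0 - (1.0 - sigma_min) * t)


def euler_sample(field, x0, n_steps):
    """Solve dx/dt = field(x, t) from t=0 to t=1 with n_steps fixed Euler steps."""
    x, h = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * field(x, k * h)
    return x


rng = np.random.default_rng(0)
x1 = rng.normal(size=(80, 10))  # stand-in target: 80 mel bins x 10 frames
x0 = rng.normal(size=x1.shape)  # Gaussian starting noise

# Training view: pick t, build x_t on the path, compare a field with the target.
t = 0.3
x_t = ot_path(x0, x1, t)
print("loss of exact field:", cfm_loss(exact_field(x_t, t, x1), x0, x1))

# Sampling view: a few Euler steps already reach the path's endpoint.
end = ot_path(x0, x1, 1.0)
for n_steps in (1, 2, 10):
    out = euler_sample(lambda x, s: exact_field(x, s, x1), x0, n_steps)
    print(n_steps, "steps, max error:", np.abs(out - end).max())
```

Because each OT conditional path is a straight line traversed at constant speed, Euler integration follows it exactly even with a single step, up to floating-point error. A trained network only approximates the marginal field, whose trajectories are not perfectly straight, so real synthesis still uses several steps; the toy only illustrates why straight paths make few-step sampling plausible.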