Update README.md
README.md CHANGED
@@ -32,21 +32,21 @@ datasets:

## Model description

-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and
-
-
+**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and their averaged acoustic features.
+The decoder backbone is essentially a U-Net inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), built on a Transformer architecture. By replacing 2D CNNs with 1D CNNs,
+a large reduction in memory consumption and faster synthesis are achieved.

-**Matcha-TTS** is non-autorregressive model
-This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.
+**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
+This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.

## Intended uses and limitations

This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
-It has been finetuned using a Catalan phonemizer, therefore if the model is used
-into a speech waveform.
+It has been fine-tuned using a Catalan phonemizer; therefore, if the model is used for other languages it may not produce intelligible samples after mapping
+its output into a speech waveform.

The quality of the samples can vary depending on the speaker.
-This may be due to the sensitivity of the model in learning specific frequencies and also due to the samples
+This may be due to the sensitivity of the model in learning specific frequencies, and also due to the quality of the samples for each speaker.

## How to use

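For readers unfamiliar with the OT-CFM objective named in the updated model description, the following is a minimal, generic sketch of conditional flow matching in PyTorch. It is illustrative only, not the Matcha-TTS training code; the `decoder(x_t, t, cond)` signature and the `mel` and `cond` tensors are assumed placeholders.

```python
import torch

def ot_cfm_loss(decoder, mel, cond, sigma_min: float = 1e-4):
    """Illustrative optimal-transport conditional flow matching (OT-CFM) loss.

    decoder: network predicting a velocity field v_theta(x_t, t, cond) (placeholder)
    mel:     target mel-spectrogram batch, shape (batch, n_mels, frames)
    cond:    conditioning produced by the text encoder (placeholder)
    """
    x1 = mel                                             # data sample
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # random time in [0, 1]

    # Straight-line (optimal-transport) path from noise to data ...
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # ... whose constant velocity is the regression target.
    ut = x1 - (1 - sigma_min) * x0

    vt = decoder(xt, t.view(-1), cond)                   # predicted velocity
    return torch.mean((vt - ut) ** 2)
```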
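"Fewer synthesis steps than models trained using score matching" then means integrating the learned ODE with a coarse fixed-step solver at inference. Below is a hedged sketch using a plain Euler solver; the `decoder` interface and the default step count are assumptions, not the repository's API.

```python
import torch

@torch.no_grad()
def sample_mel(decoder, cond, shape, n_steps: int = 10):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1 (mel) with Euler steps."""
    x = torch.randn(shape)                               # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for i in range(n_steps):
        t = ts[i].expand(shape[0])                       # same time value for the whole batch
        dt = ts[i + 1] - ts[i]
        x = x + dt * decoder(x, t, cond)                 # one Euler step along the velocity field
    return x                                             # approximate mel-spectrogram
```

Each step costs one decoder evaluation, so a small `n_steps` is where the speed advantage over score-matching samplers comes from.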
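Since the model only produces acoustic features, a full Catalan TTS pipeline also needs a phonemizer front end and a neural vocoder to map the predicted mel-spectrogram into a waveform. The sketch below shows that wiring with hypothetical placeholders (`phonemize_catalan`, `acoustic_model` and `vocoder` are not the real API; see the "How to use" section for the actual instructions).

```python
import torch

def text_to_speech(text, phonemize_catalan, acoustic_model, vocoder, speaker_id=0):
    """Hypothetical end-to-end wiring: text -> phonemes -> mel-spectrogram -> waveform."""
    phonemes = phonemize_catalan(text)                      # Catalan phonemizer front end
    with torch.no_grad():
        mel = acoustic_model(phonemes, speaker=speaker_id)  # this model: acoustic features only
        wav = vocoder(mel)                                  # e.g. a neural vocoder -> waveform
    return wav
```

Swapping in a phonemizer for another language is exactly the mismatch the limitations paragraph warns about, since the acoustic model was fine-tuned on Catalan phoneme sequences.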