Update README.md
README.md CHANGED
@@ -32,21 +32,21 @@ datasets:

## Model description

-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and
-
-
+**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and their averaged acoustic features.
+The decoder backbone is essentially a U-Net inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), built on a Transformer architecture. By replacing 2D CNNs with 1D CNNs,
+a large reduction in memory consumption and faster synthesis are achieved.

-**Matcha-TTS** is non-autorregressive model
-This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.
+**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
+This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.

## Intended uses and limitations

This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
-It has been finetuned using a Catalan phonemizer, therefore if the model is used
-into a speech waveform.
+It has been fine-tuned using a Catalan phonemizer; therefore, if the model is used for other languages it may not produce intelligible samples after mapping
+its output into a speech waveform.

The quality of the samples can vary depending on the speaker.
-This may be due to the sensitivity of the model in learning specific frequencies and also due to the samples
+This may be due to the sensitivity of the model in learning specific frequencies, and also due to the quality of the samples for each speaker.

## How to use

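For readers unfamiliar with the OT-CFM objective named in the updated model description, the following is a minimal, generic sketch of conditional flow matching in PyTorch. It is illustrative only, not the Matcha-TTS training code; the `decoder(x_t, t, cond)` signature and the `mel` and `cond` tensors are assumed placeholders.

```python
import torch

def ot_cfm_loss(decoder, mel, cond, sigma_min: float = 1e-4):
    """Illustrative optimal-transport conditional flow matching (OT-CFM) loss.

    decoder: network predicting a velocity field v_theta(x_t, t, cond) (placeholder)
    mel:     target mel-spectrogram batch, shape (batch, n_mels, frames)
    cond:    conditioning produced by the text encoder (placeholder)
    """
    x1 = mel                                             # data sample
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # random time in [0, 1]

    # Straight-line (optimal-transport) path from noise to data ...
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # ... whose constant velocity is the regression target.
    ut = x1 - (1 - sigma_min) * x0

    vt = decoder(xt, t.view(-1), cond)                   # predicted velocity
    return torch.mean((vt - ut) ** 2)
```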
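"Fewer synthesis steps than models trained using score matching" then means integrating the learned ODE with a coarse fixed-step solver at inference. Below is a hedged sketch using a plain Euler solver; the `decoder` interface and the default step count are assumptions, not the repository's API.

```python
import torch

@torch.no_grad()
def sample_mel(decoder, cond, shape, n_steps: int = 10):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1 (mel) with Euler steps."""
    x = torch.randn(shape)                               # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for i in range(n_steps):
        t = ts[i].expand(shape[0])                       # same time value for the whole batch
        dt = ts[i + 1] - ts[i]
        x = x + dt * decoder(x, t, cond)                 # one Euler step along the velocity field
    return x                                             # approximate mel-spectrogram
```

Each step costs one decoder evaluation, so a small `n_steps` is where the speed advantage over score-matching samplers comes from.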
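Since the model only produces acoustic features, a full Catalan TTS pipeline also needs a phonemizer front end and a neural vocoder to map the predicted mel-spectrogram into a waveform. The sketch below shows that wiring with hypothetical placeholders (`phonemize_catalan`, `acoustic_model` and `vocoder` are not the real API; see the "How to use" section for the actual instructions).

```python
import torch

def text_to_speech(text, phonemize_catalan, acoustic_model, vocoder, speaker_id=0):
    """Hypothetical end-to-end wiring: text -> phonemes -> mel-spectrogram -> waveform."""
    phonemes = phonemize_catalan(text)                      # Catalan phonemizer front end
    with torch.no_grad():
        mel = acoustic_model(phonemes, speaker=speaker_id)  # this model: acoustic features only
        wav = vocoder(mel)                                  # e.g. a neural vocoder -> waveform
    return wav
```

Swapping in a phonemizer for another language is exactly the mismatch the limitations paragraph warns about, since the acoustic model was fine-tuned on Catalan phoneme sequences.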