How much data
Hi, I see you replaced the old vocab with a new Russian IPA vocab. How much data did you use to train this model? Thank you.
Hello. The library contains one statistical model (for generating IPA transcriptions) and two BERT models for accentuation. To train the statistical model, I mainly used words from Wiktionary and Wikipedia (the Russian version of which contains IPA transcriptions). To train BERT, I used ~3 GB of text data in which the correct accents for ambiguous words had been marked. I am currently working on increasing the amount of training data in order to resolve ambiguities in accentuation more accurately.
Thank you for your reply. What I meant is: how much audio data did you use to train XTTS?
I'm sorry, I should have guessed. It was a small experiment, just to understand whether it makes sense to use transcription and accents for speech synthesis. I used ~60 hours of speech for training. In the README, I listed the acoustic data that I used for training: https://github.com/omogr/omogre/blob/main/README_eng.md. The model was trained on the RUSLAN and Common Voice datasets.
https://ruslan-corpus.github.io/
https://commonvoice.mozilla.org/ru
Thank you
Did you freeze the original layers to train new model weights, or did you fine-tune using the default weights?
Regarding your model: the stress on the letter "И" (in Russian) is not always placed correctly, and the letter "Ч" is pronounced like "Ш". Also, the pauses between sentences should be slightly longer.
Additionally, you should add number processing to the code, converting numbers into words — currently, the library simply removes them.
Thank you for your thoughtful feedback on my library! I appreciate you taking the time to share your observations.
- Model Training Approach
"Did you freeze the original layers to train new model weights, or did you fine-tune using the default weights?"
The model was fine-tuned using the default weights without freezing the original layers.
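For readers curious what that distinction looks like in practice, here is a minimal PyTorch sketch (illustrative only, not the actual XTTS training code; the network and layer sizes are placeholders) contrasting full fine-tuning with freezing the original layers:

```python
import torch

# Placeholder network standing in for a pretrained checkpoint.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),  # "original" pretrained layer
    torch.nn.Linear(256, 128),  # "new" task-specific layer
)

# Full fine-tuning (the approach used here): every parameter stays
# trainable, so the optimizer updates the whole network.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# The alternative (NOT used here): freeze the original layers so that
# only the new weights receive gradient updates.
# for param in model[0].parameters():
#     param.requires_grad = False
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5
# )
```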
- Pronunciation Observations
"The stress on 'И' isn't always correct, 'Ч' sounds like 'Ш', and pauses between sentences could be longer."
I haven't observed these systematic errors in my testing, but I'd be very interested to investigate specific phrases where you've encountered these issues. The model's pronunciation is heavily influenced by the reference audio used during inference - different reference files can produce noticeably different articulation patterns. If you could share examples of problematic sentences (and optionally a reference audio that demonstrates your desired pronunciation), I'd be happy to explore this further.
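To make the reference-audio point concrete, here is a minimal sketch using the stock Coqui TTS API (the fine-tuned checkpoint from this project is loaded differently, as in the notebook linked later in this thread; the reference file names are placeholders). Swapping `speaker_wav` alone can noticeably change articulation:

```python
from TTS.api import TTS

# Stock XTTS-v2 checkpoint; a fine-tuned model would be loaded from
# its own config and checkpoint instead.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# The same sentence rendered with two different reference recordings;
# articulation can differ noticeably between the two outputs.
for ref in ("reference_a.wav", "reference_b.wav"):  # placeholder files
    tts.tts_to_file(
        text="Проверка произношения.",
        speaker_wav=ref,
        language="ru",
        file_path=f"out_{ref}",
    )
```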
- Text Normalization (Numbers/Symbols)
"Add number-to-words conversion instead of removing them."
You're absolutely right that robust text preprocessing should handle numbers, abbreviations, special symbols, and mixed-language text. However, implementing comprehensive text normalization is nontrivial and highly domain-dependent (e.g., dates/currency/units require context-aware conversion).
For this implementation, I consciously decided to focus on core TTS functionality while leveraging existing specialized libraries for text normalization. Solutions like ruNorm (Russian-specific) or multilingual tools like those in the XTTS framework could serve as viable starting points. While some implementations might appear overly complex due to their multilingual support, they could be adapted for specific use cases through targeted modifications.
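As one concrete starting point, here is a hedged sketch of a pre-transcription normalization pass built on the `num2words` package, which supports Russian; the regex and function name are illustrative, not part of this library:

```python
import re

from num2words import num2words  # pip install num2words

def normalize_numbers(text: str) -> str:
    # Replace each standalone integer with its Russian spelled-out
    # form. This is deliberately naive: it always produces cardinals
    # in the nominative case, which is exactly why dates, currency,
    # and units need context-aware handling.
    return re.sub(
        r"\d+",
        lambda m: num2words(int(m.group()), lang="ru"),
        text,
    )

print(normalize_numbers("В 2024 году было 3 релиза."))
# -> "В две тысячи двадцать четыре году было три релиза."
# (grammatically the ordinal "две тысячи двадцать четвёртом" is
# required here, illustrating the context dependence noted above)
```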
Here are some examples:
"Нечто" (Nechto): It should be [nʲetɕtə] (pronounced like "nechta" with a clear "ch"), but the transliterator outputs [nʲeʂtə] (like "neshta" with "sh"). While this might sound similar during fast reading, no one actually says "neshto" today — it's incorrect.
"Аура" (Aura): It should be [aʊrə] (with the diphthong "au"), but the transliterator gives [arə] (like "ara"), and [aurə] sounds like "ura".
The issue is the absence of tokens for [tɕtə] (for the "chte" sound) and [aʊ] (for the "au" diphthong).
We need transliteration that reflects the orthographic pronunciation, not just the phonetic one.
There are many similar errors. Is there a transliteration model dictionary where corrections can be manually added? That would help fix these mistakes and improve the output quality.
One more example:
"Яо Чанъин" (Yao Changying): the transcriptor outputs [ao tɕɪnʲin], which sounds like "O chenin"; no matter how hard I tried, it never produced the initial "ya" of "Yao".
Thank you so much. I totally agree that this needs to be fixed. I'll figure it out, though it might take me some time.
I’ve updated the dictionaries, and words like "нечто," "аура," and others should now transcribe correctly.
To update the data, please run:

```
python -m download_data
```

Example usage:

```python
from omogre import Transcriptor

transcriptor = Transcriptor()
print(transcriptor(['нечто', 'аура']))
# Output: ['nʲ`etɕtə', '`aʊrə']
```
If you need to download the data to a specific directory, use:

```
python -m download_data --data_path your_data_directory_path
```

Then initialize the Transcriptor with the custom path:

```python
from omogre import Transcriptor

transcriptor = Transcriptor(data_path='your_data_directory_path')
print(transcriptor(['нечто', 'аура']))
```
Let me know if you encounter any further issues!
If a custom dictionary is needed, it can be added during the text preprocessing stage. Such a stage is necessary in any case, as it handles numbers, abbreviations, words in other languages, special symbols, and similar elements. At this stage, words requiring special transcription can be replaced in the text with their transcriptions. Here’s an example of how this might look:
```python
from omogre import Transcriptor

transcriptor = Transcriptor(punct=None)
print(transcriptor(['некий ao tɕɪnʲinnn']))
```

The `punct=None` parameter ensures that the transcriptor does not remove unfamiliar symbols from the text but leaves them as-is.
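Building on that, here is a hedged sketch of what such a substitution pass might look like; `CUSTOM_DICT`, its entries, and the helper function are illustrative placeholders, not part of omogre:

```python
import re

from omogre import Transcriptor

# Hypothetical user-maintained corrections (word -> desired IPA),
# using the transcription format shown earlier in this thread.
CUSTOM_DICT = {
    "нечто": "nʲ`etɕtə",
    "аура": "`aʊrə",
}

def apply_custom_dict(text: str) -> str:
    # Substitute each dictionary word with its transcription before
    # the transcriptor runs; whole-word match, case-insensitive.
    for word, ipa in CUSTOM_DICT.items():
        text = re.sub(rf"\b{re.escape(word)}\b", ipa, text,
                      flags=re.IGNORECASE)
    return text

# punct=None keeps the inserted IPA symbols intact.
transcriptor = Transcriptor(punct=None)
print(transcriptor([apply_custom_dict("некий дом и аура")]))
```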
Have you tried increasing the model's character limit for Russian? The limit of 182 seems very odd. With IPA, it should be possible to increase it to at least 300-400. And which model version is it, 2.0.2 or 2.0.3?
There is simply a difference in pronunciation and in artifacts between the versions.
Thank you for your question!
If you're referring to the XTTS model — during training, the goal was to test how useful IPA transcription is for synthesis. For this reason, everything else was kept as standard as possible.
The model was downloaded using the script from the XTTS repository: https://github.com/coqui-ai/TTS/blob/dev/TTS/demos/xtts_ft_demo/utils/gpt_train.py
It’s likely that the URLs hardcoded in that file correspond to an earlier version of the model (probably 2.0.1 instead of 2.0.3).
To run the model, I used standard XTTS code: https://github.com/omogr/omogre/blob/main/XTTS_ru_ipa.ipynb
I haven’t tried to develop or maintain a custom version of the XTTS code with modified length constraints or other changes — my focus was on testing IPA-based training within the standard framework.
Let me know if you have any further questions!
I think the model's tokenizer limits what IPA can achieve. It needs to be modified, and the text-length limit increased; otherwise the artifacts at the ends of sentences are unlikely to disappear. A GPT-2-style model is used internally, and I'm afraid that is the cause. With IPA it was possible to improve voice quality, but the other issues still haven't gone away. The model seems to cut sentences off, adding a "pa" sound at the end, even though it wants to continue generating (other languages get a standard limit of around 250 characters); it is especially noticeable at a ".". There is also the idea of exposing IPA as a separate language. Then, once the old model data is updated, generation will probably improve.
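For anyone who wants to experiment with lifting that limit, here is a hedged sketch; it assumes a Coqui TTS release where the XTTS tokenizer keeps per-language limits in a `char_limits` dict (the attribute path may differ between versions, so treat this as a starting point rather than a supported API):

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# The XTTS tokenizer warns about (and audio may truncate beyond) a
# per-language character limit; "ru" defaults to 182 in recent releases.
tokenizer = tts.synthesizer.tts_model.tokenizer  # path may vary by version
tokenizer.char_limits["ru"] = 400  # e.g. the 300-400 range suggested above
```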