
adapting speed for model voices

#121
by kipgon - opened

I just discovered xtts_v2 (and text-to-speech in general) and I'm having a lot of fun!

It seems that to adjust the speaking speed we can't use the `tts.tts_to_file(speaker=name, ...)` function and simply pass the name of a predefined voice.
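For context, this is the high-level call I mean; a minimal sketch, assuming the `tts_models/multilingual/multi-dataset/xtts_v2` model id and a CUDA device:

```python
from TTS.api import TTS

# High-level API: select one of the predefined voices by name.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker="Ana Florence",  # one of the bundled speaker names
    language="en",
    file_path="output.wav",
)
```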

It seems we have to use `model.inference(speaker_embedding=embedding, ...)` together with `gpt_cond_latent, embedding = model.get_conditioning_latents(audio_path=speaker_wavs)`, where `speaker_wavs` is a list of reference audio files.
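Here is the lower-level path I pieced together from the docs; a sketch, where the checkpoint paths are placeholders for wherever xtts_v2 was downloaded, and `speed=1.2` is just an example value:

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from a local checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()

# Build the conditioning inputs from one or more reference wavs.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# inference() takes a speed argument (1.0 = normal pace).
out = model.inference(
    "It took me quite a long time to develop a voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    speed=1.2,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```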

My questions are:

  1. Did I understand correctly that there is no way to set the speed for the predefined voices?
  2. How would I do it with `model.inference`: should I generate sentences with `tts.tts_to_file` using, say, "Ana Florence", and feed the resulting files to `model.inference`? If so, what kind of dataset should I build (number of files, duration of each file, kind of sentences, etc.)? A sketch of the round trip I have in mind follows below.
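To make question 2 concrete, this is the round trip I have in mind; a sketch that reuses the `tts` and `model` objects from above, and the three synthetic sentences are arbitrary (whether they are enough conditioning material is exactly what I am asking):

```python
# Step 1: synthesize a few short clips with the predefined voice.
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am recording a few seconds of clean speech.",
    "This clip will serve as a voice reference.",
]
ref_paths = []
for i, sentence in enumerate(sentences):
    path = f"ana_ref_{i}.wav"
    tts.tts_to_file(text=sentence, speaker="Ana Florence", language="en", file_path=path)
    ref_paths.append(path)

# Step 2: feed those clips back in as reference audio, then set the speed.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_paths)
out = model.inference(
    "Now the same voice, at a slower pace.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    speed=0.8,
)
```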

Bonus questions:

  1. Generally speaking, how should one build a dataset for voice cloning with xtts_v2?
  2. Are there good practices to avoid hallucinations? (I list the generation knobs I would try below.)
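On hallucinations, these are the generation knobs I would experiment with; a sketch based on the keyword arguments `model.inference()` exposes (if I read the signature right), with values picked for illustration only:

```python
out = model.inference(
    "A long paragraph of text goes here.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,            # lower = more conservative sampling
    repetition_penalty=2.0,      # discourages looping / repeated phrases
    length_penalty=1.0,
    top_k=50,
    top_p=0.85,
    enable_text_splitting=True,  # chunk long inputs into sentences
)
```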
