Adapting speed for model voices
#121 · opened by kipgon
I just discovered xtts_v2 (and text-to-speech in general) and I'm having a lot of fun!
It seems that to adjust speed we can't use the `tts.tts_to_file(speaker=name, ...)` function and simply pass the name of a built-in voice.
Instead, it seems we have to use `model.inference(speaker_embedding=embedding, ...)` together with `gpt_cond_latent, embedding = model.get_conditioning_latents(audio_path=speaker_wavs)`, where `speaker_wavs` is a list of audio files.
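For reference, here is roughly what I pieced together from the Coqui XTTS docs for the low-level path that exposes a `speed` argument (a minimal sketch; the checkpoint paths and `reference.wav` are placeholders, not real files):

```python
# Sketch of the low-level XTTS v2 inference path with speed control.
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from a local checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()

# Condition on one or more reference wavs of the target voice.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# `speed` > 1.0 is faster, < 1.0 is slower.
out = model.inference(
    "Hello, this is a speed test.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    speed=1.3,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```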
My questions are:
- Did I understand correctly that there is no way to set the speed for the predefined voices?
- How do I do this with `model.inference`? Should I synthesize sentences with `tts.tts_to_file` using, say, "Ana Florence", and feed the resulting files to `model.inference` (see the sketch after this list)? If so, what kind of dataset should I build (number of files, duration of each file, type of sentences, etc.)?
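Here is a sketch of the workaround I have in mind: synthesize one clip with the built-in "Ana Florence" voice via the high-level API, then reuse that clip as the conditioning reference for the low-level call above. I'm assuming a single clean clip of roughly 10–20 seconds is enough for conditioning (that duration is my guess, not a documented requirement; `get_conditioning_latents` accepts a list, so several clips could be passed too):

```python
# Sketch: bootstrap a reference clip from a predefined voice, then
# reuse it for speed-controlled inference with the low-level API.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Step 1: synthesize a reference clip with the built-in voice.
tts.tts_to_file(
    text="This is a reference clip used only for voice conditioning.",
    speaker="Ana Florence",
    language="en",
    file_path="ana_florence_ref.wav",
)

# Step 2: feed it to the low-level model (see the previous snippet) as
# audio_path=["ana_florence_ref.wav"] and set `speed` there.
```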
Bonus questions:
- Generally speaking, how should one build a dataset for voice cloning with xtts_v2?
- Are there good practices for avoiding hallucinations?