
adapting speed for model voices

#121
by kipgon - opened

I just discovered xtts_v2 (and text-to-speech in general) and I'm having a lot of fun!

It seems that to adjust the speaking speed we can't use the `tts.tts_to_file(speaker=name, ...)` function and simply pass the name of a predefined voice.
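For context, this is the high-level call I mean; a minimal sketch, assuming the `tts_models/multilingual/multi-dataset/xtts_v2` model id and a CUDA device:

```python
from TTS.api import TTS

# High-level API: select one of the predefined voices by name.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello, this is a test.",
    speaker="Ana Florence",  # one of the bundled speaker names
    language="en",
    file_path="output.wav",
)
```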

It seems we have to use `model.inference(speaker_embedding=embedding, ...)` together with `gpt_cond_latent, embedding = model.get_conditioning_latents(audio_path=speaker_wavs)`, where `speaker_wavs` is a list of reference audio files.
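Here is the lower-level path I pieced together from the docs; a sketch, where the checkpoint paths are placeholders for wherever xtts_v2 was downloaded, and `speed=1.2` is just an example value:

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from a local checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()

# Build the conditioning inputs from one or more reference wavs.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# inference() takes a speed argument (1.0 = normal pace).
out = model.inference(
    "It took me quite a long time to develop a voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    speed=1.2,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```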

My questions are:

  1. Did I understand correctly that there is no way to set the speed for the predefined voices?
  2. How would I do it with `model.inference`: should I generate sentences with `tts.tts_to_file` using, say, "Ana Florence", and feed the resulting files to `model.inference`? If so, what kind of dataset should I build (number of files, duration of each file, kind of sentences, etc.)? A sketch of the round trip I have in mind follows below.
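To make question 2 concrete, this is the round trip I have in mind; a sketch that reuses the `tts` and `model` objects from above, and the three synthetic sentences are arbitrary (whether they are enough conditioning material is exactly what I am asking):

```python
# Step 1: synthesize a few short clips with the predefined voice.
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am recording a few seconds of clean speech.",
    "This clip will serve as a voice reference.",
]
ref_paths = []
for i, sentence in enumerate(sentences):
    path = f"ana_ref_{i}.wav"
    tts.tts_to_file(text=sentence, speaker="Ana Florence", language="en", file_path=path)
    ref_paths.append(path)

# Step 2: feed those clips back in as reference audio, then set the speed.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=ref_paths)
out = model.inference(
    "Now the same voice, at a slower pace.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    speed=0.8,
)
```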

Bonus questions:

  1. Generally speaking, how should one build a dataset for voice cloning with xtts_v2?
  2. Are there good practices to avoid hallucinations? (I list the generation knobs I would try below.)
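On hallucinations, these are the generation knobs I would experiment with; a sketch based on the keyword arguments `model.inference()` exposes (if I read the signature right), with values picked for illustration only:

```python
out = model.inference(
    "A long paragraph of text goes here.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,            # lower = more conservative sampling
    repetition_penalty=2.0,      # discourages looping / repeated phrases
    length_penalty=1.0,
    top_k=50,
    top_p=0.85,
    enable_text_splitting=True,  # chunk long inputs into sentences
)
```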
