Voice Cloning with Veena
Hi! I am exploring the possibility of implementing voice cloning with the Veena model. I understand that it uses fixed speaker tokens (<spk_kavya>, <spk_agastya>, etc.) for speaker identity.
Is it possible to add custom speakers beyond the 4 pre-trained ones? Any insights on the feasibility and recommended approach would be greatly appreciated!
Voice cloning is not currently supported in this open-sourced model; that capability is under development in our self-hosted proprietary model. However, adding custom speakers beyond the pre-trained ones is definitely supported: label your data with new speaker tokens as per the instructions and fine-tune the model for your use case.
This Llama-style TTS model expects your training samples to be in the following format:
[START_OF_HUMAN_TOKEN] <spk_name> text_prompt [END_OF_HUMAN_TOKEN] [START_OF_AI_TOKEN] [START_OF_SPEECH_TOKEN] audio_tokens [END_OF_SPEECH_TOKEN] [END_OF_AI_TOKEN]
These fixed control tokens are what tell the model when to start and stop speaking. Check the special tokens listed in the model card.
Control token IDs (fixed for Veena)
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266
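As a minimal sketch, the format above can be assembled from these fixed IDs like so. The text-token and audio-code values here are placeholders: in practice the text IDs come from the model's tokenizer (with the <spk_name> token included in the prompt), and the audio codes come from a neural audio codec. Note the offset scheme is simplified to a single base offset; the actual model may apply per-codebook offsets, so verify against the model card before training.

```python
# Fixed control-token IDs for Veena, as listed above.
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266


def build_training_sequence(text_token_ids, audio_codes):
    """Wrap tokenized text and codec audio codes in Veena's control tokens.

    text_token_ids: tokenizer output for "<spk_name> text_prompt"
    audio_codes:    raw codec codes for the target speech
    """
    # Audio codes are shifted by the base offset so they occupy their own
    # ID range, separate from the text vocabulary. (Simplified: a real
    # setup may add a per-codebook offset on top of this base.)
    audio_token_ids = [AUDIO_CODE_BASE_OFFSET + c for c in audio_codes]
    return (
        [START_OF_HUMAN_TOKEN]
        + list(text_token_ids)
        + [END_OF_HUMAN_TOKEN]
        + [START_OF_AI_TOKEN, START_OF_SPEECH_TOKEN]
        + audio_token_ids
        + [END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
    )


# Example with dummy text-token IDs and audio codes:
seq = build_training_sequence([101, 102, 103], [0, 5, 9])
```

At inference time the same framing applies: the prompt ends after [START_OF_SPEECH_TOKEN] and the model generates audio tokens until it emits [END_OF_SPEECH_TOKEN].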