Voice Cloning with Veena
Hi! I am exploring the possibility of implementing voice cloning with the Veena model. I understand that it uses fixed speaker tokens (<spk_kavya>, <spk_agastya>, etc.) for speaker identity.
Is it possible to add custom speakers beyond the 4 pre-trained ones? Any insights on the feasibility and recommended approach would be greatly appreciated!
Voice cloning is not currently supported in this open-sourced model; that capability is under development in our self-hosted proprietary model. However, adding custom speakers beyond the pre-trained ones is definitely supported: label your data with new speaker tokens as per the instructions and fine-tune the model for your use case.
This Llama-style TTS model expects your training samples to be in the following format:
[START_OF_HUMAN_TOKEN] <spk_name> text_prompt [END_OF_HUMAN_TOKEN] [START_OF_AI_TOKEN] [START_OF_SPEECH_TOKEN] audio_tokens [END_OF_SPEECH_TOKEN] [END_OF_AI_TOKEN]
These fixed control tokens are what tell the model when to start and stop speaking. Check the special tokens listed in the model card.
Control token IDs (fixed for Veena)
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266
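As a minimal sketch, the format above can be assembled from these fixed IDs like so. The text-token and audio-code values here are placeholders: in practice the text IDs come from the model's tokenizer (with the <spk_name> token included in the prompt), and the audio codes come from a neural audio codec. Note the offset scheme is simplified to a single base offset; the actual model may apply per-codebook offsets, so verify against the model card before training.

```python
# Fixed control-token IDs for Veena, as listed above.
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266


def build_training_sequence(text_token_ids, audio_codes):
    """Wrap tokenized text and codec audio codes in Veena's control tokens.

    text_token_ids: tokenizer output for "<spk_name> text_prompt"
    audio_codes:    raw codec codes for the target speech
    """
    # Audio codes are shifted by the base offset so they occupy their own
    # ID range, separate from the text vocabulary. (Simplified: a real
    # setup may add a per-codebook offset on top of this base.)
    audio_token_ids = [AUDIO_CODE_BASE_OFFSET + c for c in audio_codes]
    return (
        [START_OF_HUMAN_TOKEN]
        + list(text_token_ids)
        + [END_OF_HUMAN_TOKEN]
        + [START_OF_AI_TOKEN, START_OF_SPEECH_TOKEN]
        + audio_token_ids
        + [END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
    )


# Example with dummy text-token IDs and audio codes:
seq = build_training_sequence([101, 102, 103], [0, 5, 9])
```

At inference time the same framing applies: the prompt ends after [START_OF_SPEECH_TOKEN] and the model generates audio tokens until it emits [END_OF_SPEECH_TOKEN].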