Kyutai TTS voices
Do you want more voices? Help us by donating your voice, or open an issue in the TTS repo to suggest permissively licensed datasets of voices we could add here.
vctk/
From the Voice Cloning Toolkit dataset, licensed under the Creative Commons License: Attribution 4.0 International.
Each recording was done with two mics; here we used the mic1 recordings.
We chose sentence 23 for every speaker because it's generally the longest one to pronounce.
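For reference, a minimal sketch of how such a selection can be scripted, assuming the standard VCTK 0.92 layout wav48_silence_trimmed/<speaker>/<speaker>_023_mic1.flac (the root path and naming scheme below are assumptions, not guaranteed by this repo):

```python
from pathlib import Path

# Assumed VCTK extraction root; adjust to the local path.
VCTK_ROOT = Path("VCTK-Corpus-0.92/wav48_silence_trimmed")

def sentence_23_mic1_clips(root: Path = VCTK_ROOT) -> list[Path]:
    """Collect the mic1 recording of sentence 23 for every speaker,
    assuming the usual VCTK naming scheme <speaker>_<sentence>_<mic>.flac,
    e.g. p225_023_mic1.flac."""
    clips = []
    for speaker_dir in sorted(root.iterdir()):
        if not speaker_dir.is_dir():
            continue
        clip = speaker_dir / f"{speaker_dir.name}_023_mic1.flac"
        if clip.exists():  # a few speakers are missing some sentences
            clips.append(clip)
    return clips
```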
expresso/
From the Expresso dataset, licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International. Non-commercial use only.
We select clips from the "conversational" files.
For each pair of "kind" and channel (e.g. ex04-ex01_laughing, channel 1), we find one segment with at least 10 consecutive seconds of speech using VAD_segments.txt.
We don't include more segments per (kind, channel) to keep the number of voices manageable.
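A rough sketch of that search, purely illustrative rather than the exact script we used: it assumes the VAD annotations for one (kind, channel) have already been parsed from VAD_segments.txt into (start, end) pairs in seconds (the exact file format is not shown here).

```python
def first_long_speech_segment(segments: list[tuple[float, float]],
                              min_duration: float = 10.0) -> float | None:
    """Return the start time (seconds) of the first VAD segment that
    contains at least `min_duration` consecutive seconds of speech,
    or None if no segment is long enough.

    `segments` are (start, end) times in seconds for one (kind, channel),
    parsed from VAD_segments.txt beforehand.
    """
    for start, end in sorted(segments):
        if end - start >= min_duration:
            return start
    return None


# Hypothetical usage: the returned offset is the start of the 10-second
# clip that ends up encoded in the voice filename (e.g. the "674s" below).
# offset = first_long_speech_segment(parsed_segments)
```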
The name of the file indicates how it was selected.
For instance, ex03-ex02_narration_001_channel1_674s.wav comes from the first audio channel of audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav, meaning the speaker is ex03. It is a 10-second clip starting at 674 seconds into the original file.
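To illustrate the convention, here is a small, purely illustrative helper that maps a voice filename back to its source file, channel, and start offset; the dataset root is a placeholder and not part of this repo.

```python
import re
from pathlib import Path

# Assumed root of the extracted Expresso dataset; adjust locally.
EXPRESSO_ROOT = Path("expresso/audio_48khz/conversational")

def decode_voice_name(name: str) -> tuple[Path, int, float]:
    """Map a voice filename such as
    'ex03-ex02_narration_001_channel1_674s.wav'
    back to (source file, channel index, start time in seconds)."""
    m = re.fullmatch(r"(?P<pair>ex\d+-ex\d+)_(?P<kind>\w+)_(?P<idx>\d+)"
                     r"_channel(?P<channel>\d)_(?P<start>\d+)s\.wav", name)
    if m is None:
        raise ValueError(f"unexpected voice filename: {name}")
    source = (EXPRESSO_ROOT / m["pair"] / m["kind"]
              / f"{m['pair']}_{m['kind']}_{m['idx']}.wav")
    return source, int(m["channel"]), float(m["start"])


# Example: returns (.../ex03-ex02/narration/ex03-ex02_narration_001.wav, 1, 674.0)
src, channel, start = decode_voice_name("ex03-ex02_narration_001_channel1_674s.wav")
```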
cml-tts/fr/
French voices selected from the CML-TTS Dataset, licensed under the Creative Commons License: Attribution 4.0 International.
Computing voice embeddings (for Kyutai devs)
```bash
uv run {root of `moshi` repo}/scripts/tts_make_voice.py \
    --model-root {path to weights dir}/moshi_1e68beda_240/ \
    --loudness-headroom 22 \
    {root of this repo}
```
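The parts in braces are placeholders: fill in the local checkout of the moshi repository, the directory containing the moshi_1e68beda_240 weights, and the root of this repository.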