KaniTTS
A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications.
Overview
KaniTTS uses a two-stage pipeline that combines a large language model with an efficient neural audio codec. A backbone LLM first generates compressed audio token representations, and the codec then synthesizes waveforms from those tokens. This split keeps latency low: about 1 second to generate 15 seconds of audio on an NVIDIA RTX 5080.
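The data flow of the two stages can be sketched as follows. This is only an illustration of the text-to-tokens-to-waveform hand-off: `synthesize`, `toy_backbone`, and `toy_codec` are hypothetical names, and the stand-in functions do not produce speech or reflect the model's real interfaces.

```python
from typing import Callable, List

# Hypothetical type aliases; the model's real interfaces are not described in this card.
AudioTokens = List[int]
Waveform = List[float]


def synthesize(
    text: str,
    backbone: Callable[[str], AudioTokens],
    codec: Callable[[AudioTokens], Waveform],
) -> Waveform:
    """Stage 1 maps text to compressed audio tokens; stage 2 decodes them to samples."""
    tokens = backbone(text)
    return codec(tokens)


# Toy stand-ins so the sketch runs end to end. They do NOT generate speech.
def toy_backbone(text: str) -> AudioTokens:
    return [ord(ch) % 256 for ch in text]  # one fake token per character


def toy_codec(tokens: AudioTokens) -> Waveform:
    return [t / 255.0 * 2.0 - 1.0 for t in tokens]  # map each token into [-1, 1]


waveform = synthesize("hi", toy_backbone, toy_codec)
```

Keeping the two stages behind separate callables mirrors the description above: the backbone and the codec can be swapped or benchmarked independently.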
Key Specifications:
- Model Size: 370M parameters
- Sample Rate: 22kHz
- Languages: English, German, Chinese, Korean, Arabic, Spanish
- License: Apache 2.0
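When saving generated audio, the file's sample rate has to match the model's output rate or playback will be pitched and timed incorrectly. The sketch below uses only the Python standard library and assumes that "22kHz" means 22,050 Hz, which this card does not confirm. The sine tone is a placeholder for model output, and `write_wav` is a hypothetical helper.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22_050  # assumption: "22kHz" read as 22,050 Hz


def write_wav(samples, sample_rate=SAMPLE_RATE):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    # Clip to [-1, 1] and scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return buf.getvalue()


# Placeholder signal: 0.5 s of a 440 Hz tone standing in for model output.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
wav_bytes = write_wav(tone)
```

To write to disk instead, pass a file path to `wave.open` in place of the in-memory buffer.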
Performance
NVIDIA RTX 5080 Benchmarks:
- Latency: ~1 second to generate 15 seconds of audio
- Memory: 2 GB GPU VRAM
- Quality Metrics: MOS 4.3/5 (naturalness), WER <5% (accuracy)
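The latency figure above is often expressed as a real-time factor (RTF): generation time divided by audio duration, where values below 1 mean faster than real time. A minimal calculation using the approximate numbers from this card (the function name is illustrative):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced."""
    return generation_seconds / audio_seconds


# Approximate benchmark figures quoted above: ~1 s to produce 15 s of audio.
rtf = real_time_factor(1.0, 15.0)
speedup = 1.0 / rtf

print(f"RTF: {rtf:.3f} (about {speedup:.0f}x faster than real time)")
```

With these inputs the RTF is roughly 0.067, so the model produces audio about 15 times faster than it plays back on that GPU.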
Pretraining:
- Dataset: ~80k hours from LibriTTS, Common Voice, and Emilia
- Hardware: 8x H100 GPUs, 45 hours of training time on Lambda AI
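For a rough sense of the pretraining budget, the figures above can be converted to GPU-hours. This is back-of-the-envelope arithmetic on the approximate numbers quoted in this card, not a reported measurement.

```python
NUM_GPUS = 8            # H100 GPUs, from the hardware line above
TRAINING_HOURS = 45     # wall-clock training time
DATASET_HOURS = 80_000  # "~80k hours" of audio (approximate)

gpu_hours = NUM_GPUS * TRAINING_HOURS
audio_hours_per_gpu_hour = DATASET_HOURS / gpu_hours

print(f"Total compute: {gpu_hours} GPU-hours")
print(f"Audio per GPU-hour: ~{audio_hours_per_gpu_hour:.0f} hours")
```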
Voice Datasets
- https://huggingface.co/datasets/nytopop/expresso-conversational
- https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech
- https://huggingface.co/datasets/jazza234234/david-dataset
- https://huggingface.co/datasets/reach-vb/jenny_tts_dataset
- https://huggingface.co/datasets/MBZUAI/ArVoice
- https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full
- https://huggingface.co/datasets/SinclairSchneider/german_voice_cb
- https://huggingface.co/datasets/Bingsu/KSS_Dataset
- https://huggingface.co/datasets/ciempiess/ciempiess_fem
- https://huggingface.co/datasets/TingChen-ppmc/Shanghai_Dialect_TTS_openai
- https://huggingface.co/datasets/boniromou/zh-yue-tts-dataset
- https://huggingface.co/datasets/zeeshanparvez/andrew-v3
Voices:
- david: David, English (British)
- puck: Puck, English (Gemini)
- kore: Kore, English (Gemini)
- andrew: Andrew, English
- jenny: Jenny, English (Irish)
- simon: Simon, English
- katie: Katie, English
- seulgi: Seulgi, Korean
- bert: Bert, German
- thorsten: Thorsten, German (Hessisch)
- maria: Maria, Spanish
- mei: Mei, Chinese (Cantonese)
- ming: Ming, Chinese (Shanghai OpenAI)
- karim: Karim, Arabic
- nur: Nur, Arabic
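For application code that needs to pick a voice by language, the list above can be held as a simple lookup table. The voice identifiers and descriptions come from this card; the `VOICES` dictionary and `voices_for_language` helper are illustrative, and how a voice identifier is actually passed to the model is not specified here.

```python
# Voice id -> (display name, language/variant), copied from the list above.
VOICES = {
    "david": ("David", "English (British)"),
    "puck": ("Puck", "English (Gemini)"),
    "kore": ("Kore", "English (Gemini)"),
    "andrew": ("Andrew", "English"),
    "jenny": ("Jenny", "English (Irish)"),
    "simon": ("Simon", "English"),
    "katie": ("Katie", "English"),
    "seulgi": ("Seulgi", "Korean"),
    "bert": ("Bert", "German"),
    "thorsten": ("Thorsten", "German (Hessisch)"),
    "maria": ("Maria", "Spanish"),
    "mei": ("Mei", "Chinese (Cantonese)"),
    "ming": ("Ming", "Chinese (Shanghai OpenAI)"),
    "karim": ("Karim", "Arabic"),
    "nur": ("Nur", "Arabic"),
}


def voices_for_language(language: str) -> list:
    """Return voice ids whose language label starts with the given language name."""
    return [vid for vid, (_, label) in VOICES.items() if label.startswith(language)]


german_voices = voices_for_language("German")
```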
Audio Examples
Text | Audio |
---|---|
I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED. | |
What do we say to the god of death? Not today! | |
What do you call a lawyer with an IQ of 60? Your honor | |
You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? | |
Use Cases
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions
Limitations
- Performance degrades with inputs exceeding 2000 tokens
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training
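One way to work around the input-length limitation is to split long text at sentence boundaries and synthesize each chunk separately. The sketch below is a rough illustration: it counts whitespace-separated words as a stand-in for tokens, which is an assumption, since the model's tokenizer will count differently. `approx_tokens` and `chunk_text` are hypothetical helpers.

```python
import re

MAX_TOKENS = 2000  # input length beyond which quality is reported to degrade


def approx_tokens(text: str) -> int:
    # Assumption: word count as a crude proxy for the model's real token count.
    return len(text.split())


def chunk_text(text: str, budget: int = MAX_TOKENS) -> list:
    """Greedily pack whole sentences into chunks that stay within the budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, used = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue  # skip empty pieces (e.g. empty input)
        n = approx_tokens(sentence)
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = chunk_text("One two. Three four. Five six.", budget=4)
```

A single sentence longer than the budget is kept whole in its own chunk, so a production version would need a further fallback split, and a safety margin below 2000 is advisable given the crude token estimate.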
Optimization Tips
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs
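The batch-size tip can be applied with a small helper that groups pending texts before they are sent to the model. This is a generic sketch; `batched` is an illustrative name, and the call that would actually run the model on each batch is omitted because the inference API is not described in this card.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Example: 20 pending utterances grouped into batches of 8.
texts = [f"Sentence {i}." for i in range(20)]
batch_sizes = [len(batch) for batch in batched(texts, batch_size=8)]
```

The final batch may be smaller than the requested size, so downstream code should not assume a fixed batch dimension.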
Resources
Models:
Examples:
Links:
Acknowledgments
Built on LiquidAI LFM2 350M as the backbone and NVIDIA NanoCodec for audio processing.
Responsible Use
Prohibited activities include:
- Creating illegal content or harmful, threatening, defamatory, or obscene material
- Producing hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Engaging in malicious activities such as spamming, phishing, or fraud
By using this model, you agree to comply with these restrictions and all applicable laws.
Contact
Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
Citation
@misc {sb_2025,
author = { SB },
title = { gemini-flash-2.0-speech },
year = 2025,
url = { https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech },
doi = { 10.57967/hf/4237 },
publisher = { Hugging Face }
}
@misc{toyin2025arvoicemultispeakerdatasetarabic,
title={ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis},
author={Hawau Olamide Toyin and Rufael Marew and Humaid Alblooshi and Samar M. Magdy and Hanan Aldarmaki},
year={2025},
eprint={2505.20506},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.20506},
}
@misc {thorsten_müller_2024,
author = { {Thorsten Müller} },
title = { TV-44kHz-Full (Revision ff427ec) },
year = 2024,
url = { https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full },
doi = { 10.57967/hf/3290 },
publisher = { Hugging Face }
}
@misc{carlosmenaciempiessfem2019,
title={CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish.},
ldc_catalog_no={LDC2019S07},
DOI={https://doi.org/10.35111/xdx5-n815},
author={Hernandez Mena, Carlos Daniel},
journal={Linguistic Data Consortium, Philadelphia},
year={2019},
url={https://catalog.ldc.upenn.edu/LDC2019S07},
}
Model tree for nineninesix/kani-tts-370m
Base model
nineninesix/kani-tts-450m-0.2-pt