@Jaward on Hugging Face: "Thrilled to share our latest work: Voila

Jaward

posted an update May 6

Thrilled to share our latest work: Voila, a family of fully open-source voice models for real-time autonomous conversations and role-play. Some of our major contributions 🧵:
1) End-to-End Full-Duplex Architecture: directly processes simultaneous audio token streams from user to model and vice versa.
2) Voila-Tokenizer: a tokenizer trained on 100K hours of audio, with interleaved alignment (audio & text), that distills semantic and acoustic tokens via residual vector quantization (RVQ).
3) Text-Audio Interleaved Alignment: a fine-grained alignment of text and audio tokens that enables synchronization and expressiveness for tasks like ASR (WER 2.7%) and TTS (WER 2.8%).
4) Voice Customization: supports 1M+ pre-built voices and one-shot voice cloning from 10s audio clips using Wespeaker embeddings.
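
For readers new to RVQ (item 2), here is a minimal sketch of generic residual vector quantization: each stage picks the nearest codebook entry and passes the leftover residual to the next stage. This is not the Voila-Tokenizer code. The function names, the toy 2-D codebooks, and the example frame are all assumptions made up for illustration; a real tokenizer learns its codebooks and works on much higher-dimensional features.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; each stage encodes the residual of the previous one."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]  # stand-in for one audio feature frame
codes, residual = rvq_encode(frame, CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)
```

In this sketch the token for a frame is the list of per-stage indices, so adding stages refines the reconstruction without growing any single codebook.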

Models: maitrix-org/voila-67e0d96962c19f221fc73fa5
Code: https://github.com/maitrix-org/Voila
Demo: maitrix-org/Voila-demo
Project page: maitrix-org/Voila-demo

Jaward

May 6

If you like this work, kindly upvote the paper, thanks: https://huggingface.co/papers/2505.02707

prithivMLmods

May 6

edited May 6

Awesome work, guys! 🔥