BeigeTTS: Research Release for Neural Speech Synthesis
Overview
BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.
Research Context & Motivation
BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. While Khaki operates at significantly larger scale with enhanced capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness
BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.
Technical Architecture
Model Foundation
- Base Model: Google Gemma-3 4B Instruct
- Parameter Count: ~4 billion parameters (Khaki uses 70B+)
- Audio Codec: NeuCodec (24kHz, single codebook)
- Training Steps: 1,435,000 steps
- Context Length: 2048 tokens
- Vocabulary Size: Extended to 327,690 tokens (includes NeuCodec token space)
Research Implications
This release enables researchers to explore:
- Unified Text-Audio Modeling: How large language models can be adapted for audio generation tasks
- Token-Based Audio Synthesis: Advantages of discrete token representations over continuous methods
- Efficient Streaming: Real-time generation with minimal latency
- Cross-Modal Learning: Transfer learning between text and audio modalities
Token Space Design
The model employs a unified token space combining text and audio:
- Standard Gemma tokens: 0-262,144
- Special audio markers: AUDIO_START = 262,145, AUDIO_END = 262,146
- NeuCodec audio tokens: 262,154-327,689 (65,536 tokens)
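The mapping between a NeuCodec codebook index and its id in the extended vocabulary is a fixed offset. A minimal sketch (constants come from the table above; the helper names are illustrative, not part of the released code):

```python
# Hypothetical helpers illustrating the unified token-space layout.
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_TOKEN_BASE = 262_154   # first NeuCodec token id
NUM_AUDIO_TOKENS = 65_536    # single 16-bit codebook

def code_to_token_id(code: int) -> int:
    """Map a NeuCodec codebook index (0..65535) to its LM token id."""
    assert 0 <= code < NUM_AUDIO_TOKENS
    return AUDIO_TOKEN_BASE + code

def token_id_to_code(token_id: int) -> int:
    """Map an LM token id back to a NeuCodec codebook index."""
    assert AUDIO_TOKEN_BASE <= token_id < AUDIO_TOKEN_BASE + NUM_AUDIO_TOKENS
    return token_id - AUDIO_TOKEN_BASE

# The last audio token lands at 327,689, matching the extended vocabulary size of 327,690
print(code_to_token_id(NUM_AUDIO_TOKENS - 1))  # → 327689
```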
Capabilities & Limitations
Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation
Production Capabilities (Khaki - Not Released)
- Multilingual: 57 languages with accent control
- Voice Cloning: Zero-shot and few-shot speaker adaptation
- Emotion Control: 12 distinct emotional states
- Ultra-Low Latency: <50ms time-to-first-audio
- Long-Form: Stable generation for 30+ minute audio
- Voice Conversion: Real-time voice transformation
- Singing Synthesis: Musical vocal generation
Research Limitations
BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only
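The 10-second cap can be sanity-checked against the token budget, assuming NeuCodec's single codebook emits roughly 50 tokens per second of audio (an assumption about the codec's frame rate, not stated above):

```python
# Back-of-envelope check of the 10-second generation cap,
# assuming a ~50 tokens/second NeuCodec frame rate (not confirmed here).
TOKENS_PER_SECOND = 50   # assumed NeuCodec frame rate
MAX_SECONDS = 10         # BeigeTTS generation cap
budget = TOKENS_PER_SECOND * MAX_SECONDS
print(budget)  # → 500
```

Under that assumption, the budget matches the `max_new_tokens=500` used in the Quick Start below.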
Installation
```shell
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```
Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model, tokenizer, and audio codec
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Build the chat-style prompt; <start_of_speech> cues audio generation
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate audio tokens
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,        # required for temperature/top_p to take effect
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # also stop at AUDIO_END
    )

# Decode audio (see inference script for full implementation)
```
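The decode step elided above amounts to slicing the audio-token span out of the generated sequence and re-basing it into codebook indices. A minimal sketch, using the offsets from the token-space table; the helper name is illustrative, and the final `neucodec.decode` call is an assumption about the NeuCodec API rather than a confirmed signature:

```python
# Extract NeuCodec codebook indices from a generated token-id sequence.
# Marker ids follow the token-space table; extract_audio_codes is illustrative.
AUDIO_START, AUDIO_END = 262_145, 262_146
AUDIO_TOKEN_BASE = 262_154

def extract_audio_codes(token_ids):
    """Return codebook indices found between AUDIO_START and AUDIO_END."""
    codes, in_audio = [], False
    for t in token_ids:
        if t == AUDIO_START:
            in_audio = True
        elif t == AUDIO_END:
            break
        elif in_audio and t >= AUDIO_TOKEN_BASE:
            codes.append(t - AUDIO_TOKEN_BASE)
    return codes

# Synthetic example: two audio tokens wrapped in the marker pair
print(extract_audio_codes([1, 262_145, 262_154, 262_200, 262_146]))  # → [0, 46]

# With real model outputs, the codes would then be decoded and written out, e.g.:
# wav = neucodec.decode(torch.tensor(codes)[None, None, :])  # API assumption
# sf.write("out.wav", wav.squeeze().numpy(), 24000)
```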
Research Applications
Suggested Research Directions
- Prosody Modeling: Investigating controllable prosody generation
- Cross-Lingual Transfer: Adapting to new languages with minimal data
- Emotion Synthesis: Fine-tuning for emotional speech generation
- Compression Studies: Analyzing audio token efficiency
- Streaming Optimization: Reducing latency for real-time applications
- Robustness Analysis: Handling out-of-distribution text inputs
Academic Collaborations
We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact [email protected]
Performance Characteristics
- Inference Speed: ~150 tokens/second on A100
- Audio Quality: 24kHz (Khaki supports 48kHz)
- Latency: <500ms first audio (Khaki: <50ms)
- Memory Usage: ~16GB VRAM
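The ~16GB figure is consistent with a rough estimate, assuming half-precision (bf16/fp16) weights plus comparable headroom for activations and KV cache (an assumption; actual usage depends on batch size and sequence length):

```python
# Rough sanity check of the ~16GB VRAM figure, assuming bf16/fp16 weights.
params = 4e9                 # ~4 billion parameters
bytes_per_param = 2          # bf16/fp16
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)  # → 8.0
# ~8GB of weights; activations and KV cache roughly double this in practice.
```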
Multilingual Research Notes
While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities
The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.
Ethical Considerations & License
Non-Commercial Use Only
BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services
For commercial licensing of our full Khaki system, contact [email protected]
Responsible AI Guidelines
- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures
Citation
If you use BeigeTTS in your research, please cite:
```bibtex
@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```
Related Work
BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)
Future Research Releases
We plan to release additional research artifacts:
- TaupeVC: Voice conversion research model
- EcruTTS: Lightweight edge deployment model
- SandAlign: Forced alignment for TTS training
Support & Community
- Research inquiries: [email protected]
- Technical issues: GitHub Issues
- Commercial licensing: [email protected]
Acknowledgments
We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community
Disclaimer
BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.
BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai