BeigeTTS: Research Release for Neural Speech Synthesis

Overview

BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

Research Context & Motivation

BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at a significantly larger scale, with enhanced capabilities including:

Multi-speaker voice cloning (10,000+ voices)
Real-time multilingual synthesis (57 languages)
Emotion and prosody transfer
Sub-50ms streaming latency
Production-grade robustness

BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.

Technical Architecture

Model Foundation

Base Model: Google Gemma-3 4B Instruct
Parameter Count: ~4 billion parameters (Khaki uses 70B+)
Audio Codec: NeuCodec (24kHz, single codebook)
Training Steps: 1,435,000 steps
Context Length: 2048 tokens
Vocabulary Size: Extended to 327,690 tokens (includes NeuCodec token space)

Research Implications

This release enables researchers to explore:

Unified Text-Audio Modeling: How large language models can be adapted for audio generation tasks
Token-Based Audio Synthesis: Advantages of discrete token representations over continuous methods
Efficient Streaming: Real-time generation with minimal latency
Cross-Modal Learning: Transfer learning between text and audio modalities

Token Space Design

The model employs a unified token space combining text and audio:

Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
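
The ranges above imply a fixed offset between a NeuCodec code index and its ID in the unified vocabulary. A minimal sketch of that mapping follows. The constant and function names are illustrative and not part of any released API, and the sketch assumes codes are laid out contiguously starting at 262,154. IDs 262,147-262,153 are not described in this card, so they are treated here as neither text nor audio.

```python
# Constants taken from the token ranges listed in this card.
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec token ID
NUM_AUDIO_CODES = 65_536       # number of NeuCodec audio tokens
VOCAB_SIZE = 327_690           # extended vocabulary size


def is_audio_token(token_id: int) -> bool:
    """True if token_id falls inside the NeuCodec audio token range."""
    return AUDIO_TOKEN_OFFSET <= token_id < AUDIO_TOKEN_OFFSET + NUM_AUDIO_CODES


def code_to_token(code: int) -> int:
    """Map a NeuCodec code index (0..65535) to its unified vocabulary ID."""
    if not 0 <= code < NUM_AUDIO_CODES:
        raise ValueError(f"code {code} is outside the codebook")
    return code + AUDIO_TOKEN_OFFSET


def token_to_code(token_id: int) -> int:
    """Map a unified vocabulary ID back to a NeuCodec code index."""
    if not is_audio_token(token_id):
        raise ValueError(f"token {token_id} is not an audio token")
    return token_id - AUDIO_TOKEN_OFFSET


# Sanity check: the last audio token is the last ID in the extended vocabulary.
assert code_to_token(NUM_AUDIO_CODES - 1) == VOCAB_SIZE - 1
```

Keeping the audio codes in one contiguous block means the conversion in either direction is a single addition or subtraction, with no lookup table needed.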

Capabilities & Limitations

Current Capabilities (BeigeTTS)

High-quality English speech synthesis
Natural prosody and intonation
Streaming generation support
Adjustable speaking rate and style
Context-aware generation

Production Capabilities (Khaki - Not Released)

Multilingual: 57 languages with accent control
Voice Cloning: Zero-shot and few-shot speaker adaptation
Emotion Control: 12 distinct emotional states
Ultra-Low Latency: <50ms time-to-first-audio
Long-Form: Stable generation for 30+ minute audio
Voice Conversion: Real-time voice transformation
Singing Synthesis: Musical vocal generation

Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:

English-only synthesis (multilingual reserved for Khaki)
Single speaker (multi-speaker in Khaki)
10-second maximum generation (unlimited in Khaki)
No voice cloning (available in Khaki)
Research license only

Installation

pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

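# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,  # passes input_ids together with the attention mask
        max_new_tokens=500,
        do_sample=True,  # temperature and top_p only apply when sampling is enabled
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )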

# Decode audio (see inference script for full implementation)
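
The decoding step left open above could look roughly like the following. This is a sketch, not the project's inference script: the helper names extract_audio_codes and decode_to_wav are made up for illustration, the offset 262,154 comes from the token ranges in this card, and the decode_code call with a [batch, 1, frames] code tensor follows NeuCodec's published usage example and should be checked against the installed version.

```python
AUDIO_END = 262_146            # from the token space in this card
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec token ID
NUM_AUDIO_CODES = 65_536


def extract_audio_codes(token_ids):
    """Collect NeuCodec code indices from generated IDs, stopping at AUDIO_END."""
    codes = []
    for tok in token_ids:
        if tok == AUDIO_END:
            break
        if AUDIO_TOKEN_OFFSET <= tok < AUDIO_TOKEN_OFFSET + NUM_AUDIO_CODES:
            codes.append(tok - AUDIO_TOKEN_OFFSET)
        # Any other ID (text or marker token) is skipped.
    return codes


def decode_to_wav(codes, neucodec, path="output.wav", sample_rate=24_000):
    """Decode code indices to a waveform and write it to disk.

    Assumption: neucodec.decode_code accepts a LongTensor of shape
    [1, 1, frames] and returns audio of shape [1, 1, samples].
    """
    # Heavy imports stay local so extract_audio_codes works without them.
    import torch
    import soundfile as sf

    if not codes:
        raise ValueError("no audio tokens found in the generated sequence")
    code_tensor = torch.tensor(codes, dtype=torch.long).view(1, 1, -1)
    with torch.no_grad():
        audio = neucodec.decode_code(code_tensor)
    sf.write(path, audio[0, 0].cpu().numpy(), sample_rate)
```

With the Quick Start variables, this would be used as codes = extract_audio_codes(outputs[0, inputs.input_ids.shape[1]:].tolist()) followed by decode_to_wav(codes, neucodec). Slicing off the prompt first keeps the text tokens out of the scan.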

Research Applications

Suggested Research Directions

Prosody Modeling: Investigating controllable prosody generation
Cross-Lingual Transfer: Adapting to new languages with minimal data
Emotion Synthesis: Fine-tuning for emotional speech generation
Compression Studies: Analyzing audio token efficiency
Streaming Optimization: Reducing latency for real-time applications
Robustness Analysis: Handling out-of-distribution text inputs

Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact [email protected]

Performance Characteristics

Inference Speed: ~150 tokens/second on A100
Audio Quality: 24kHz (Khaki supports 48kHz)
Latency: <500ms first audio (Khaki: <50ms)
Memory Usage: ~16GB VRAM
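
These figures can be related with some back-of-the-envelope arithmetic. The sketch below assumes the codec emits 50 audio tokens per second of speech. That rate is not stated in this card, but it is consistent with the 10-second generation limit and the max_new_tokens=500 setting in Quick Start. The function names are illustrative.

```python
TOKENS_PER_AUDIO_SECOND = 50        # assumption, not stated in this card
GENERATION_TOKENS_PER_SECOND = 150  # inference speed quoted for an A100


def audio_seconds(num_tokens, codec_rate=TOKENS_PER_AUDIO_SECOND):
    """Duration of speech represented by num_tokens audio tokens."""
    return num_tokens / codec_rate


def generation_seconds(num_tokens, gen_rate=GENERATION_TOKENS_PER_SECOND):
    """Time needed to generate num_tokens tokens at the quoted speed."""
    return num_tokens / gen_rate


def real_time_factor(gen_rate=GENERATION_TOKENS_PER_SECOND,
                     codec_rate=TOKENS_PER_AUDIO_SECOND):
    """Seconds of audio produced per second of generation."""
    return gen_rate / codec_rate


print(audio_seconds(500))                 # 10.0
print(round(generation_seconds(500), 2))  # 3.33
print(real_time_factor())                 # 3.0
```

Under that assumption the model generates about three times faster than real time, so a full 10-second clip would take roughly 3.3 seconds of token generation once the prompt has been processed.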

Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:

Language-specific token embeddings
Cross-lingual phoneme mapping
Accent and dialect modeling
Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

Ethical Considerations & License

Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

✅ Research and academic use
✅ Personal experimentation
✅ Open-source contributions
❌ Commercial applications
❌ Production deployment
❌ Monetized services

For commercial licensing of our full Khaki system, contact [email protected]

Responsible AI Guidelines

Always disclose AI-generated content
Do not use for impersonation without consent
Respect privacy and intellectual property
Consider potential biases in synthesis
Implement appropriate safety measures

Citation

If you use BeigeTTS in your research, please cite:

@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}

Related Work

BeigeTTS builds upon:

Gemma 3 (Google, 2025)
NeuCodec (Neuphonic, 2024)
Our production Khaki TTS system (not publicly available)

Future Research Releases

We plan to release additional research artifacts:

TaupeVC: Voice conversion research model
EcruTTS: Lightweight edge deployment model
SandAlign: Forced alignment for TTS training

Support & Community

Research inquiries: [email protected]
Technical issues: GitHub Issues
Commercial licensing: [email protected]

Acknowledgments

We thank the open-source community and our research partners. Special recognition to:

Google for the Gemma foundation model
Neuphonic for NeuCodec
The broader TTS research community

Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.

BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai
