BeigeTTS: Research Release for Neural Speech Synthesis

Overview

BeigeTTS is a research release from BlandAI and a scaled-down version of our production Khaki TTS system. It combines Google's Gemma-3 4B architecture with NeuCodec audio token generation: the language model predicts discrete audio tokens, which the codec decodes into a waveform. We're releasing BeigeTTS to the research community to advance neural speech synthesis and to enable academic exploration of large-scale TTS architectures.

Research Context & Motivation

BeigeTTS is a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at significantly larger scale, with enhanced capabilities including:

  • Multi-speaker voice cloning (10,000+ voices)
  • Real-time multilingual synthesis (57 languages)
  • Emotion and prosody transfer
  • Sub-50ms streaming latency
  • Production-grade robustness

BeigeTTS packages the core architectural innovations in a more accessible 4B-parameter model suitable for research purposes.

Technical Architecture

Model Foundation

  • Base Model: Google Gemma-3 4B Instruct
  • Parameter Count: ~4 billion parameters (Khaki uses 70B+)
  • Audio Codec: NeuCodec (24kHz, single codebook)
  • Training Steps: 1,435,000
  • Context Length: 2048 tokens
  • Vocabulary Size: Extended to 327,690 tokens (includes NeuCodec token space)

Research Implications

This release enables researchers to explore:

  1. Unified Text-Audio Modeling: How large language models can be adapted for audio generation tasks
  2. Token-Based Audio Synthesis: Advantages of discrete token representations over continuous methods
  3. Efficient Streaming: Real-time generation with minimal latency
  4. Cross-Modal Learning: Transfer learning between text and audio modalities

Token Space Design

The model employs a unified token space combining text and audio:

Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
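
The layout above can be written down as a few constants and checked arithmetically. The sketch below is illustrative only: `token_kind`, `code_to_token_id`, and `token_id_to_code` are hypothetical helpers, not part of the release, and the one-to-one mapping from codec index k to token ID 262,154 + k is an assumption suggested by the contiguous range rather than something this card states.

```python
# Token-space layout as listed in this card (ranges are inclusive).
TEXT_RANGE = (0, 262_144)         # standard Gemma tokens
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_RANGE = (262_154, 327_689)  # NeuCodec audio tokens
VOCAB_SIZE = 327_690              # extended vocabulary size


def token_kind(token_id: int) -> str:
    """Classify a token ID according to the table (hypothetical helper)."""
    if TEXT_RANGE[0] <= token_id <= TEXT_RANGE[1]:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if AUDIO_RANGE[0] <= token_id <= AUDIO_RANGE[1]:
        return "audio"
    if 0 <= token_id < VOCAB_SIZE:
        # IDs 262,147-262,153 are not described in this card.
        return "unlisted"
    raise ValueError(f"token ID {token_id} is outside the vocabulary")


def code_to_token_id(code: int) -> int:
    """ASSUMPTION: codec index k is stored at token ID 262,154 + k."""
    if not 0 <= code < 65_536:
        raise ValueError("NeuCodec index must be in [0, 65535]")
    return AUDIO_RANGE[0] + code


def token_id_to_code(token_id: int) -> int:
    """Inverse of code_to_token_id, under the same assumption."""
    if token_kind(token_id) != "audio":
        raise ValueError(f"token ID {token_id} is not an audio token")
    return token_id - AUDIO_RANGE[0]


# The audio range holds 65,536 IDs and ends at the last vocabulary slot.
assert AUDIO_RANGE[1] - AUDIO_RANGE[0] + 1 == 65_536
assert AUDIO_RANGE[1] + 1 == VOCAB_SIZE
```

The two assertions confirm that the listed ranges agree with the stated vocabulary size of 327,690.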

Capabilities & Limitations

Current Capabilities (BeigeTTS)

  • High-quality English speech synthesis
  • Natural prosody and intonation
  • Streaming generation support
  • Adjustable speaking rate and style
  • Context-aware generation

Production Capabilities (Khaki - Not Released)

  • Multilingual: 57 languages with accent control
  • Voice Cloning: Zero-shot and few-shot speaker adaptation
  • Emotion Control: 12 distinct emotional states
  • Ultra-Low Latency: <50ms time-to-first-audio
  • Long-Form: Stable generation for 30+ minute audio
  • Voice Conversion: Real-time voice transformation
  • Singing Synthesis: Musical vocal generation

Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:

  • English-only synthesis (multilingual reserved for Khaki)
  • Single speaker (multi-speaker in Khaki)
  • 10-second maximum generation (unlimited in Khaki)
  • No voice cloning (available in Khaki)
  • Research license only

Installation

pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

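# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )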

# Decode audio (see inference script for full implementation)
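
The full decoding logic lives in the inference script. As a rough sketch of the step it has to perform, the snippet below pulls the audio tokens out of the newly generated IDs and shifts them back into codec indices. `extract_audio_codes` is a hypothetical helper, and the offset mapping (token ID minus 262,154) is an assumption based on the token table in this card. The commented lines at the end show how the result might be passed to the codec, assuming NeuCodec exposes `decode_code` with a (batch, 1, frames) input as in its README.

```python
from typing import Iterable, List

AUDIO_END = 262_146    # from the token table in this card
AUDIO_FIRST = 262_154  # first NeuCodec token ID
AUDIO_LAST = 327_689   # last NeuCodec token ID


def extract_audio_codes(new_token_ids: Iterable[int]) -> List[int]:
    """Return codec indices from generated IDs, stopping at AUDIO_END.

    Hypothetical helper: IDs outside the audio range are skipped, and each
    audio ID is shifted by AUDIO_FIRST (an assumed one-to-one mapping).
    """
    codes = []
    for token_id in new_token_ids:
        if token_id == AUDIO_END:
            break
        if AUDIO_FIRST <= token_id <= AUDIO_LAST:
            codes.append(token_id - AUDIO_FIRST)
    return codes


# Possible use with the Quick Start variables (untested sketch):
# new_ids = outputs[0, inputs.input_ids.shape[1]:].tolist()
# codes = torch.tensor(extract_audio_codes(new_ids))[None, None, :]
# audio = neucodec.decode_code(codes)  # assumed API, see lead-in
# sf.write("output.wav", audio[0, 0].cpu().numpy(), 24_000)
```

Slicing off the prompt before extraction keeps any text tokens from the input out of the audio stream.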

Research Applications

Suggested Research Directions

  1. Prosody Modeling: Investigating controllable prosody generation
  2. Cross-Lingual Transfer: Adapting to new languages with minimal data
  3. Emotion Synthesis: Fine-tuning for emotional speech generation
  4. Compression Studies: Analyzing audio token efficiency
  5. Streaming Optimization: Reducing latency for real-time applications
  6. Robustness Analysis: Handling out-of-distribution text inputs

Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact [email protected]

Performance Characteristics

  • Inference Speed: ~150 tokens/second on A100
  • Audio Quality: 24kHz (Khaki supports 48kHz)
  • Latency: <500ms first audio (Khaki: <50ms)
  • Memory Usage: ~16GB VRAM
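
These figures can be related to audio duration with simple arithmetic. The sketch below assumes a codec rate of 50 audio tokens per second of speech; that rate is not stated in this card and should be checked against the NeuCodec documentation. Under that assumption, the `max_new_tokens=500` setting in the Quick Start would correspond to about 10 seconds of audio, and ~150 tokens/second would be roughly three times faster than real time. The function names are illustrative.

```python
CODEC_TOKENS_PER_SECOND = 50  # ASSUMPTION: not stated in this card
SAMPLE_RATE_HZ = 24_000       # from this card (24kHz output)


def audio_seconds(num_audio_tokens: int,
                  tokens_per_second: float = CODEC_TOKENS_PER_SECOND) -> float:
    """Duration of speech represented by a number of audio tokens."""
    return num_audio_tokens / tokens_per_second


def real_time_factor(generation_tokens_per_second: float,
                     tokens_per_second: float = CODEC_TOKENS_PER_SECOND) -> float:
    """Seconds of audio produced per second of compute (>1 is faster than real time)."""
    return generation_tokens_per_second / tokens_per_second


def samples_per_token(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      tokens_per_second: float = CODEC_TOKENS_PER_SECOND) -> float:
    """Number of waveform samples covered by one audio token."""
    return sample_rate_hz / tokens_per_second


print(audio_seconds(500))     # duration for the Quick Start token budget
print(real_time_factor(150))  # using the ~150 tokens/second figure above
print(samples_per_token())
```

If the actual codec rate differs, only `CODEC_TOKENS_PER_SECOND` needs to change.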

Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:

  • Language-specific token embeddings
  • Cross-lingual phoneme mapping
  • Accent and dialect modeling
  • Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

Ethical Considerations & License

Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

  • ✅ Research and academic use
  • ✅ Personal experimentation
  • ✅ Open-source contributions
  • ❌ Commercial applications
  • ❌ Production deployment
  • ❌ Monetized services

For commercial licensing of our full Khaki system, contact [email protected]

Responsible AI Guidelines

  • Always disclose AI-generated content
  • Do not use for impersonation without consent
  • Respect privacy and intellectual property
  • Consider potential biases in synthesis
  • Implement appropriate safety measures

Citation

If you use BeigeTTS in your research, please cite:

@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}

Related Work

BeigeTTS builds upon:

  • Gemma (Google, 2024)
  • NeuCodec (Neuphonic, 2024)
  • Our production Khaki TTS system (not publicly available)

Future Research Releases

We plan to release additional research artifacts:

  • TaupeVC: Voice conversion research model
  • EcruTTS: Lightweight edge deployment model
  • SandAlign: Forced alignment for TTS training

Support & Community

Acknowledgments

We thank the open-source community and our research partners. Special recognition to:

  • Google for the Gemma foundation model
  • Neuphonic for NeuCodec
  • The broader TTS research community

Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.


BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai
