Text to Speech for Indian Languages

It is a state-of-the-art neural text-to-speech (TTS) model specifically designed for Indian languages. Built on a Llama architecture backbone, it generates natural, expressive speech in Hindi and English, with sub-200ms latency reported on A100-80GB GPUs.

Model Overview

The model is a 3B-parameter autoregressive transformer based on the Llama architecture. It synthesizes high-quality speech from Hindi and English text, including code-mixed input, and outputs audio at a 24kHz sampling rate using the SNAC neural codec.

Model type: Autoregressive Transformer
Base Architecture: Llama (3B parameters)
Languages: Hindi, English
Audio Codec: SNAC @ 24kHz
License: Apache 2.0
Developed by: [email protected], [email protected]
Model URL: https://huggingface.co/SachinTelecmi/Orpheus-tts-hi/tree/main

Key Features

Multilingual Support: Native Hindi and English capabilities with code-mixed support.
Ultra-Fast Inference: Sub-200ms latency on A100-80GB GPUs.
High-Quality Audio: 24kHz output with the SNAC neural codec.
Production-Ready: Optimized for real-world deployment with 4-bit quantization support.
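
As a rough guide to what 4-bit quantization saves, the weight footprint can be estimated from the parameter count alone. The sketch below uses the 3.3B parameter count reported for the checkpoint and counts only the weights. Quantization constants, activations, the KV cache, and the SNAC decoder are ignored, so real memory use will be higher. The helper name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.3e9  # parameter count reported for the checkpoint

for label, bits in [("FP32", 32), ("BF16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>9}: ~{weight_footprint_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights take roughly 6.6 GB in BF16 and about 1.65 GB in 4-bit form.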

How to Get Started with the Model

Installation

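To use this model, install the transformers, torch, torchaudio, accelerate, snac, bitsandbytes, and soundfile libraries. accelerate is required by device_map="auto", and soundfile is used below to write WAV files.

pip install transformers torch torchaudio accelerate
pip install snac bitsandbytes soundfile  # For audio decoding, quantization, and WAV output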
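
Before loading the model, it can help to confirm that the required packages are importable. The sketch below is a small standalone check using only the standard library. The module list is taken from the imports in the usage example, plus accelerate, which transformers relies on for device_map="auto". The function name is illustrative.

```python
import importlib.util

# Top-level modules used by the usage example; accelerate is listed because
# transformers relies on it when device_map="auto" is passed.
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "snac", "bitsandbytes", "soundfile"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```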

Basic Usage

The following Python code demonstrates how to generate speech from text with 4-bit quantization for efficient inference.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from snac import SNAC
import soundfile as sf

# Model configuration for 4-bit inference
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "SachinTelecmi/Orpheus-tts-hi",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("SachinTelecmi/Orpheus-tts-hi", trust_remote_code=True)

# Initialize SNAC decoder
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()

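# Special token IDs; START_OF_SPEECH_TOKEN is assumed from the standard Orpheus token layout
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266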

# The model was trained on a single voice, so no speaker list is defined
speakers = None

def generate_speech(text, speaker=None, temperature=0.4, top_p=0.9):
    """Generate speech from text. `speaker` is currently unused (single-voice model)."""

    # Tokenize the raw text; no speaker prefix is added
    prompt = f"{text}"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)

    input_tokens = [
        START_OF_HUMAN_TOKEN,
        *prompt_tokens,
        END_OF_HUMAN_TOKEN,
        START_OF_AI_TOKEN,
        START_OF_SPEECH_TOKEN
    ]

    input_ids = torch.tensor([input_tokens], device=model.device)

    # Calculate max tokens based on text length
    max_tokens = min(int(len(text) * 1.3) * 7 + 21, 700)

    # Generate audio tokens
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
        )

    # Extract SNAC tokens
    generated_ids = output[0][len(input_tokens):].tolist()
    snac_tokens = [
        token_id for token_id in generated_ids
        if AUDIO_CODE_BASE_OFFSET <= token_id < (AUDIO_CODE_BASE_OFFSET + 7 * 4096)
    ]

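    # Drop a trailing partial frame (generation can stop mid-frame at the token limit)
    snac_tokens = snac_tokens[: len(snac_tokens) // 7 * 7]

    if not snac_tokens:
        raise ValueError("No audio tokens generated")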

    # Decode audio
    audio = decode_snac_tokens(snac_tokens, snac_model)
    return audio

def decode_snac_tokens(snac_tokens, snac_model):
    """De-interleave and decode SNAC tokens to audio"""
    if not snac_tokens or len(snac_tokens) % 7 != 0:
        return None

    # De-interleave tokens into 3 hierarchical levels
    codes_lvl = [[] for _ in range(3)]
    llm_codebook_offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]

    for i in range(0, len(snac_tokens), 7):
        # Level 0: Coarse (1 token)
        codes_lvl[0].append(snac_tokens[i] - llm_codebook_offsets[0])
        # Level 1: Medium (2 tokens)
        codes_lvl[1].append(snac_tokens[i+1] - llm_codebook_offsets[1])
        codes_lvl[1].append(snac_tokens[i+4] - llm_codebook_offsets[4])
        # Level 2: Fine (4 tokens)
        codes_lvl[2].append(snac_tokens[i+2] - llm_codebook_offsets[2])
        codes_lvl[2].append(snac_tokens[i+3] - llm_codebook_offsets[3])
        codes_lvl[2].append(snac_tokens[i+5] - llm_codebook_offsets[5])
        codes_lvl[2].append(snac_tokens[i+6] - llm_codebook_offsets[6])

    # Convert to tensors for SNAC decoder
    hierarchical_codes = []
    for lvl_codes in codes_lvl:
        # Resolve the device from the module's parameters (works for any nn.Module)
        tensor = torch.tensor(lvl_codes, dtype=torch.int32, device=next(snac_model.parameters()).device).unsqueeze(0)
        if torch.any((tensor < 0) | (tensor > 4095)):
            raise ValueError("Invalid SNAC token values")
        hierarchical_codes.append(tensor)

    # Decode with SNAC
    with torch.no_grad():
        audio_hat = snac_model.decode(hierarchical_codes)

    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()





# --- Example Usage ---

# code-mixed
prompt = '''Delhi की एक retail chain ने हमारे solutions से अपनी sales में 30% तक वृद्धि देखी है। <hmm..> उनका feedback बहुत encouraging रहा है ।'''
audio = generate_speech(prompt)
sf.write("output_1.wav", audio, 24000)

prompt = '''जी हाँ, हमारे pricing plans काफी flexible हैं <breath> 
    आप pay as you go या fixed subscription में से choose कर सकते हैं, ''' 
audio = generate_speech(prompt)
sf.write("output_2.wav", audio, 24000)
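
The max_new_tokens heuristic in generate_speech also bounds how much audio a single call can produce. The sketch below reproduces that formula and converts a token budget into an approximate duration. It assumes each 7-token frame decodes to 2048 samples at 24 kHz (about 85 ms). This is an assumption about the snac_24khz codec, not something stated in this card, and the helper names are illustrative.

```python
TOKENS_PER_FRAME = 7
SAMPLES_PER_FRAME = 2048  # assumption: samples decoded per 7-token frame
SAMPLE_RATE = 24000


def token_budget(text: str, cap: int = 700) -> int:
    """Same heuristic as generate_speech: about 1.3 frames per character, capped."""
    return min(int(len(text) * 1.3) * TOKENS_PER_FRAME + 21, cap)


def approx_seconds(num_tokens: int) -> float:
    """Approximate audio duration for a token budget (whole frames only)."""
    frames = num_tokens // TOKENS_PER_FRAME
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


for n_chars in (20, 74, 75, 200):
    budget = token_budget("x" * n_chars)
    print(n_chars, budget, round(approx_seconds(budget), 2))
```

With this heuristic the 700-token cap is reached at 75 characters. Longer inputs would need to be split into shorter segments, or the cap raised, to avoid truncated audio.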

Streaming Inference Example

Clone the repository:

git clone https://github.com/telecmi/Orpheus-TTS

Change into the directory and install the package:
```
cd Orpheus-TTS && pip install orpheus-speech # uses vllm under the hood for fast inference
```
A vLLM release published on March 18th introduced some bugs. If you run into them, revert to the earlier version by running pip install vllm==0.7.3 after pip install orpheus-speech.

Run the example below:

from orpheus_tts import OrpheusModel
import wave
import time

# Download the checkpoints folder from Hugging Face first:
# https://huggingface.co/SachinTelecmi/Orpheus-tts-hi

model = OrpheusModel(model_name="checkpoints", max_model_len=2048)
prompt = '''Delhi की एक retail chain ने हमारे solutions से अपनी sales में 30% तक वृद्धि देखी है। <hmm..> उनका feedback बहुत encouraging रहा है ।'''
filename = "prompt-hi.wav"
start_time = time.monotonic()
syn_tokens = model.generate_speech(
   prompt=prompt,
   voice=None,
   )

with wave.open(filename, "wb") as wf:
   wf.setnchannels(1)
   wf.setsampwidth(2)
   wf.setframerate(24000)

   total_frames = 0
   chunk_counter = 0
   for audio_chunk in syn_tokens:  # output streaming
      chunk_counter += 1
      frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())
      total_frames += frame_count
      wf.writeframes(audio_chunk)
   duration = total_frames / wf.getframerate()

end_time = time.monotonic()
print(f"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio")
# Notebook-only playback; requires IPython
from IPython.display import Audio
Audio(filename)

# inference script is in Orpheus-TTS/realtime_streaming_example/streaming.py
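
The timing printed by the script above can be summarised as a real-time factor (RTF): generation time divided by audio duration, where values below 1 mean faster than real time. The sketch below recomputes the duration from the number of PCM bytes written, using the same 16-bit mono 24 kHz settings as the wave file. The function names are illustrative and not part of orpheus_tts.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int = 24000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio given its size in bytes."""
    frames = num_bytes // (sample_width * channels)
    return frames / sample_rate


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """Generation time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


# Example: 96,000 bytes of 16-bit mono audio at 24 kHz is 2 seconds long.
audio_s = pcm_duration_seconds(96_000)
print(f"duration={audio_s:.2f}s, RTF={real_time_factor(0.5, audio_s):.2f}")
```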

Samples

▶️ Listen: Output_1.wav

▶️ Listen: Output_2.wav

Voice Cloning Example

# See the voice_clone/clone.py script: https://github.com/telecmi/Orpheus-TTS/tree/main/voice_clone/clone.py

from voice_clone import OrpheusTTSVoiceClone
from pathlib import Path

voice_cloner = OrpheusTTSVoiceClone(model_name="SachinTelecmi/Orpheus-tts-hi", device="cuda")

# Text to synthesize
target_texts = [
   "Hi IIT madras is currently doing great for indian research and its proud to be associated with it."
]

reference_pairs = [(".voice_clone/input_reference.wav", 
                  "Delhi की एक retail chain ने हमारे solutions से अपनी sales में 30% तक वृद्धि देखी है। <hmm..> उनका feedback बहुत encouraging रहा है ।")]
# Process each reference
for audio_path, transcript in reference_pairs:
   print(f"Processing reference: {audio_path} - {transcript}")
   
   # Clone voice
   cloned_audio = voice_cloner.clone_voice(audio_path, transcript, target_texts)
   
   # Prepare output paths
   audio_stem = Path(audio_path).stem
   output_dir = Path(audio_path).parent / "inference"
   output_paths = [
      str(output_dir / f"{audio_stem}_{i}.wav") 
      for i in range(len(target_texts))
   ]
   
   # Save cloned audio
   voice_cloner.save_audio(cloned_audio, output_paths)
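
To make the output naming explicit, the sketch below reproduces the path construction from the loop above for a hypothetical reference file. Whether save_audio creates the inference directory itself is not stated here, so creating it up front is shown (commented out) as a precaution. The file path and function name are illustrative.

```python
from pathlib import PurePosixPath


def cloned_output_paths(audio_path: str, num_texts: int) -> list:
    """Build <parent>/inference/<stem>_<i>.wav for each target text."""
    ref = PurePosixPath(audio_path)
    output_dir = ref.parent / "inference"
    return [str(output_dir / f"{ref.stem}_{i}.wav") for i in range(num_texts)]


paths = cloned_output_paths("voice_clone/input_reference.wav", 2)
print(paths)

# Precaution (assumption: save_audio may not create missing directories):
# from pathlib import Path; Path(paths[0]).parent.mkdir(parents=True, exist_ok=True)
```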

Uses

It is ideal for a wide range of applications requiring high-quality, low-latency speech synthesis for Indian languages, including:

Accessibility: Screen readers and voice-enabled assistance for visually impaired users.
Customer Service: IVR systems, voice bots, and automated announcements.
Content Creation: Dubbing for videos, e-learning materials, and audiobooks.
Automotive: In-car navigation and infotainment systems.
Edge Devices: Voice-enabled smart devices and IoT applications.
Real-Time Streaming: Supports real-time streaming with vLLM, with a time to first byte (TTFB) of roughly under 200ms on an A100 GPU.
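
Time to first byte can be measured on any chunk iterator, including the generator returned by a streaming call. The sketch below is a generic wrapper and uses a fake generator in place of the model so that it runs anywhere. measure_ttfb and fake_stream are illustrative names, not part of vLLM or orpheus_tts.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk stream; return (ttfb_s, total_s, total_bytes)."""
    start = time.monotonic()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            ttfb = time.monotonic() - start  # latency until the first chunk
        total_bytes += len(chunk)
    total = time.monotonic() - start
    if ttfb is None:
        raise ValueError("stream produced no chunks")
    return ttfb, total, total_bytes


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Stand-in for a streaming TTS generator: yields 4096-byte chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 4096


ttfb, total, size = measure_ttfb(fake_stream())
print(f"TTFB={ttfb * 1000:.1f} ms, total={total * 1000:.1f} ms, bytes={size}")
```

For a fair measurement the clock should start before the request is issued. With a lazy generator, as here, that coincides with the first iteration.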

Technical Improvements

The SNAC decoder is the latency bottleneck, so the key to ultra-low latency is warming it up before serving requests. When serving with vLLM, every other module in the pipeline should be warmed up as well to get the best performance.
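
A warm-up pass amounts to running the decoder a few times on throwaway input before the first real request, so that one-time costs such as kernel loading and memory allocation are paid up front. The sketch below shows the idea with a stand-in decode function so it runs without a GPU. In practice decode_fn would convert the three code lists to tensors and call the SNAC decoder. The helper names and the choice of three iterations are assumptions.

```python
def dummy_snac_codes(n_frames: int):
    """Zero-valued codes shaped like SNAC's three levels: n, 2n and 4n entries."""
    return [[0] * n_frames, [0] * (2 * n_frames), [0] * (4 * n_frames)]


def warm_up(decode_fn, n_frames: int = 4, iterations: int = 3) -> int:
    """Call decode_fn on dummy codes several times; return the number of calls."""
    codes = dummy_snac_codes(n_frames)
    for _ in range(iterations):
        decode_fn(codes)  # output is discarded; only the side effects matter
    return iterations


# Stand-in decoder that just records the input sizes it was called with.
calls = []
warm_up(lambda codes: calls.append([len(level) for level in codes]))
print(calls)
```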

Architecture

It leverages a 3B parameter transformer-based architecture with several key innovations:

Base Architecture: Llama-style autoregressive transformer (3B parameters)
Audio Codec: SNAC (24kHz) for high-quality audio token generation
Non-Speech Tokens: Special tokens such as <hmm..>, <breath>, and <think>.
Parameter-Efficient Training: LoRA adaptation with differentiated ranks for attention and FFN modules.
Context Length: 4096 tokens
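
The link between the language model and the SNAC codec is the 7-token frame: each frame carries one coarse, two medium, and four fine codes, and each of the seven positions has its own 4096-entry ID range starting at 128266. The sketch below mirrors the de-interleaving in decode_snac_tokens using plain lists, so the layout can be checked without loading either model. The function name is illustrative.

```python
AUDIO_CODE_BASE_OFFSET = 128266
CODEBOOK_SIZE = 4096
# Frame position -> SNAC level (0 = coarse, 1 = medium, 2 = fine)
POSITION_TO_LEVEL = [0, 1, 2, 2, 1, 2, 2]


def deinterleave(tokens):
    """Split a flat list of audio token IDs into three per-level code lists."""
    if len(tokens) % 7 != 0:
        raise ValueError("token count must be a multiple of 7")
    levels = [[], [], []]
    for i, token in enumerate(tokens):
        pos = i % 7
        code = token - (AUDIO_CODE_BASE_OFFSET + pos * CODEBOOK_SIZE)
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"token {token} is out of range for position {pos}")
        levels[POSITION_TO_LEVEL[pos]].append(code)
    return levels


# One synthetic frame whose positions 0..6 carry the codes 10..16.
frame = [AUDIO_CODE_BASE_OFFSET + p * CODEBOOK_SIZE + (10 + p) for p in range(7)]
print(deinterleave(frame))  # [[10], [11, 14], [12, 13, 15, 16]]
```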

Training

Training Infrastructure

Hardware: 1× NVIDIA A100 80GB GPU
Precision: BF16 mixed-precision training with gradient checkpointing, using the Unsloth library
Memory Optimization: 4-bit quantization

Training Configuration

Full fine-tuning

Training Data

It was trained on proprietary, high-quality datasets specifically curated for Indian language TTS.

Data Volume: 4000 audio utterances of a single speaker
Languages: Native Hindi and English utterances with code-mixed support
Speaker Diversity: 1 professional voice artist with distinct characteristics
Audio Quality: Studio-grade recordings at 24kHz sampling rate
Content Diversity: Conversational, narrative, expressive, and informational styles

Note: The training datasets are proprietary and not publicly available.

Risks, Limitations and Biases

Language Support: Currently supports only Hindi and English. Performance on other Indian languages is not guaranteed.
Speaker Diversity: Limited to 1 speaker voice, which may not represent the full diversity of Indian accents and dialects.
Hardware Requirements: Requires a GPU for real-time or near-real-time inference. CPU performance will be significantly slower.
Input Length: The model is limited to a maximum input length of 2048 tokens.
Bias: The model's performance and voice characteristics are a reflection of the proprietary training data. It may exhibit biases present in the data.

Future Updates

We are actively working on expanding the model's capabilities:

Support for Odia, Tamil, Telugu, Bengali, Marathi, and other Indian languages.
Additional speaker voices with regional accents.
Emotion and prosody control tokens.
CPU optimization for edge deployment.
Serving with the TensorRT-LLM engine (ongoing).

Acknowledgments

This project builds on unsloth/orpheus-3b-0.1-ft by Unsloth.

Model size: 3.3B params (Safetensors)
Tensor type: F32, BF16