# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.
## Running the Application
```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```
The app requires a HuggingFace token set as the HF_TOKEN environment variable to download models.
## Architecture
### Token Flow Pipeline
The system uses a custom token layout that interleaves text and audio in a single sequence:
1. Input prompt construction (`KaniModel.get_input_ids`): `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
   - Optionally prefixed with a speaker ID (e.g., "andrew: Hello world")
2. LLM generation (`KaniModel.model_request`): the model generates a sequence containing the text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`
3. Audio decoding (`NemoAudioPlayer.get_waveform`):
   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Tokens are offset by `audio_tokens_start + (codebook_size * codebook_index)`
   - The NeMo codec reconstructs the waveform from the 4 codebooks (see the sketch after this list)
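A minimal sketch of the decoding math, assuming the layout described above and that the four codebooks appear in order within each interleaved frame. The helper name, tensor shapes, and frame ordering are illustrative assumptions, not the actual `util.py` code:

```python
import torch

# Constants from the token layout (see "Important Token Constants" below)
TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4  # q=4 interleaved codebooks


def extract_codebooks(output_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: slice audio tokens out of an LLM output sequence
    and undo the per-codebook offset, returning a (q, T) tensor of codec codes."""
    ids = output_ids.tolist()
    start = ids.index(START_OF_SPEECH) + 1
    end = ids.index(END_OF_SPEECH)
    audio = ids[start:end]

    # Drop any trailing partial frame so the length is a multiple of q.
    audio = audio[: len(audio) - (len(audio) % NUM_CODEBOOKS)]

    frames = torch.tensor(audio).view(-1, NUM_CODEBOOKS)        # (T, q), interleaved
    offsets = AUDIO_TOKENS_START + CODEBOOK_SIZE * torch.arange(NUM_CODEBOOKS)
    codes = frames - offsets  # undo audio_tokens_start + (codebook_size * codebook_index)
    return codes.T            # (q, T), ready for the NeMo codec
```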
## Key Classes
### NemoAudioPlayer (util.py:27-170)
- Loads NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from the `tokeniser_length` base)
- Validates output has required speech markers
- Extracts and decodes 4-codebook audio tokens from LLM output
- Returns 22050 Hz audio as NumPy array
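The final decode step might look roughly like this. It assumes NeMo's `AudioCodecModel.decode(tokens=..., tokens_len=...)` interface and the tensor shapes shown; check the installed NeMo version, since both are assumptions. It builds on the `extract_codebooks` sketch above and skips the marker validation that `NemoAudioPlayer` performs:

```python
import numpy as np
import torch
from nemo.collections.tts.models import AudioCodecModel

# Assumption: the codec checkpoint name comes from model_config.yaml's nemo_player section.
codec = AudioCodecModel.from_pretrained(model_name="<codec name from model_config.yaml>").eval()


def decode_to_waveform(codes: torch.Tensor) -> np.ndarray:
    """Turn (q, T) codec codes into a 22050 Hz NumPy waveform (shapes are assumptions)."""
    tokens = codes.unsqueeze(0).to(codec.device)                 # (1, q, T) batch of codes
    lengths = torch.tensor([codes.shape[-1]], device=codec.device)
    with torch.no_grad():
        audio, audio_len = codec.decode(tokens=tokens, tokens_len=lengths)
    return audio[0, : audio_len[0]].cpu().numpy()                # NumPy array, like get_waveform
```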
### KaniModel (util.py:172-303)
- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens
- Runs generation with sampling parameters (temp, top_p, repetition_penalty)
- Delegates audio reconstruction to NemoAudioPlayer
- Returns tuple: (audio_array, text, timing_report)
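A hedged sketch of the generation call described above, using the standard `transformers` API. The parameter names mirror the UI controls; the checkpoint placeholder, the mapping of `max_len` to `max_new_tokens`, and the wrapper details are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the model name comes from model_config.yaml; bfloat16 + auto device mapping
# matches the description above.
model = AutoModelForCausalLM.from_pretrained(
    "<model name from model_config.yaml>",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("<model name from model_config.yaml>")


def model_request_sketch(input_ids: torch.Tensor, temperature: float, top_p: float,
                         repetition_penalty: float, max_len: int) -> torch.Tensor:
    """Sampling-based generation roughly like KaniModel.model_request."""
    with torch.no_grad():
        return model.generate(
            input_ids=input_ids.to(model.device),
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            max_new_tokens=max_len,   # assumption: max_len caps new tokens
        )
```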
### InitModels (util.py:305-343)
- Factory that loads all models from `model_config.yaml` at startup
- Returns dict mapping model names to `KaniModel` instances
- All models share the same `NemoAudioPlayer` instance
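A minimal sketch of that factory pattern. The constructor signatures of `NemoAudioPlayer` and `KaniModel` are assumptions; only the config keys documented below are taken from the source:

```python
import yaml

from util import KaniModel, NemoAudioPlayer  # classes defined in util.py


def init_models_sketch(config_path: str = "model_config.yaml") -> dict:
    """Load every model listed in model_config.yaml, sharing one NemoAudioPlayer."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    player = NemoAudioPlayer(cfg["nemo_player"])    # single shared codec wrapper
    return {
        name: KaniModel(model_cfg, player)          # one KaniModel per configured entry
        for name, model_cfg in cfg["models"].items()
    }
```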
### Examples (util.py:345-387)
- Converts the `examples.yaml` structure into Gradio Examples format
- Output order: [text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]
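The conversion might look roughly like this, assuming `examples.yaml` holds a list of dicts whose keys match the parameter names above (the exact schema is an assumption):

```python
import yaml


def load_examples_sketch(path: str = "examples.yaml") -> list:
    """Flatten examples.yaml into the list-of-lists format gr.Examples expects,
    in the documented output order."""
    with open(path) as f:
        examples = yaml.safe_load(f)

    return [
        [
            ex["text"],
            ex["model"],
            ex.get("speaker_id"),   # base models have no speaker control
            ex["temperature"],
            ex["top_p"],
            ex["repetition_penalty"],
            ex["max_len"],
        ]
        for ex in examples
    ]
```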
## Configuration Files
### model_config.yaml
- nemo_player: NeMo codec config (model name, token layout constants)
- models: Dict of available TTS models with device_map and optional speaker_id mappings
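For example, a quick way to inspect which configured models expose speaker selection (any field beyond `models` and `speaker_id` would be an assumption; `speaker_id` being a mapping is taken from the description above):

```python
import yaml

with open("model_config.yaml") as f:
    cfg = yaml.safe_load(f)

for name, model_cfg in cfg["models"].items():
    speakers = model_cfg.get("speaker_id")
    if speakers:
        print(f"{name}: speakers {list(speakers)}")
    else:
        print(f"{name}: random voice (no speaker control)")
```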
### examples.yaml
- List of example prompts with associated parameters for Gradio UI
## Dependency Setup
create_env.py runs before imports in app.py to:
- Install transformers from git main branch (required for compatibility)
- Set OMP_NUM_THREADS=4
- Uses a `/tmp/deps_installed` marker to avoid reinstalling on every run
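A hedged sketch of that bootstrap logic; the pip invocation and marker handling are reconstructed from the description, not copied from `create_env.py`:

```python
import os
import pathlib
import subprocess
import sys

MARKER = pathlib.Path("/tmp/deps_installed")


def ensure_deps():
    """Install git-main transformers once per container and pin the thread count."""
    os.environ["OMP_NUM_THREADS"] = "4"
    if MARKER.exists():
        return  # already installed on a previous run
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "git+https://github.com/huggingface/transformers.git",
    ])
    MARKER.touch()
```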
## Important Token Constants
All special tokens are defined relative to tokeniser_length (64400):
- start_of_speech = tokeniser_length + 1
- end_of_speech = tokeniser_length + 2
- start_of_human = tokeniser_length + 3
- end_of_human = tokeniser_length + 4
- start_of_ai = tokeniser_length + 5
- end_of_ai = tokeniser_length + 6
- pad_token = tokeniser_length + 7
- audio_tokens_start = tokeniser_length + 10
- codebook_size = 4032
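These derivations can be written out directly; the constants mirror the list above, and the last function spells out the per-codebook offset formula from the Token Flow Pipeline:

```python
TOKENISER_LENGTH = 64400

START_OF_SPEECH = TOKENISER_LENGTH + 1      # 64401
END_OF_SPEECH = TOKENISER_LENGTH + 2        # 64402
START_OF_HUMAN = TOKENISER_LENGTH + 3       # 64403
END_OF_HUMAN = TOKENISER_LENGTH + 4         # 64404
START_OF_AI = TOKENISER_LENGTH + 5          # 64405
END_OF_AI = TOKENISER_LENGTH + 6            # 64406
PAD_TOKEN = TOKENISER_LENGTH + 7            # 64407
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10  # 64410
CODEBOOK_SIZE = 4032


def audio_token_id(code: int, codebook_index: int) -> int:
    """Token ID of codec code `code` in codebook `codebook_index` (0..3)."""
    return AUDIO_TOKENS_START + CODEBOOK_SIZE * codebook_index + code
```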
## Multi-Speaker Support
Models with speaker_id mappings in model_config.yaml support voice selection:
- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
- The Gradio UI shows/hides speaker dropdown based on selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control
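A sketch of how this could be wired, assuming standard Gradio update semantics; the model name, speaker list, and component wiring are illustrative ("andrew" is the only speaker named in this file):

```python
import gradio as gr

# Illustrative mapping derived from model_config.yaml speaker_id entries.
SPEAKER_MODELS = {"<multi-speaker model name>": ["andrew"]}


def build_prompt(text: str, speaker: str | None) -> str:
    """Prefix the speaker ID when one is selected, e.g. 'andrew: Hello'."""
    return f"{speaker}: {text}" if speaker else text


def on_model_change(model_name: str):
    """Show the speaker dropdown only for models with speaker_id mappings."""
    speakers = SPEAKER_MODELS.get(model_name)
    if speakers:
        return gr.update(visible=True, choices=speakers, value=speakers[0])
    return gr.update(visible=False, value=None)
```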
## HuggingFace Spaces Deployment
The README.md header contains HF Spaces metadata:
- `sdk: gradio` with version 5.46.0
- `app_file: app.py` as the entrypoint
- References 3 model checkpoints and the NeMo codec
