---
library_name: transformers
tags:
- chatterbox
- text-to-speech
- tts
- german
- kartoffel
language:
- de
base_model:
- ResembleAI/chatterbox
---

# Kartoffel-TTS (Based on Chatterbox) - German Text-to-Speech

> The model is still in development and was trained on only 600k samples, without emotion classification, on my two RTX 3090s. I am currently preparing a larger dataset (>2.5M samples) and classifying the exaggeration labels.
## Updates

- The model has been rebuilt using **Chatterbox**, Resemble AI's open-source TTS framework. This allows for **emotion exaggeration control** and improved stability.

## Model Overview

Kartoffel-TTS is a German text-to-speech (TTS) model family based on **Chatterbox**, designed for natural and expressive speech synthesis. The model supports **emotion exaggeration control** and voice cloning.

### Key Features:

1. **Emotion Exaggeration Control**: Adjust the intensity of emotions in speech, from subtle to dramatic.
2. **Expressive Speech**: Capable of producing speech with different emotional tones and expressions.
3. **Fine-Tuned for German**: Optimized for German language synthesis with a focus on naturalness and clarity.

## Installation

Install the required library:

```bash
pip install chatterbox-tts
```

---

## Usage Example

Here's how to generate speech using Kartoffel-TTS:

```python
import torch
import soundfile as sf
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

MODEL_REPO = "SebastianBodza/Kartoffelbox-v0.1"
T3_CHECKPOINT_FILE = "t3_kartoffelbox.safetensors"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base Chatterbox model, then patch in the German-finetuned T3 weights.
model = ChatterboxTTS.from_pretrained(device=device)

print("Downloading and applying German patch...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=T3_CHECKPOINT_FILE)
t3_state = load_file(checkpoint_path, device="cpu")
model.t3.load_state_dict(t3_state)
print("Patch applied successfully.")

text = "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand."
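# Optional helper (my own sketch, not part of the chatterbox API): Chatterbox
# tends to be most stable on shorter utterances, so very long passages can be
# split into sentence-sized chunks and synthesized one generate() call at a time.
import re

def split_sentences(text, max_chars=300):
    # Naive splitter: break on sentence-ending punctuation, then greedily
    # pack sentences into chunks of at most max_chars characters.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks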
reference_audio_path = "/content/uitoll.mp3"
output_path = "output_cloned_voice.wav"

print("Generating speech...")
with torch.inference_mode():
    wav = model.generate(
        text,
        audio_prompt_path=reference_audio_path,
        exaggeration=0.5,
        temperature=0.6,
        cfg_weight=0.3,
    )

sf.write(output_path, wav.squeeze().cpu().numpy(), model.sr)
print(f"Audio saved to {output_path}")
```

## Contributing

To improve the model further, additional high-quality German audio data with accurate transcripts is needed, especially for sounds like laughter, sighs, and other non-verbal expressions. Short audio clips (up to 60 seconds) with accurate transcriptions are particularly valuable. If you have ideas or access to relevant data, collaboration is always welcome; reach out to discuss potential contributions.

## Acknowledgements

This model builds on the following technologies:
- **Chatterbox** by Resemble AI
- **Cosyvoice**
- **HiFT-GAN**
- **Llama**
- **S3Tokenizer**
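A note on the usage example above: if a long text is synthesized in several `model.generate` calls (for example one per sentence), the per-chunk waveforms need to be joined before writing a single file. A minimal sketch, assuming each chunk is a `(1, n_samples)` float tensor as returned by `model.generate`; the helper name and the 200 ms silence gap are my own choices, not part of the library:

```python
import torch

def concat_wavs(wavs, sr, silence_ms=200):
    """Join per-chunk waveforms, inserting a short silence between chunks."""
    gap = torch.zeros(1, int(sr * silence_ms / 1000))
    pieces = []
    for i, wav in enumerate(wavs):
        pieces.append(wav)
        if i < len(wavs) - 1:
            pieces.append(gap)
    return torch.cat(pieces, dim=1)
```

The result can be saved exactly like the single-chunk output above, e.g. `sf.write(output_path, full.squeeze().cpu().numpy(), model.sr)`.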