Soprano: Instant, Ultra‑Realistic Text‑to‑Speech


Overview

Soprano is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy.

With only 80M parameters, Soprano achieves a real‑time factor (RTF) of ~2000×, generating 10 hours of audio in under 20 seconds. Soprano also uses a seamless streaming technique that enables true real‑time synthesis with under 15 ms of latency, multiple orders of magnitude faster than existing TTS pipelines.
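The throughput claim can be sanity‑checked with simple arithmetic:

```python
# Sanity-check the throughput claim: at RTF ~2000x,
# 10 hours of audio should take well under 20 seconds of wall time.
audio_seconds = 10 * 60 * 60       # 10 hours of audio = 36,000 s
rtf = 2000                         # real-time factor (audio time / wall time)
wall_clock_seconds = audio_seconds / rtf
print(wall_clock_seconds)          # 18.0 -> under 20 s, as claimed
```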

This repository contains the model weights for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the LLM's output hidden states.

GitHub: https://github.com/ekwek1/soprano

Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS


Installation

Requirements: Linux or Windows, CUDA‑enabled GPU required (CPU support coming soon).

One‑line install

pip install soprano-tts

Install from source

git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .

Note: By default, Soprano uses LMDeploy to accelerate inference. If LMDeploy cannot be installed in your environment, Soprano can fall back to the Hugging Face Transformers backend (with slower performance). To enable this, pass backend='transformers' when creating the TTS model.
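That fallback can be automated by probing for LMDeploy at startup. A minimal sketch (pick_backend is an illustrative helper, not part of the soprano API):

```python
def pick_backend():
    """Return 'lmdeploy' when it is importable, else fall back to 'transformers'."""
    try:
        import lmdeploy  # noqa: F401  (imported only to test availability)
        return "lmdeploy"
    except ImportError:
        return "transformers"

# Pass the result when constructing the model, e.g.:
#   model = SopranoTTS(backend=pick_backend())
```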


Usage

from soprano import SopranoTTS

model = SopranoTTS()

Basic inference

out = model.infer("Hello world!")

Save output to a file

out = model.infer("Hello world!", "out.wav")

Custom sampling parameters

out = model.infer(
    "Hello world!",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)

Batched inference

out = model.infer_batch(["Hello world!"] * 10)

Save batch outputs to a directory

out = model.infer_batch(["Hello world!"] * 10, "/dir")

Streaming inference

import torch

stream = model.infer_stream("Hello world!", chunk_size=1)

# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)

out = torch.cat(chunks)
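To play or inspect the streamed result outside Python, the concatenated samples can be written to disk. A minimal sketch using only the standard library, assuming mono float samples in [-1, 1] at Soprano's 32 kHz output rate (write_wav is an illustrative helper, not part of the soprano API):

```python
import struct
import wave

def write_wav(path, samples, sample_rate=32000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(s * 32767.0))))
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)

# e.g. write_wav("stream.wav", out.tolist())
```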

Key Features

1. High‑fidelity 32 kHz audio

Soprano synthesizes speech at 32 kHz, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models.

2. Vocos‑based neural decoder

Instead of slow diffusion decoders, Soprano uses a Vocos‑based decoder, enabling orders‑of‑magnitude faster waveform generation while maintaining comparable perceptual quality.

3. Seamless real‑time streaming

Soprano leverages the decoder’s finite receptive field to losslessly stream audio with ultra‑low latency. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays.

4. State‑of‑the‑art neural audio codec

Speech is represented using a neural codec that compresses audio to ~15 tokens/sec at just 0.2 kbps, allowing extremely fast generation and efficient memory usage without sacrificing quality.
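Those two figures imply roughly 13 bits of information per codec token, which can be checked directly:

```python
# Implied information content per codec token.
bitrate_bps = 0.2 * 1000          # 0.2 kbps -> 200 bits per second
tokens_per_second = 15            # approximate codec token rate
bits_per_token = bitrate_bps / tokens_per_second
print(round(bits_per_token, 1))   # 13.3 bits per token
```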

5. Sentence‑level streaming for infinite context

Each sentence is generated independently, enabling effectively infinite generation length while maintaining stability and real‑time performance for long‑form generation.
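The sentence segmentation itself can be as simple as a punctuation-based split. A naive illustrative sketch (not Soprano's actual segmentation logic):

```python
import re

def split_sentences(text):
    """Naively split text after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Each piece could then be synthesized independently and streamed in order:
#   for sentence in split_sentences(long_text):
#       for chunk in model.infer_stream(sentence):
#           play(chunk)  # hypothetical playback function
```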


License

This project is licensed under the Apache-2.0 license.
