Model Performance Overview
Metrics:
- PESQ: Perceptual Evaluation of Speech Quality (higher = better).
- STOI: Short-Time Objective Intelligibility (closer to 1 = better).
- SI-SDR: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
Model | PESQ@200 | STOI@200 | SI-SDR@200 |
---|---|---|---|
Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
SALT-tts | 1.11 | 0.16 | 23.58 |
SALT-tts+asr | 1.09 | 0.18 | 23.09 |
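For readers unfamiliar with SI-SDR, the sketch below shows the standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual, in dB. It is a minimal pure-Python illustration, not the evaluation script behind the table; the function name `si_sdr`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for this example.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a DC offset does not count as signal or distortion.
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference: target = alpha * ref.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything not explained by the scaled reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


if __name__ == "__main__":
    # Synthetic signals, used only to show the call pattern.
    reference = [math.sin(0.1 * i) for i in range(200)]
    estimate = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
    print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.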
Our Solution
- Method: Extends a pre-trained LLM with audio tokens and fine-tunes on TTS and ASR tasks.
- Training:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: 168 H100 GPU hours.
- Advantages: a single LM loss covers both tasks (TTS and ASR), with minimal training overhead.
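One way to picture "extending an LLM with audio tokens" is as an ID-offset scheme: codec codes are shifted past the end of the text vocabulary so that text and audio share one token space and one LM loss. The sketch below is a hypothetical illustration under that assumption, not the project's actual code. The class `AudioVocab`, the helper functions, the vocabulary and codebook sizes, and the `bos`/`sep`/`eos` sequence layout are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioVocab:
    """Hypothetical layout: [text tokens | generation codes | recognition codes]."""
    base_vocab_size: int      # size of the original text vocabulary (assumed)
    tts_codebook_size: int    # codes from the speech-generation tokenizer (assumed)
    asr_codebook_size: int    # codes from the speech-recognition tokenizer (assumed)

    @property
    def tts_offset(self) -> int:
        return self.base_vocab_size

    @property
    def asr_offset(self) -> int:
        return self.base_vocab_size + self.tts_codebook_size

    @property
    def total_size(self) -> int:
        return self.asr_offset + self.asr_codebook_size

    @staticmethod
    def _shift(codes, offset, size):
        # Reject codes outside the codebook before mapping them to token IDs.
        for c in codes:
            if not 0 <= c < size:
                raise ValueError(f"code {c} outside codebook of size {size}")
        return [offset + c for c in codes]

    def encode_tts(self, codes):
        return self._shift(codes, self.tts_offset, self.tts_codebook_size)

    def encode_asr(self, codes):
        return self._shift(codes, self.asr_offset, self.asr_codebook_size)

    def decode_tts(self, token_ids):
        # Map generated token IDs back to codec codes for waveform decoding.
        return [t - self.tts_offset for t in token_ids]


def build_tts_example(text_ids, audio_ids, bos, sep, eos):
    """TTS as next-token prediction: text first, audio tokens as the target."""
    return [bos] + list(text_ids) + [sep] + list(audio_ids) + [eos]


def build_asr_example(audio_ids, text_ids, bos, sep, eos):
    """ASR as next-token prediction: audio tokens first, text as the target."""
    return [bos] + list(audio_ids) + [sep] + list(text_ids) + [eos]


if __name__ == "__main__":
    vocab = AudioVocab(base_vocab_size=32000, tts_codebook_size=8192, asr_codebook_size=1024)
    tts_ids = vocab.encode_tts([0, 5, 8191])
    print(build_tts_example([11, 12], tts_ids, bos=1, sep=3, eos=2))
```

Under this layout both tasks reduce to plain sequences of integers, so the same cross-entropy loss applies to each; the only model change is enlarging the embedding and output layers to `total_size`.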
Resources
- Code: GitHub Repo