Model Performance Overview

Metrics:

  • PESQ: Perceptual Evaluation of Speech Quality (higher = better).
  • STOI: Short-Time Objective Intelligibility (closer to 1 = better).
  • SI-SDR: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
| Model          | PESQ@200 | STOI@200 | SI-SDR@200 |
|----------------|----------|----------|------------|
| Fish-audio-1.5 | 1.20     | 0.16     | 23.0       |
| SALT-tts       | 1.11     | 0.16     | 23.58      |
| SALT-tts+asr   | 1.09     | 0.18     | 23.09      |
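Of the three metrics, SI-SDR is the simplest to compute directly. As a minimal sketch (using the standard SI-SDR definition; the function name and `eps` guard are ours, not from this model card):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better).

    The estimate is decomposed into a component along the reference
    (signal) and a residual (distortion), so globally rescaling the
    estimate does not change the score.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # signal component
    noise = estimate - target    # distortion component
    return 10 * np.log10((np.dot(target, target) + eps) /
                         (np.dot(noise, noise) + eps))
```

Because of the projection step, `si_sdr(ref, est)` and `si_sdr(ref, 3.0 * est)` give the same score, which is what makes the metric robust to loudness differences between systems.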

Our Solution

  • Method: Extends a pre-trained LLM with audio tokens and fine-tunes on TTS and ASR tasks.
  • Training:
    • BigCodec tokenizer (supports Slavic languages) for speech generation.
    • SpeechTokenizer (semantic tokens only) for speech recognition.
    • Training time: 168 H100 GPU hours.
  • Advantages: Unified LM loss for dual tasks, minimal training overhead.
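The "audio tokens + unified LM loss" idea above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the vocabulary sizes are assumptions (Qwen2.5's text vocabulary is roughly 152K; the codec codebook size of 8192 is hypothetical), and the helper names are ours.

```python
import numpy as np

TEXT_VOCAB = 151_936   # approximate Qwen2.5 text vocabulary size (assumption)
CODEC_VOCAB = 8_192    # hypothetical codec codebook size
EXTENDED_VOCAB = TEXT_VOCAB + CODEC_VOCAB

def audio_to_token_ids(codec_codes):
    """Shift codec codebook indices into the extended LLM vocabulary."""
    return [TEXT_VOCAB + c for c in codec_codes]

def make_tts_example(text_ids, codec_codes):
    # TTS: condition on text, then predict audio tokens.
    return text_ids + audio_to_token_ids(codec_codes)

def make_asr_example(codec_codes, text_ids):
    # ASR: condition on audio tokens, then predict text.
    return audio_to_token_ids(codec_codes) + text_ids

def lm_loss(logits, targets):
    """Next-token cross-entropy over the extended vocabulary.

    The same loss covers both tasks; only the sequence layout differs.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Since both tasks reduce to next-token prediction over one merged vocabulary, no task-specific heads or losses are needed, which is what keeps the training overhead minimal.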

Resources


Model details

  • Format: Safetensors
  • Model size: 502M params
  • Tensor type: F32

Model tree for Vikhrmodels/salt-qwen2.5-0.5b-asr-tts

  • Base model: Qwen/Qwen2.5-0.5B (this model is a fine-tune of it)
  • Quantizations: 1 model

Datasets used to train Vikhrmodels/salt-qwen2.5-0.5b-asr-tts