Model Performance Overview

Metrics:

  • PESQ: Perceptual Evaluation of Speech Quality (higher = better).
  • STOI: Short-Time Objective Intelligibility (closer to 1 = better).
  • SI-SDR: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
| Model          | PESQ@200 | STOI@200 | SI-SDR@200 |
|----------------|----------|----------|------------|
| Fish-audio-1.5 | 1.20     | 0.16     | 23.0       |
| SALT-tts       | 1.11     | 0.16     | 23.58      |
| SALT-tts+asr   | 1.09     | 0.18     | 23.09      |
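Of the three metrics, SI-SDR is the simplest to compute directly. As a minimal sketch (using the standard SI-SDR definition; the function name and `eps` guard are ours, not from this model card):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better).

    The estimate is decomposed into a component along the reference
    (signal) and a residual (distortion), so globally rescaling the
    estimate does not change the score.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # signal component
    noise = estimate - target    # distortion component
    return 10 * np.log10((np.dot(target, target) + eps) /
                         (np.dot(noise, noise) + eps))
```

Because of the projection step, `si_sdr(ref, est)` and `si_sdr(ref, 3.0 * est)` give the same score, which is what makes the metric robust to loudness differences between systems.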

Our Solution

  • Method: Extends a pre-trained LLM with audio tokens and fine-tunes on TTS and ASR tasks.
  • Training:
    • BigCodec tokenizer (supports Slavic languages) for speech generation.
    • SpeechTokenizer (semantic tokens only) for speech recognition.
    • Training time: 168 H100 GPU hours.
  • Advantages: Unified LM loss for dual tasks, minimal training overhead.
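The "audio tokens + unified LM loss" idea above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the vocabulary sizes are assumptions (Qwen2.5's text vocabulary is roughly 152K; the codec codebook size of 8192 is hypothetical), and the helper names are ours.

```python
import numpy as np

TEXT_VOCAB = 151_936   # approximate Qwen2.5 text vocabulary size (assumption)
CODEC_VOCAB = 8_192    # hypothetical codec codebook size
EXTENDED_VOCAB = TEXT_VOCAB + CODEC_VOCAB

def audio_to_token_ids(codec_codes):
    """Shift codec codebook indices into the extended LLM vocabulary."""
    return [TEXT_VOCAB + c for c in codec_codes]

def make_tts_example(text_ids, codec_codes):
    # TTS: condition on text, then predict audio tokens.
    return text_ids + audio_to_token_ids(codec_codes)

def make_asr_example(codec_codes, text_ids):
    # ASR: condition on audio tokens, then predict text.
    return audio_to_token_ids(codec_codes) + text_ids

def lm_loss(logits, targets):
    """Next-token cross-entropy over the extended vocabulary.

    The same loss covers both tasks; only the sequence layout differs.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Since both tasks reduce to next-token prediction over one merged vocabulary, no task-specific heads or losses are needed, which is what keeps the training overhead minimal.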

Resources


Model details

  • Format: Safetensors
  • Model size: 502M params
  • Tensor type: F32

Model tree for Vikhrmodels/salt-qwen2.5-0.5b-asr-tts

  • Base model: Qwen/Qwen2.5-0.5B (this model is a fine-tune of it)
  • Quantizations: 1 model

Datasets used to train Vikhrmodels/salt-qwen2.5-0.5b-asr-tts