🎧 VocalNet-Qwen3-1.7B Model Card

VocalNet-Qwen3-1.7B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.

The official repo for model training and inference will be open-sourced as soon as possible.

🏆 VocalBench Performance

| Model | Knowledge | Reasoning | Creativity | UTMOS | WER | Single-Round | Multi-Round | Instruction Following | Emotional Empathy | Safety | Robust | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Omni (0.5B) | 2.20 | 1.291 | 1.4725 | 4.435 | 19.571 | 1.645 | - | 0.00 | 5.428 | 81.25 | 84.14 | 40.646 |
| Mini-Omni2 (0.5B) | 4.65 | 1.501 | 1.8025 | 4.413 | 36.269 | 1.915 | - | 0.11 | 5.709 | 88.50 | 82.26 | 43.224 |
| SLAM-Omni (0.5B) | 12.05 | 1.875 | 2.5175 | 4.424 | 6.065 | 2.880 | 1.9800 | 3.11 | 6.452 | 90.25 | 77.91 | 54.649 |
| VocalNet-1B (1B) | 43.00 | 2.869 | 3.1800 | 4.437 | 5.123 | 3.335 | 3.2550 | 16.11 | 6.754 | 89.00 | 92.42 | 66.632 |
| VocalNet-Qwen3-1.7B (1.7B) | 45.65 | 3.712 | 3.3625 | 4.353 | 1.775 | 3.450 | 3.6325 | 31.89 | 7.000 | 82.75 | 91.47 | 72.152 |
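
To make the comparison with the other models easier to read, the WER and Overall columns from the table above can be compared directly. The snippet below is only an illustrative sketch: the helper names `relative_wer_reduction` and `rank_by_overall` are hypothetical, the values are copied from the table, and it assumes that WER is a lower-is-better percentage and that Overall is higher-is-better.

```python
# Scores copied from the VocalBench table above: (WER, Overall) per model.
# Assumption: lower WER is better, higher Overall is better.
scores = {
    "Mini-Omni (0.5B)": (19.571, 40.646),
    "Mini-Omni2 (0.5B)": (36.269, 43.224),
    "SLAM-Omni (0.5B)": (6.065, 54.649),
    "VocalNet-1B (1B)": (5.123, 66.632),
    "VocalNet-Qwen3-1.7B (1.7B)": (1.775, 72.152),
}


def relative_wer_reduction(baseline: str, candidate: str) -> float:
    """Return the fraction by which `candidate` lowers WER relative to `baseline`."""
    base_wer = scores[baseline][0]
    cand_wer = scores[candidate][0]
    return (base_wer - cand_wer) / base_wer


def rank_by_overall() -> list[str]:
    """Return model names sorted from highest to lowest Overall score."""
    return sorted(scores, key=lambda name: scores[name][1], reverse=True)


if __name__ == "__main__":
    reduction = relative_wer_reduction(
        "VocalNet-1B (1B)", "VocalNet-Qwen3-1.7B (1.7B)"
    )
    print(f"Relative WER reduction vs. VocalNet-1B: {reduction:.1%}")
    for name in rank_by_overall():
        print(f"{scores[name][1]:7.3f}  {name}")
```

On these numbers, VocalNet-Qwen3-1.7B ranks first on Overall, and its WER is roughly 65% lower than that of VocalNet-1B.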
Format: Safetensors. Model size: 5B params. Tensor type: BF16.

Model tree for VocalNet/VocalNet-Qwen3-1.7B: fine-tuned from Qwen/Qwen3-1.7B.