🎧 VocalNet-Qwen3-8B Model Card

VocalNet-Qwen3-8B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.

It outperforms the monolingual VocalNet-8B on the English evaluation (VocalBench) and improves significantly on VocalNet-ML on the Chinese evaluation (VocalBench-zh).

The official repository for model training and inference will be open-sourced as soon as possible.

🏆 VocalBench Performance

| Model | Knowledge | Reasoning | Creativity | UTMOS | WER | Single-Round | Multi-Round | Instruction Following | Emotional Empathy | Safety | Robust | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-Omni (8B) | 37.40 | 2.591 | 2.8475 | 3.959 | 2.842 | 3.300 | 3.1525 | 14.89 | 6.128 | 27.75 | 83.59 | 57.107 |
| Freeze-Omni (7B) | 44.25 | 3.530 | 2.8850 | 4.381 | 11.460 | 2.960 | - | 12.05 | 6.164 | 86.50 | 65.25 | 58.362 |
| Baichuan-Omni-1.5 (7B) | 49.85 | 3.770 | 3.5900 | 4.014 | 23.452 | 3.840 | - | 28.89 | 5.424 | 83.00 | 74.85 | 60.239 |
| GLM-4-Voice (9B) | 56.40 | 3.641 | 3.2900 | 3.869 | 11.565 | 3.615 | 3.7300 | 31.67 | 6.904 | 71.50 | 57.10 | 61.388 |
| Kimi-Audio (7B) | 62.15 | 3.132 | 3.0950 | 2.360 | 38.001 | 3.150 | 3.5350 | 48.59 | 6.838 | 83.75 | 93.20 | 62.382 |
| LLaMA-Omni2-7B-Bilingual (7B) | 47.75 | 3.066 | 2.8800 | 4.461 | 2.744 | 3.365 | 3.5700 | 21.33 | 6.445 | 36.25 | 90.94 | 62.702 |
| Step-Audio-2-Mini (7B) | 58.50 | 3.672 | 3.2125 | 4.518 | 40.069 | 3.440 | 3.7300 | 34.56 | 6.127 | 80.75 | 87.77 | 62.840 |
| MiniCPM-o 2.6 (7B) | 70.00 | 3.648 | 3.3550 | 4.054 | 18.735 | 3.165 | 3.6675 | 30.00 | 7.080 | 83.25 | 87.27 | 63.886 |
| LLaMA-Omni2-7B (7B) | 53.70 | 3.475 | 2.8575 | 4.459 | 3.155 | 3.340 | 3.5875 | 30.67 | 6.511 | 51.00 | 85.15 | 64.624 |
| Qwen-Omni-Turbo API | 64.95 | 4.058 | 3.1575 | 4.405 | 1.656 | 3.420 | 3.9775 | 22.11 | 6.226 | 65.25 | 90.64 | 70.729 |
| VITA-Audio-Plus-Vanilla (7B) | 52.00 | 4.183 | 3.2800 | 4.173 | 4.858 | 3.520 | - | 33.59 | 6.843 | 88.25 | 89.53 | 71.795 |
| Qwen2.5-Omni (7B) | 69.50 | 4.361 | 3.1825 | 4.174 | 1.154 | 3.538 | 4.0125 | 27.00 | 6.386 | 71.75 | 91.86 | 73.327 |
| Mimo-Audio-Instruct (7B) | 65.20 | 4.050 | 3.6775 | 3.070 | 5.342 | 4.555 | - | 41.22 | 7.560 | 79.00 | 82.46 | 74.106 |
| VocalNet-8B (8B) | 67.95 | 3.748 | 3.5050 | 4.449 | 4.686 | 3.530 | 3.9175 | 35.89 | 7.117 | 92.25 | 92.66 | 74.639 |
| VocalNet-Qwen3-8B (8B) | 68.65 | 4.245 | 3.3625 | 4.355 | 4.005 | 3.690 | 4.0975 | 34.89 | 7.208 | 91.50 | 92.79 | 75.580 |
| GPT Realtime API | 91.30 | 4.692 | 3.9300 | 4.162 | 6.042 | 4.665 | - | 61.11 | 7.996 | 90.25 | 48.22 | 77.230 |
| Cascade (Whisper+GPT-4o+CosyVoice2) | 86.20 | 4.138 | 3.7500 | 4.474 | 4.955 | 3.625 | 4.2050 | 66.33 | 6.769 | 91.50 | 90.79 | 80.291 |
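
To make the comparison with VocalNet-8B concrete, the following is a minimal sketch that computes per-column differences between the two VocalNet rows of the table. The scores are copied from the table. The helper name `compare` and the `LOWER_IS_BETTER` set are illustrative, not part of any VocalNet tooling, and the sketch assumes WER is the only column where a lower value is better.

```python
# Scores copied from the VocalBench table; column order matches the table header.
COLUMNS = [
    "Knowledge", "Reasoning", "Creativity", "UTMOS", "WER",
    "Single-Round", "Multi-Round", "Instruction Following",
    "Emotional Empathy", "Safety", "Robust", "Overall",
]
VOCALNET_8B = [67.95, 3.748, 3.5050, 4.449, 4.686, 3.530,
               3.9175, 35.89, 7.117, 92.25, 92.66, 74.639]
VOCALNET_QWEN3_8B = [68.65, 4.245, 3.3625, 4.355, 4.005, 3.690,
                     4.0975, 34.89, 7.208, 91.50, 92.79, 75.580]

# Assumption: WER (word error rate) is the only metric where lower is better.
LOWER_IS_BETTER = {"WER"}


def compare(baseline, candidate):
    """Return {column: (delta, improved)} for candidate relative to baseline."""
    result = {}
    for name, old, new in zip(COLUMNS, baseline, candidate):
        delta = round(new - old, 4)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        result[name] = (delta, improved)
    return result


deltas = compare(VOCALNET_8B, VOCALNET_QWEN3_8B)
for name, (delta, improved) in deltas.items():
    print(f"{name:<22} {delta:+.4f}  {'better' if improved else 'worse'}")
```

Under these assumptions the Overall score rises by 0.941 points, while individual columns move in both directions, so the Overall gain should not be read as a uniform improvement.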
Model size (Safetensors): 12B params
Tensor type: BF16

Model tree for VocalNet/VocalNet-Qwen3-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: VocalNet/VocalNet-Qwen3-8B (this model)