🎧 VocalNet-Qwen3-8B Model Card

VocalNet-Qwen3-8B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.

It outperforms the monolingual VocalNet-8B on the English benchmark (VocalBench) and improves significantly on VocalNet-ML on the Chinese benchmark (VocalBench-zh).

The official repo for model training and inference will be open-sourced as soon as possible.

🏆 VocalBench Performance

Model	Knowledge	Reasoning	Creativity	UTMOS	WER	Single-Round	Multi-Round	Instruction Following	Emotional Empathy	Safety	Robust	Overall
LLaMA-Omni (8B)	37.40	2.591	2.8475	3.959	2.842	3.300	3.1525	14.89	6.128	27.75	83.59	57.107
Freeze-Omni (7B)	44.25	3.530	2.8850	4.381	11.460	2.960	-	12.05	6.164	86.50	65.25	58.362
Baichuan-Omni-1.5 (7B)	49.85	3.770	3.5900	4.014	23.452	3.840	-	28.89	5.424	83.00	74.85	60.239
GLM-4-Voice (9B)	56.40	3.641	3.2900	3.869	11.565	3.615	3.7300	31.67	6.904	71.50	57.10	61.388
Kimi-Audio (7B)	62.15	3.132	3.0950	2.360	38.001	3.150	3.5350	48.59	6.838	83.75	93.20	62.382
LLaMA-Omni2-7B-Bilingual (7B)	47.75	3.066	2.8800	4.461	2.744	3.365	3.5700	21.33	6.445	36.25	90.94	62.702
Step-Audio-2-Mini (7B)	58.50	3.672	3.2125	4.518	40.069	3.440	3.7300	34.56	6.127	80.75	87.77	62.840
MiniCPM-o 2.6 (7B)	70.00	3.648	3.3550	4.054	18.735	3.165	3.6675	30.00	7.080	83.25	87.27	63.886
LLaMA-Omni2-7B (7B)	53.70	3.475	2.8575	4.459	3.155	3.340	3.5875	30.67	6.511	51.00	85.15	64.624
Qwen-Omni-Turbo API	64.95	4.058	3.1575	4.405	1.656	3.420	3.9775	22.11	6.226	65.25	90.64	70.729
VITA-Audio-Plus-Vanilla (7B)	52.00	4.183	3.2800	4.173	4.858	3.520	-	33.59	6.843	88.25	89.53	71.795
Qwen2.5-Omni (7B)	69.50	4.361	3.1825	4.174	1.154	3.538	4.0125	27.00	6.386	71.75	91.86	73.327
Mimo-Audio-Instruct (7B)	65.20	4.050	3.6775	3.070	5.342	4.555	-	41.22	7.560	79.00	82.46	74.106
VocalNet-8B (8B)	67.95	3.748	3.5050	4.449	4.686	3.530	3.9175	35.89	7.117	92.25	92.66	74.639
VocalNet-Qwen3-8B (8B)	68.65	4.245	3.3625	4.355	4.005	3.690	4.0975	34.89	7.208	91.50	92.79	75.580
GPT Realtime API	91.30	4.692	3.9300	4.162	6.042	4.665	-	61.11	7.996	90.25	48.22	77.230
Cascade (Whisper+GPT-4o+CosyVoice2)	86.20	4.138	3.7500	4.474	4.955	3.625	4.2050	66.33	6.769	91.50	90.79	80.291
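
To make the comparison with VocalNet-8B easier to read, the two relevant rows of the table can be diffed metric by metric. The snippet below is only an illustrative sketch: the scores are copied from the table above, while the helper name `compare` and the assumption that WER is the only lower-is-better column are ours and not part of the official evaluation code.

```python
# Scores copied from the VocalBench table above.
METRICS = ["Knowledge", "Reasoning", "Creativity", "UTMOS", "WER",
           "Single-Round", "Multi-Round", "Instruction Following",
           "Emotional Empathy", "Safety", "Robust", "Overall"]
VOCALNET_8B = [67.95, 3.748, 3.5050, 4.449, 4.686, 3.530,
               3.9175, 35.89, 7.117, 92.25, 92.66, 74.639]
VOCALNET_QWEN3_8B = [68.65, 4.245, 3.3625, 4.355, 4.005, 3.690,
                     4.0975, 34.89, 7.208, 91.50, 92.79, 75.580]

# Assumption: WER is the only column where a lower value is better.
LOWER_IS_BETTER = {"WER"}


def compare(old, new):
    """Return {metric: (delta, improved)} where delta = new - old."""
    result = {}
    for name, a, b in zip(METRICS, old, new):
        delta = round(b - a, 4)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        result[name] = (delta, improved)
    return result


report = compare(VOCALNET_8B, VOCALNET_QWEN3_8B)
for name, (delta, improved) in report.items():
    print(f"{name:22s} {delta:+8.4f}  {'better' if improved else 'worse'}")
```

Under these assumptions the new model improves on most columns, including Overall, while Creativity, UTMOS, Instruction Following, and Safety are slightly lower.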


Safetensors

Model size: 12B params
Tensor type: BF16


Model tree for VocalNet/VocalNet-Qwen3-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: this model