VocalNet-Qwen3-8B Model Card
VocalNet-Qwen3-8B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.
It outperforms the monolingual VocalNet-8B on the English benchmark VocalBench (overall score 75.580 vs. 74.639) and improves significantly over VocalNet-ML on the Chinese benchmark VocalBench-zh.
The official repo for model training and inference will be open-sourced as soon as possible.
VocalBench Performance
| Model | Knowledge | Reasoning | Creativity | UTMOS | WER | Single-Round | Multi-Round | Instruction Following | Emotional Empathy | Safety | Robust | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-Omni (8B) | 37.40 | 2.591 | 2.8475 | 3.959 | 2.842 | 3.300 | 3.1525 | 14.89 | 6.128 | 27.75 | 83.59 | 57.107 |
| Freeze-Omni (7B) | 44.25 | 3.530 | 2.8850 | 4.381 | 11.460 | 2.960 | - | 12.05 | 6.164 | 86.50 | 65.25 | 58.362 |
| Baichuan-Omni-1.5 (7B) | 49.85 | 3.770 | 3.5900 | 4.014 | 23.452 | 3.840 | - | 28.89 | 5.424 | 83.00 | 74.85 | 60.239 |
| GLM-4-Voice (9B) | 56.40 | 3.641 | 3.2900 | 3.869 | 11.565 | 3.615 | 3.7300 | 31.67 | 6.904 | 71.50 | 57.10 | 61.388 |
| Kimi-Audio (7B) | 62.15 | 3.132 | 3.0950 | 2.360 | 38.001 | 3.150 | 3.5350 | 48.59 | 6.838 | 83.75 | 93.20 | 62.382 |
| LLaMA-Omni2-7B-Bilingual (7B) | 47.75 | 3.066 | 2.8800 | 4.461 | 2.744 | 3.365 | 3.5700 | 21.33 | 6.445 | 36.25 | 90.94 | 62.702 |
| Step-Audio-2-Mini (7B) | 58.50 | 3.672 | 3.2125 | 4.518 | 40.069 | 3.440 | 3.7300 | 34.56 | 6.127 | 80.75 | 87.77 | 62.840 |
| MiniCPM-o 2.6 (7B) | 70.00 | 3.648 | 3.3550 | 4.054 | 18.735 | 3.165 | 3.6675 | 30.00 | 7.080 | 83.25 | 87.27 | 63.886 |
| LLaMA-Omni2-7B (7B) | 53.70 | 3.475 | 2.8575 | 4.459 | 3.155 | 3.340 | 3.5875 | 30.67 | 6.511 | 51.00 | 85.15 | 64.624 |
| Qwen-Omni-Turbo API | 64.95 | 4.058 | 3.1575 | 4.405 | 1.656 | 3.420 | 3.9775 | 22.11 | 6.226 | 65.25 | 90.64 | 70.729 |
| VITA-Audio-Plus-Vanilla (7B) | 52.00 | 4.183 | 3.2800 | 4.173 | 4.858 | 3.520 | - | 33.59 | 6.843 | 88.25 | 89.53 | 71.795 |
| Qwen2.5-Omni (7B) | 69.50 | 4.361 | 3.1825 | 4.174 | 1.154 | 3.538 | 4.0125 | 27.00 | 6.386 | 71.75 | 91.86 | 73.327 |
| Mimo-Audio-Instruct (7B) | 65.20 | 4.050 | 3.6775 | 3.070 | 5.342 | 4.555 | - | 41.22 | 7.560 | 79.00 | 82.46 | 74.106 |
| VocalNet-8B (8B) | 67.95 | 3.748 | 3.5050 | 4.449 | 4.686 | 3.530 | 3.9175 | 35.89 | 7.117 | 92.25 | 92.66 | 74.639 |
| VocalNet-Qwen3-8B (8B) | 68.65 | 4.245 | 3.3625 | 4.355 | 4.005 | 3.690 | 4.0975 | 34.89 | 7.208 | 91.50 | 92.79 | 75.580 |
| GPT Realtime API | 91.30 | 4.692 | 3.9300 | 4.162 | 6.042 | 4.665 | - | 61.11 | 7.996 | 90.25 | 48.22 | 77.230 |
| Cascade (Whisper+GPT-4o+CosyVoice2) | 86.20 | 4.138 | 3.7500 | 4.474 | 4.955 | 3.625 | 4.2050 | 66.33 | 6.769 | 91.50 | 90.79 | 80.291 |
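
To make the comparison with VocalNet-8B easier to read off the table, the following minimal sketch computes the per-metric difference between the two VocalNet rows. The scores are copied from the table above; the helper name `compare` and the choice to treat WER as the only lower-is-better column are assumptions made for illustration and are not part of any official VocalNet tooling.

```python
# Scores copied from the VocalBench table above (two rows only).
COLUMNS = [
    "Knowledge", "Reasoning", "Creativity", "UTMOS", "WER",
    "Single-Round", "Multi-Round", "Instruction Following",
    "Emotional Empathy", "Safety", "Robust", "Overall",
]
VOCALNET_8B = [67.95, 3.748, 3.5050, 4.449, 4.686, 3.530,
               3.9175, 35.89, 7.117, 92.25, 92.66, 74.639]
VOCALNET_QWEN3_8B = [68.65, 4.245, 3.3625, 4.355, 4.005, 3.690,
                     4.0975, 34.89, 7.208, 91.50, 92.79, 75.580]

# Assumption: WER (word error rate) is the only column where lower is better.
LOWER_IS_BETTER = {"WER"}


def compare(baseline, candidate):
    """Return {metric: (delta, improved)} for candidate relative to baseline."""
    result = {}
    for name, base, cand in zip(COLUMNS, baseline, candidate):
        delta = round(cand - base, 4)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        result[name] = (delta, improved)
    return result


deltas = compare(VOCALNET_8B, VOCALNET_QWEN3_8B)
for name, (delta, improved) in deltas.items():
    print(f"{name:<22} {delta:+.4f}  {'better' if improved else 'worse'}")
```

Under these assumptions, the gains are not uniform: Creativity, UTMOS, Instruction Following, and Safety come out slightly lower for VocalNet-Qwen3-8B, while the remaining columns, including Overall, improve.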