Commit 5cbd2cc by cpatonn · verified · 1 Parent(s): 0458b6b

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,2053 @@
---
license: other
license_name: apache-2.0
language:
- en
tags:
- multimodal
library_name: transformers
pipeline_tag: any-to-any
base_model: Qwen/Qwen3-Omni-30B-A3B-Instruct
---

# Qwen3-Omni

<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>


## Overview
### Introduction

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/q3o_introduction.png" width="100%"/>
</p>

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:

* **State-of-the-art across modalities**: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.

* **Multilingual**: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
    - **Speech Input**: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
    - **Speech Output**: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.

* **Novel Architecture**: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.

* **Real-time Audio/Video Interaction**: Low-latency streaming with natural turn-taking and immediate text or speech responses.

* **Flexible Control**: Customize behavior via system prompts for fine-grained control and easy adaptation.

* **Detailed Audio Captioner**: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

### Model Architecture

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/overview.png" width="80%"/>
</p>

### Cookbooks for Use Cases

Qwen3-Omni supports a wide range of multimodal application scenarios, covering domain tasks that involve audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating these use cases; they include our actual execution logs. First follow the [QuickStart](#quickstart) guide to download the model and install the necessary inference dependencies, then run and experiment locally: try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!

<table>
<thead>
<tr>
<th>Category</th>
<th>Cookbook</th>
<th>Description</th>
<th>Open</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="6">Audio</td>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/speech_recognition.ipynb">Speech Recognition</a></td>
<td>Speech recognition, supporting multiple languages and long audio.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/speech_recognition.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/speech_translation.ipynb">Speech Translation</a></td>
<td>Speech-to-Text / Speech-to-Speech translation.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/speech_translation.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/music_analysis.ipynb">Music Analysis</a></td>
<td>Detailed analysis and appreciation of any music, including style, genre, rhythm, etc.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/music_analysis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/sound_analysis.ipynb">Sound Analysis</a></td>
<td>Description and analysis of various sound effects and audio signals.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/sound_analysis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_caption.ipynb">Audio Caption</a></td>
<td>Audio captioning, detailed description of any audio input.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_caption.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/mixed_audio_analysis.ipynb">Mixed Audio Analysis</a></td>
<td>Analysis of mixed audio content, such as speech, music, and environmental sounds.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/mixed_audio_analysis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td rowspan="7">Visual</td>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/ocr.ipynb">OCR</a></td>
<td>OCR for complex images.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/ocr.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/object_grounding.ipynb">Object Grounding</a></td>
<td>Object detection and grounding.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/object_grounding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/image_question.ipynb">Image Question</a></td>
<td>Answering arbitrary questions about any image.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/image_question.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/image_math.ipynb">Image Math</a></td>
<td>Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/image_math.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_description.ipynb">Video Description</a></td>
<td>Detailed description of video content.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_description.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_navigation.ipynb">Video Navigation</a></td>
<td>Generating navigation commands from first-person motion videos.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_navigation.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_scene_transition.ipynb">Video Scene Transition</a></td>
<td>Analysis of scene transitions in videos.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/video_scene_transition.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td rowspan="3">Audio-Visual</td>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_question.ipynb">Audio Visual Question</a></td>
<td>Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_question.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_interaction.ipynb">Audio Visual Interaction</a></td>
<td>Interactive communication with the model using audio-visual inputs, including task specification via audio.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_interaction.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_dialogue.ipynb">Audio Visual Dialogue</a></td>
<td>Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_visual_dialogue.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td>Agent</td>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_function_call.ipynb">Audio Function Call</a></td>
<td>Using audio input to perform function calls, enabling agent-like behaviors.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/audio_function_call.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
<tr>
<td>Downstream Task Fine-tuning</td>
<td><a href="https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb">Omni Captioner</a></td>
<td>Introduction and capability demonstration of <strong>Qwen3-Omni-30B-A3B-Captioner</strong>, a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model.</td>
<td><a href="https://colab.research.google.com/github/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td>
</tr>
</tbody>
</table>

## QuickStart

### Model Description and Download

Below are descriptions of all Qwen3-Omni models. Please select and download the model that fits your needs.

| Model Name | Description |
|------------------------------|-------------|
| Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the [Qwen3-Omni Technical Report](https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf). |
| Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the [Qwen3-Omni Technical Report](https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf). |
| Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's [cookbook](https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb). |

When loading in Hugging Face Transformers or vLLM, model weights are downloaded automatically based on the model name. If your runtime environment cannot download weights during execution, however, you can use the following commands to download them to a local directory in advance:

```bash
# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner
```
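
If you prefer to script the download, below is a minimal Python sketch using `huggingface_hub.snapshot_download`; the target directory is an assumption mirroring the CLI commands above.

```python
# A minimal sketch: download the weights programmatically with huggingface_hub.
# The local_dir value is an assumption mirroring the CLI commands above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    local_dir="./Qwen3-Omni-30B-A3B-Instruct",
)
```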

### Transformers Usage

#### Installation

The Hugging Face Transformers code for Qwen3-Omni has been merged, but no PyPI release includes it yet, so you need to install Transformers from source using the command below. We strongly recommend that you **create a new Python environment** to avoid runtime environment issues.

```bash
# If you already have transformers installed, please uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
```

We offer a toolkit that helps you handle various types of audio and visual input more conveniently, providing an API-like experience. It supports base64, URLs, and interleaved audio, images, and videos. You can install it with the following command; make sure your system has `ffmpeg` installed:

```bash
pip install qwen-omni-utils -U
```

Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using [vLLM](#vllm-usage) for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default.

```bash
pip install -U flash-attn --no-build-isolation
```

Also, your hardware must be compatible with FlashAttention 2; read more about it in the official documentation of the [FlashAttention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
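
For example, a load pinned to `bfloat16` (rather than `dtype="auto"` as in the snippet below) might look like this sketch; it is optional if you keep the defaults.

```python
# A minimal sketch: load in bfloat16 so FlashAttention 2 can be used.
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```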

#### Code Snippet

Here is a code snippet to show you how to use Qwen3-Omni with `transformers` and `qwen_omni_utils`:

```python
import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."}
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text,
                   audio=audios,
                   images=images,
                   videos=videos,
                   return_tensors="pt",
                   padding=True,
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs,
                                 speaker="Ethan",
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
```

Here are some more advanced usage examples. You can expand the sections below to learn more.

<details>
<summary>Batch inference</summary>

The model can batch mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set. Here is an example.

```python
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

# Conversation with image only
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you hear in this audio?"},
        ]
    }
]

# Conversation with pure text and system prompt
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen-Omni."}
        ],
    },
    {
        "role": "user",
        "content": "Who are you?"
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
        ],
    }
]

# Combine messages for batch processing
conversations = [conversation1, conversation2, conversation3, conversation4]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text,
                   audio=audios,
                   images=images,
                   videos=videos,
                   return_tensors="pt",
                   padding=True,
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch inference does not support returning audio
text_ids, audio = model.generate(**inputs,
                                 return_audio=False,
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
```

</details>

<details>
<summary>Use audio output or not</summary>

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option will save about `10GB` of GPU memory, but the `return_audio` option for the `generate` function will only allow `False`.
```python
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()
```

For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model will only return text outputs, resulting in faster text responses.

```python
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
...
text_ids, _ = model.generate(..., return_audio=False)
```

</details>

<details>
<summary>Change voice type of output audio</summary>

Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. |

Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`.

```python
text_ids, audio = model.generate(..., speaker="Ethan")
```

```python
text_ids, audio = model.generate(..., speaker="Chelsie")
```

```python
text_ids, audio = model.generate(..., speaker="Aiden")
```

</details>

### vLLM Usage

#### Installation

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Our code is currently at the pull-request stage, and **audio output inference support for the Instruct model will be released in the near future**, so for now follow the commands below to install vLLM from source. Please note that we recommend you **create a new Python environment** to avoid runtime conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).

```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install Transformers from source
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```

#### Inference

You can use the following code for vLLM inference. The `limit_mm_per_prompt` parameter specifies the maximum number of items of each modality allowed per prompt. Since vLLM pre-allocates GPU memory, larger values require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensor_parallel_size` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `max_num_seqs` indicates the number of sequences that vLLM processes in parallel during each inference step; a larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/api/vllm/index.html#vllm.LLM). Below is a simple example of how to run Qwen3-Omni with vLLM:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```

Here are some more advanced usage examples. You can expand the sections below to learn more.

<details>
<summary>Batch inference</summary>

Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking. Refer to the following code example:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

def build_input(processor, messages, use_audio_in_video):
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": use_audio_in_video,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    return inputs

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Conversation with image only
    conversation1 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
            ]
        }
    ]

    # Conversation with audio only
    conversation2 = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "text", "text": "What can you hear in this audio?"},
            ]
        }
    ]

    # Conversation with pure text and system prompt
    conversation3 = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are Qwen-Omni."}
            ],
        },
        {
            "role": "user",
            "content": "Who are you? Answer in one sentence."
        }
    ]

    # Conversation with mixed media
    conversation4 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/asr_fr.wav"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ],
        }
    ]

    USE_AUDIO_IN_VIDEO = True

    # Combine messages for batch processing
    conversations = [conversation1, conversation2, conversation3, conversation4]
    inputs = [build_input(processor, messages, USE_AUDIO_IN_VIDEO) for messages in conversations]

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    result = [outputs[i].outputs[0].text for i in range(len(outputs))]
    print(result)
```

</details>

<details>
<summary>vLLM Serve Usage</summary>

vLLM serve currently supports only the thinker component of Qwen3-Omni. The `use_audio_in_video` parameter is not available in vLLM serve; handle this by passing the video and audio inputs separately. You can start vLLM serve with the following command:

```bash
# Qwen3-Omni-30B-A3B-Instruct for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Instruct for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4
# Qwen3-Omni-30B-A3B-Thinking for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Thinking for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4
```

Then you can use the chat API as below (via curl, for example):
```bash
curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
                {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]}
        ]
    }'
```
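
The server exposes an OpenAI-compatible API, so you can also query it from Python. Below is a minimal sketch using the `openai` client package; the `base_url` matches the `--port`/`--host` flags above, and the `api_key` is a placeholder since vLLM does not require one by default.

```python
# A minimal sketch: query the OpenAI-compatible endpoint started by `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8901/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
            {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."},
        ]},
    ],
)
print(response.choices[0].message.content)
```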

</details>

### Usage Tips (Recommended Reading)

#### Minimum GPU memory requirements

| Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|------------------------------|-----------|-----------|-----------|-----------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |

**Note**: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attn_implementation="flash_attention_2"`. The Instruct model includes both the **thinker** and **talker** components, whereas the Thinking model includes only the **thinker** part.

#### Prompt for Audio-Visual Interaction

When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the **following system prompt**. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `user_system_prompt` field in the system prompt to include character settings or other role-specific descriptions as needed.

```python
user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."
message = {
    "role": "system",
    "content": [
        {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age.\nYou are communicating with the user.\nIn user messages, “I/me/my/we/our” refer to the user and “you/your” refer to the assistant. In your replies, address the user as “you/your” and yourself as “I/me/my”; never mirror the user’s pronouns—always shift perspective. Keep original pronouns only in direct quotes; if a reference is unclear, ask a brief clarifying question.\nInteract with users using short(no more than 50 words), brief, straightforward language, maintaining a natural tone.\nNever use formal phrasing, mechanical expressions, bullet points, overly structured language. \nYour output must consist only of the spoken content you want the user to hear. \nDo not include any descriptions of actions, emotions, sounds, or voice changes. \nDo not use asterisks, brackets, parentheses, or any other symbols to indicate tone or actions. \nYou must answer users' audio or text questions, do not directly describe the video content. \nYou should communicate in the same language strictly as the user unless they request otherwise.\nWhen you are uncertain (e.g., you can't see/hear clearly, don't understand, or the user makes a comment rather than asking a question), use appropriate questions to guide the user to continue the conversation.\nKeep replies concise and conversational, as if talking face-to-face."}
    ]
}
```
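
For instance, the sketch below (with a hypothetical video path) prepends this system message to an audio-visual conversation before applying the chat template, reusing the `processor` from the QuickStart snippet.

```python
# A minimal sketch: use the system message built above as the first turn.
# The video path is hypothetical.
conversation = [
    message,  # the system message constructed above
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
        ],
    },
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
```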

#### Best Practices for the Thinking Model

The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities. For example:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Analyze this audio, image, and video together."},
        ],
    }
]
```

#### Use audio in video

In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video.

```python
# In data preprocessing
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
```

```python
# For Transformers
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
                   padding=True, use_audio_in_video=True)
text_ids, audio = model.generate(..., use_audio_in_video=True)

# For vLLM
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = {
    'prompt': text,
    'multi_modal_data': {},
    "mm_processor_kwargs": {
        "use_audio_in_video": True,
    },
}
```

It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter must be set consistently across these steps; otherwise, unexpected results may occur.
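
A minimal pattern that keeps the flag consistent, assuming the `messages`, `processor`, and `model` objects from the snippets above, is to define it once and thread it through every step:

```python
# A minimal sketch: define the flag once so preprocessing, the processor,
# and generation always agree on whether video audio is used.
USE_AUDIO_IN_VIDEO = True

audios, images, videos = process_mm_info(messages, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
                   padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
```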
768
+
769
+ ## Evaluation
770
+
771
+ ### Performance of Qwen3-Omni
772
+
773
+ Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.
774
+
775
+ <details>
776
+ <summary>Text -> Text</summary>
777
+
778
+ <table>
779
+ <thead>
780
+ <tr>
781
+ <th colspan="2" style="text-align: left;"></th>
782
+ <th style="text-align: center;">GPT-4o-0327</th>
783
+ <th style="text-align: center;">Qwen3-235B-A22B<br>Non Thinking</th>
784
+ <th style="text-align: center;">Qwen3-30B-A3B-Instruct-2507</th>
785
+ <th style="text-align: center;">Qwen3-Omni-30B-A3B-Instruct</th>
786
+ <th style="text-align: center;">Qwen3-Omni-Flash-Instruct</th>
787
+ </tr>
788
+ </thead>
789
+ <tbody>
790
+ <tr>
791
+ <td rowspan="2" style="text-align: left; vertical-align: middle;">General<br>Tasks</td>
792
+ <td style="text-align: left;">MMLU-Redux</td>
793
+ <td style="text-align: center;"><strong>91.3</strong></td>
794
+ <td style="text-align: center;">89.2</td>
795
+ <td style="text-align: center;">89.3</td>
796
+ <td style="text-align: center;">86.6</td>
797
+ <td style="text-align: center;">86.8</td>
798
+ </tr>
799
+ <tr>
800
+ <td style="text-align: left;">GPQA</td>
801
+ <td style="text-align: center;">66.9</td>
802
+ <td style="text-align: center;">62.9</td>
803
+ <td style="text-align: center;"><strong>70.4</strong></td>
804
+ <td style="text-align: center;">69.6</td>
805
+ <td style="text-align: center;">69.7</td>
806
+ </tr>
807
+ <tr>
808
+ <td rowspan="2" style="text-align: left; vertical-align: middle;">Reasoning</td>
809
+ <td style="text-align: left;">AIME25</td>
810
+ <td style="text-align: center;">26.7</td>
811
+ <td style="text-align: center;">24.7</td>
812
+ <td style="text-align: center;">61.3</td>
813
+ <td style="text-align: center;">65.0</td>
814
+ <td style="text-align: center;"><strong>65.9</strong></td>
815
+ </tr>
816
+ <tr>
817
+ <td style="text-align: left;">ZebraLogic</td>
818
+ <td style="text-align: center;">52.6</td>
819
+ <td style="text-align: center;">37.7</td>
820
+ <td style="text-align: center;"><strong>90.0</strong></td>
821
+ <td style="text-align: center;">76.0</td>
822
+ <td style="text-align: center;">76.1</td>
823
+ </tr>
824
+ <tr>
825
+ <td style="text-align: left; vertical-align: middle;">Code</td>
826
+ <td style="text-align: left;">MultiPL-E</td>
827
+ <td style="text-align: center;">82.7</td>
828
+ <td style="text-align: center;">79.3</td>
829
+ <td style="text-align: center;"><strong>83.8</strong></td>
830
+ <td style="text-align: center;">81.4</td>
831
+ <td style="text-align: center;">81.5</td>
832
+ </tr>
833
+ </tbody>
834
+ <tbody>
835
+ <tr style="border-top: 1px solid #ddd;">
836
+ <td rowspan="3" style="text-align: left; vertical-align: middle;">Alignment<br>Tasks</td>
837
+ <td style="text-align: left;">IFEval</td>
838
+ <td style="text-align: center;">83.9</td>
839
+ <td style="text-align: center;">83.2</td>
840
+ <td style="text-align: center;"><strong>84.7</strong></td>
841
+ <td style="text-align: center;">81.0</td>
842
+ <td style="text-align: center;">81.7</td>
843
+ </tr>
844
+ <tr>
845
+ <td style="text-align: left;">Creative Writing v3</td>
846
+ <td style="text-align: center;">84.9</td>
847
+ <td style="text-align: center;">80.4</td>
848
+ <td style="text-align: center;"><strong>86.0</strong></td>
849
+ <td style="text-align: center;">80.6</td>
850
+ <td style="text-align: center;">81.8</td>
851
+ </tr>
852
+ <tr>
853
+ <td style="text-align: left;">WritingBench</td>
854
+ <td style="text-align: center;">75.5</td>
855
+ <td style="text-align: center;">77.0</td>
856
+ <td style="text-align: center;"><strong>85.5</strong></td>
857
+ <td style="text-align: center;">82.6</td>
858
+ <td style="text-align: center;">83.0</td>
859
+ </tr>
860
+ <tr>
861
+ <td style="text-align: left; vertical-align: middle;">Agent</td>
862
+ <td style="text-align: left;">BFCL-v3</td>
863
+ <td style="text-align: center;">66.5</td>
864
+ <td style="text-align: center;"><strong>68.0</strong></td>
865
+ <td style="text-align: center;">65.1</td>
866
+ <td style="text-align: center;">64.4</td>
867
+ <td style="text-align: center;">65.0</td>
868
+ </tr>
869
+ <tr>
870
+ <td rowspan="2" style="text-align: left; vertical-align: middle;">Multilingual<br>Tasks</td>
871
+ <td style="text-align: left;">MultiIF</td>
872
+ <td style="text-align: center;"><strong>70.4</strong></td>
873
+ <td style="text-align: center;">70.2</td>
874
+ <td style="text-align: center;">67.9</td>
875
+ <td style="text-align: center;">64.0</td>
876
+ <td style="text-align: center;">64.7</td>
877
+ </tr>
878
+ <tr>
879
+ <td style="text-align: left;">PolyMATH</td>
880
+ <td style="text-align: center;">25.5</td>
881
+ <td style="text-align: center;">27.0</td>
882
+ <td style="text-align: center;"><strong>43.1</strong></td>
883
+ <td style="text-align: center;">37.9</td>
884
+ <td style="text-align: center;">39.3</td>
885
+ </tr>
886
+ </tbody>
887
+ </table>
888
+
889
+ <table>
890
+ <thead>
891
+ <tr style="border-bottom: 1px solid black;">
892
+ <th></th>
893
+ <th></th>
894
+ <th>Gemini-2.5-Flash<br>Thinking</th>
895
+ <th>Qwen3-235B-A22B<br>Thinking</th>
896
+ <th>Qwen3-30B-A3B-Thinking-2507</th>
897
+ <th>Qwen3-Omni-30B-A3B-Thinking</th>
898
+ <th>Qwen3-Omni-Flash-Thinking</th>
899
+ </tr>
900
+ </thead>
901
+ <tbody>
902
+ <tr>
903
+ <td rowspan="2"><em>General<br>Tasks</em></td>
904
+ <td>MMLU-Redux</td>
905
+ <td>92.1</td>
906
+ <td><b>92.7</b></td>
907
+ <td>91.4</td>
908
+ <td>88.8</td>
909
+ <td>89.7</td>
910
+ </tr>
911
+ <tr style="border-top: 1px solid #ddd;">
912
+ <td>GPQA</td>
913
+ <td><b>82.8</b></td>
914
+ <td>71.1</td>
915
+ <td>73.4</td>
916
+ <td>73.1</td>
917
+ <td>73.1</td>
918
+ </tr>
919
+ <tr style="border-top: 1px solid black;">
920
+ <td rowspan="2"><em>Reasoning</em></td>
921
+ <td>AIME25</td>
922
+ <td>72.0</td>
923
+ <td>81.5</td>
924
+ <td><b>85.0</b></td>
925
+ <td>73.7</td>
926
+ <td>74.0</td>
927
+ </tr>
928
+ <tr style="border-top: 1px solid #ddd;">
929
+ <td>LiveBench 20241125</td>
930
+ <td>74.3</td>
931
+ <td><b>77.1</b></td>
932
+ <td>76.8</td>
933
+ <td>71.8</td>
934
+ <td>70.3</td>
935
+ </tr>
936
+ <tr style="border-top: 1px solid black;">
937
+ <td><em>Code</em></td>
938
+ <td>MultiPL-E</td>
939
+ <td><b>84.5</b></td>
940
+ <td>79.9</td>
941
+ <td>81.3</td>
942
+ <td>80.6</td>
943
+ <td>81.0</td>
944
+ </tr>
945
+ <tr style="border-top: 1px solid #ddd;">
946
+ <td rowspan="4"><em>Alignment<br>Tasks</em></td>
947
+ <td>IFEval</td>
948
+ <td><b>89.8</b></td>
949
+ <td>83.4</td>
950
+ <td>88.9</td>
951
+ <td>85.1</td>
952
+ <td>85.2</td>
953
+ </tr>
954
+ <tr style="border-top: 1px solid #ddd;">
955
+ <td>Arena-Hard v2</td>
956
+ <td>56.7</td>
957
+ <td><b>61.5</b></td>
958
+ <td>56.0</td>
959
+ <td>55.1</td>
960
+ <td>57.8</td>
961
+ </tr>
962
+ <tr style="border-top: 1px solid #ddd;">
963
+ <td>Creative Writing v3</td>
964
+ <td><b>85.0</b></td>
965
+ <td>84.6</td>
966
+ <td>84.4</td>
967
+ <td>82.5</td>
968
+ <td>83.6</td>
969
+ </tr>
970
+ <tr style="border-top: 1px solid #ddd;">
971
+ <td>WritingBench</td>
972
+ <td>83.9</td>
973
+ <td>80.3</td>
974
+ <td>85.0</td>
975
+ <td>85.5</td>
976
+ <td><b>85.9</b></td>
977
+ </tr>
978
+ <tr style="border-top: 1px solid black;">
979
+ <td><em>Agent</em></td>
980
+ <td>BFCL-v3</td>
981
+ <td>68.6</td>
982
+ <td>70.8</td>
983
+ <td><b>72.4</b></td>
984
+ <td>63.2</td>
985
+ <td>64.5</td>
986
+ </tr>
987
+ <tr style="border-top: 1px solid black;">
988
+ <td rowspan="2"><em>Multilingual<br>Tasks</em></td>
989
+ <td>MultiIF</td>
990
+ <td>74.4</td>
991
+ <td>71.9</td>
992
+ <td><b>76.4</b></td>
993
+ <td>72.9</td>
994
+ <td>73.2</td>
995
+ </tr>
996
+ <tr>
997
+ <td>PolyMATH</td>
998
+ <td>49.8</td>
999
+ <td><b>54.7</b></td>
1000
+ <td>52.6</td>
1001
+ <td>47.1</td>
1002
+ <td>48.7</td>
1003
+ </tr>
1004
+ </tbody>
1005
+ </table>
1006
+
1007
+ </details>
1008
+
1009
+ <details>
1010
+ <summary>Audio -> Text</summary>
1011
+
1012
+ <table style="width:100%; border-collapse: collapse;">
1013
+ <thead>
1014
+ <tr>
1015
+ <th align="left" style="padding: 8px;"></th>
1016
+ <th align="center" style="padding: 8px;">Seed-ASR</th>
1017
+ <th align="center" style="padding: 8px;">Voxtral-Mini</th>
1018
+ <th align="center" style="padding: 8px;">Voxtral-Small</th>
1019
+ <th align="center" style="padding: 8px;">GPT-4o-Transcribe</th>
1020
+ <th align="center" style="padding: 8px;">Gemini-2.5-Pro</th>
1021
+ <th align="center" style="padding: 8px;">Qwen2.5-Omni</th>
1022
+ <th align="center" style="padding: 8px;">Qwen3-Omni-30B-A3B-Instruct</th>
1023
+ <th align="center" style="padding: 8px;">Qwen3-Omni-Flash-Instruct</th>
1024
+ </tr>
1025
+ </thead>
1026
+ <tbody>
1027
+ <tr style="border-top: 1px solid #333;">
1028
+ <td colspan="9" align="center"; style="border-top: 1px solid black; border-bottom: 1px solid black;"><em>EN & ZH ASR (wer)</em></td>
1029
+ </tr>
1030
+ <tr>
1031
+ <td align="left" style="padding: 8px;">Wenetspeech<br><em>net</em> | <em>meeting</em></td>
1032
+ <td align="center" style="padding: 8px;">4.66 | <strong>5.69</strong></td>
1033
+ <td align="center" style="padding: 8px;">24.30 | 31.53</td>
1034
+ <td align="center" style="padding: 8px;">20.33 | 26.08</td>
1035
+ <td align="center" style="padding: 8px;">15.30 | 32.27</td>
1036
+ <td align="center" style="padding: 8px;">14.43 | 13.47</td>
1037
+ <td align="center" style="padding: 8px;">5.91 | 7.65</td>
1038
+ <td align="center" style="padding: 8px;">4.69 | 5.89</td>
1039
+ <td align="center" style="padding: 8px;"><strong>4.62</strong> | 5.75</td>
1040
+ </tr>
1041
+ <tr>
1042
+ <td align="left" style="padding: 8px;">Librispeech<br><em>clean</em> | <em>other</em></td>
1043
+ <td align="center" style="padding: 8px;">1.58 | 2.84</td>
1044
+ <td align="center" style="padding: 8px;">1.88 | 4.12</td>
1045
+ <td align="center" style="padding: 8px;">1.56 | 3.30</td>
1046
+ <td align="center" style="padding: 8px;">1.39 | 3.75</td>
1047
+ <td align="center" style="padding: 8px;">2.89 | 3.56</td>
1048
+ <td align="center" style="padding: 8px;">1.74 | 3.45</td>
1049
+ <td align="center" style="padding: 8px;"><strong>1.22</strong> | 2.48</td>
1050
+ <td align="center" style="padding: 8px;">1.27 | <strong>2.44</strong></td>
1051
+ </tr>
1052
+ <tr>
1053
+ <td align="left" style="padding: 8px;">CV15-en</td>
1054
+ <td align="center" style="padding: 8px;">-</td>
1055
+ <td align="center" style="padding: 8px;">9.47</td>
1056
+ <td align="center" style="padding: 8px;">7.79</td>
1057
+ <td align="center" style="padding: 8px;">10.01</td>
1058
+ <td align="center" style="padding: 8px;">9.89</td>
1059
+ <td align="center" style="padding: 8px;">7.61</td>
1060
+ <td align="center" style="padding: 8px;">6.05</td>
1061
+ <td align="center" style="padding: 8px;"><strong>5.94</strong></td>
1062
+ </tr>
1063
+ <tr>
1064
+ <td align="left" style="padding: 8px;">CV15-zh</td>
1065
+ <td align="center" style="padding: 8px;">-</td>
1066
+ <td align="center" style="padding: 8px;">24.67</td>
1067
+ <td align="center" style="padding: 8px;">19.30</td>
1068
+ <td align="center" style="padding: 8px;">9.84</td>
1069
+ <td align="center" style="padding: 8px;">8.00</td>
1070
+ <td align="center" style="padding: 8px;">5.13</td>
1071
+ <td align="center" style="padding: 8px;">4.31</td>
1072
+ <td align="center" style="padding: 8px;"><strong>4.28</strong></td>
1073
+ </tr>
1074
+ <tr>
1075
+ <td align="left" style="padding: 8px;">Fleurs-en</td>
1076
+ <td align="center" style="padding: 8px;">3.40</td>
1077
+ <td align="center" style="padding: 8px;">3.96</td>
1078
+ <td align="center" style="padding: 8px;">3.77</td>
1079
+ <td align="center" style="padding: 8px;">3.32</td>
1080
+ <td align="center" style="padding: 8px;">2.94</td>
1081
+ <td align="center" style="padding: 8px;">3.77</td>
1082
+ <td align="center" style="padding: 8px;"><strong>2.72</strong></td>
1083
+ <td align="center" style="padding: 8px;">2.74</td>
1084
+ </tr>
1085
+ <tr>
1086
+ <td align="left" style="padding: 8px;">Fleurs-zh</td>
1087
+ <td align="center" style="padding: 8px;">2.69</td>
1088
+ <td align="center" style="padding: 8px;">12.22</td>
1089
+ <td align="center" style="padding: 8px;">7.98</td>
1090
+ <td align="center" style="padding: 8px;">2.44</td>
1091
+ <td align="center" style="padding: 8px;">2.71</td>
1092
+ <td align="center" style="padding: 8px;">2.54</td>
1093
+ <td align="center" style="padding: 8px;">2.20</td>
1094
+ <td align="center" style="padding: 8px;"><strong>2.19</strong></td>
1095
+ </tr>
1096
+ <tr style="border-top: 1px solid #333;">
1097
+ <td colspan="9" align="center"; style="border-top: 1px solid black; border-bottom: 1px solid black;"><em>Multilingual ASR (wer)</em></td>
1098
+ </tr>
1099
+ <tr>
1100
+ <td align="left" style="padding: 8px;">Fleurs-avg<br>(19 lang)</td>
1101
+ <td align="center" style="padding: 8px;">-</td>
1102
+ <td align="center" style="padding: 8px;">15.67</td>
1103
+ <td align="center" style="padding: 8px;">8.09</td>
1104
+ <td align="center" style="padding: 8px;">4.48</td>
1105
+ <td align="center" style="padding: 8px;">5.55</td>
1106
+ <td align="center" style="padding: 8px;">14.04</td>
1107
+ <td align="center" style="padding: 8px;">5.33</td>
1108
+ <td align="center" style="padding: 8px;"><strong>5.31</strong></td>
1109
+ </tr>
1110
+ <tr style="border-top: 1px solid #333;">
1111
+ <td colspan="9" align="center"; style="border-top: 1px solid black; border-bottom: 1px solid black;"><em>Lyric ASR (wer)</em></td>
1112
+ </tr>
1113
+ <tr>
1114
+ <td align="left" style="padding: 8px;">MIR-1K (vocal-only)</td>
1115
+ <td align="center" style="padding: 8px;">6.45</td>
1116
+ <td align="center" style="padding: 8px;">23.33</td>
1117
+ <td align="center" style="padding: 8px;">18.73</td>
1118
+ <td align="center" style="padding: 8px;">11.87</td>
1119
+ <td align="center" style="padding: 8px;">9.85</td>
1120
+ <td align="center" style="padding: 8px;">8.15</td>
1121
+ <td align="center" style="padding: 8px;">5.90</td>
1122
+ <td align="center" style="padding: 8px;"><strong>5.85</strong></td>
1123
+ </tr>
1124
+ <tr>
1125
+ <td align="left" style="padding: 8px;">Opencpop-test</td>
1126
+ <td align="center" style="padding: 8px;">2.98</td>
1127
+ <td align="center" style="padding: 8px;">31.01</td>
1128
+ <td align="center" style="padding: 8px;">16.06</td>
1129
+ <td align="center" style="padding: 8px;">7.93</td>
1130
+ <td align="center" style="padding: 8px;">6.49</td>
1131
+ <td align="center" style="padding: 8px;">2.84</td>
1132
+ <td align="center" style="padding: 8px;"><strong>1.54</strong></td>
1133
+ <td align="center" style="padding: 8px;">2.02</td>
1134
+ </tr>
1135
+ <tr style="border-top: 1px solid #333;">
1136
+ <td colspan="9" align="center"; style="border-top: 1px solid black; border-bottom: 1px solid black;"><em>S2TT (BLEU)</em></td>
1137
+ </tr>
1138
+ <tr>
1139
+ <td align="left" style="padding: 8px;">Fleurs-en2xx</td>
1140
+ <td align="center" style="padding: 8px;">-</td>
1141
+ <td align="center" style="padding: 8px;">30.35</td>
1142
+ <td align="center" style="padding: 8px;">37.85</td>
1143
+ <td align="center" style="padding: 8px;">-</td>
1144
+ <td align="center" style="padding: 8px;"><strong>39.25</strong></td>
1145
+ <td align="center" style="padding: 8px;">29.22</td>
1146
+ <td align="center" style="padding: 8px;">37.50</td>
1147
+ <td align="center" style="padding: 8px;">36.22</td>
1148
+ </tr>
1149
+ <tr>
1150
+ <td align="left" style="padding: 8px;">Fleurs-xx2en</td>
1151
+ <td align="center" style="padding: 8px;">-</td>
1152
+ <td align="center" style="padding: 8px;">27.54</td>
1153
+ <td align="center" style="padding: 8px;">32.81</td>
1154
+ <td align="center" style="padding: 8px;">-</td>
1155
+ <td align="center" style="padding: 8px;"><strong>35.41</strong></td>
1156
+ <td align="center" style="padding: 8px;">28.61</td>
1157
+ <td align="center" style="padding: 8px;">31.08</td>
1158
+ <td align="center" style="padding: 8px;">30.71</td>
1159
+ </tr>
1160
+ <tr>
1161
+ <td align="left" style="padding: 8px;">Fleurs-zh2xx</td>
1162
+ <td align="center" style="padding: 8px;">-</td>
1163
+ <td align="center" style="padding: 8px;">17.03</td>
1164
+ <td align="center" style="padding: 8px;">22.05</td>
1165
+ <td align="center" style="padding: 8px;">-</td>
1166
+ <td align="center" style="padding: 8px;"><strong>26.63</strong></td>
1167
+ <td align="center" style="padding: 8px;">17.97</td>
1168
+ <td align="center" style="padding: 8px;">25.17</td>
1169
+ <td align="center" style="padding: 8px;">25.10</td>
1170
+ </tr>
1171
+ <tr>
1172
+ <td align="left" style="padding: 8px;">Fleurs-xx2zh</td>
1173
+ <td align="center" style="padding: 8px;">-</td>
1174
+ <td align="center" style="padding: 8px;">28.75</td>
1175
+ <td align="center" style="padding: 8px;">34.82</td>
1176
+ <td align="center" style="padding: 8px;">-</td>
1177
+ <td align="center" style="padding: 8px;"><strong>37.50</strong></td>
1178
+ <td align="center" style="padding: 8px;">27.68</td>
1179
+ <td align="center" style="padding: 8px;">33.13</td>
1180
+ <td align="center" style="padding: 8px;">31.19</td>
1181
+ </tr>
1182
+ </tbody>
1183
+ </table>
1184
+
1185
+ <table style="width:100%; border-collapse: collapse;">
1186
+ <thead>
1187
+ <tr style="border-bottom: 1px solid #ddd;">
1188
+ <th style="text-align:left; padding: 8px;"></th>
1189
+ <th style="text-align:center; padding: 8px;">GPT-4o-Audio</th>
1190
+ <th style="text-align:center; padding: 8px;">Gemini-2.5-Flash</th>
1191
+ <th style="text-align:center; padding: 8px;">Gemini-2.5-Pro</th>
1192
+ <th style="text-align:center; padding: 8px;">Qwen2.5-Omni</th>
1193
+ <th style="text-align:center; padding: 8px;">Qwen3-Omni-30B-A3B-Instruct</th>
1194
+ <th style="text-align:center; padding: 8px;">Qwen3-Omni-30B-A3B-Thinking</th>
1195
+ <th style="text-align:center; padding: 8px;">Qwen3-Omni-Flash-Instruct</th>
1196
+ <th style="text-align:center; padding: 8px;">Qwen3-Omni-Flash-Thinking</th>
1197
+ </tr>
1198
+ </thead>
1199
+ <tbody>
1200
+ <tr>
1201
+ <td colspan="9" align="center" style="padding: 8px; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;"><strong>VoiceBench</strong></td>
1202
+ </tr>
1203
+ <tr>
1204
+ <td style="text-align:left; padding: 8px;">AlpacaEval</td>
1205
+ <td style="text-align:center; padding: 8px;">95.6</td>
1206
+ <td style="text-align:center; padding: 8px;">96.1</td>
1207
+ <td style="text-align:center; padding: 8px;">94.3</td>
1208
+ <td style="text-align:center; padding: 8px;">89.9</td>
1209
+ <td style="text-align:center; padding: 8px;">94.8</td>
1210
+ <td style="text-align:center; padding: 8px;">96.4</td>
1211
+ <td style="text-align:center; padding: 8px;">95.4</td>
1212
+ <td style="text-align:center; padding: 8px;"><strong>96.8</strong></td>
1213
+ </tr>
1214
+ <tr>
1215
+ <td style="text-align:left; padding: 8px;">CommonEval</td>
1216
+ <td style="text-align:center; padding: 8px;">89.8</td>
1217
+ <td style="text-align:center; padding: 8px;">88.3</td>
1218
+ <td style="text-align:center; padding: 8px;">88.4</td>
1219
+ <td style="text-align:center; padding: 8px;">76.7</td>
1220
+ <td style="text-align:center; padding: 8px;">90.8</td>
1221
+ <td style="text-align:center; padding: 8px;">90.5</td>
1222
+ <td style="text-align:center; padding: 8px;"><strong>91.0</strong></td>
1223
+ <td style="text-align:center; padding: 8px;">90.9</td>
1224
+ </tr>
1225
+ <tr>
1226
+ <td style="text-align:left; padding: 8px;">WildVoice</td>
1227
+ <td style="text-align:center; padding: 8px;">91.6</td>
1228
+ <td style="text-align:center; padding: 8px;">92.1</td>
1229
+ <td style="text-align:center; padding: 8px;">93.4</td>
1230
+ <td style="text-align:center; padding: 8px;">77.7</td>
1231
+ <td style="text-align:center; padding: 8px;">91.6</td>
1232
+ <td style="text-align:center; padding: 8px;">90.5</td>
1233
+ <td style="text-align:center; padding: 8px;"><strong>92.3</strong></td>
1234
+ <td style="text-align:center; padding: 8px;">90.9</td>
1235
+ </tr>
1236
+ <tr>
1237
+ <td style="text-align:left; padding: 8px;">SD-QA</td>
1238
+ <td style="text-align:center; padding: 8px;">75.5</td>
1239
+ <td style="text-align:center; padding: 8px;">84.5</td>
1240
+ <td style="text-align:center; padding: 8px;"><strong>90.1</strong></td>
1241
+ <td style="text-align:center; padding: 8px;">56.4</td>
1242
+ <td style="text-align:center; padding: 8px;">76.9</td>
1243
+ <td style="text-align:center; padding: 8px;">78.1</td>
1244
+ <td style="text-align:center; padding: 8px;">76.8</td>
1245
+ <td style="text-align:center; padding: 8px;">78.5</td>
1246
+ </tr>
1247
+ <tr>
1248
+ <td style="text-align:left; padding: 8px;">MMSU</td>
1249
+ <td style="text-align:center; padding: 8px;">80.3</td>
1250
+ <td style="text-align:center; padding: 8px;">66.1</td>
1251
+ <td style="text-align:center; padding: 8px;">71.1</td>
1252
+ <td style="text-align:center; padding: 8px;">61.7</td>
1253
+ <td style="text-align:center; padding: 8px;">68.1</td>
1254
+ <td style="text-align:center; padding: 8px;">83.0</td>
1255
+ <td style="text-align:center; padding: 8px;">68.4</td>
1256
+ <td style="text-align:center; padding: 8px;"><strong>84.3</strong></td>
1257
+ </tr>
1258
+ <tr>
1259
+ <td style="text-align:left; padding: 8px;">OpenBookQA</td>
1260
+ <td style="text-align:center; padding: 8px;">89.2</td>
1261
+ <td style="text-align:center; padding: 8px;">56.9</td>
1262
+ <td style="text-align:center; padding: 8px;">92.3</td>
1263
+ <td style="text-align:center; padding: 8px;">80.9</td>
1264
+ <td style="text-align:center; padding: 8px;">89.7</td>
1265
+ <td style="text-align:center; padding: 8px;">94.3</td>
1266
+ <td style="text-align:center; padding: 8px;">91.4</td>
1267
+ <td style="text-align:center; padding: 8px;"><strong>95.0</strong></td>
1268
+ </tr>
1269
+ <tr>
1270
+ <td style="text-align:left; padding: 8px;">BBH</td>
1271
+ <td style="text-align:center; padding: 8px;">84.1</td>
1272
+ <td style="text-align:center; padding: 8px;">83.9</td>
1273
+ <td style="text-align:center; padding: 8px;"><strong>92.6</strong></td>
1274
+ <td style="text-align:center; padding: 8px;">66.7</td>
1275
+ <td style="text-align:center; padding: 8px;">80.4</td>
1276
+ <td style="text-align:center; padding: 8px;">88.9</td>
1277
+ <td style="text-align:center; padding: 8px;">80.6</td>
1278
+ <td style="text-align:center; padding: 8px;">89.6</td>
1279
+ </tr>
1280
+ <tr>
1281
+ <td style="text-align:left; padding: 8px;">IFEval</td>
1282
+ <td style="text-align:center; padding: 8px;">76.0</td>
1283
+ <td style="text-align:center; padding: 8px;">83.8</td>
1284
+ <td style="text-align:center; padding: 8px;"><strong>85.7</strong></td>
1285
+ <td style="text-align:center; padding: 8px;">53.5</td>
1286
+ <td style="text-align:center; padding: 8px;">77.8</td>
1287
+ <td style="text-align:center; padding: 8px;">80.6</td>
1288
+ <td style="text-align:center; padding: 8px;">75.2</td>
1289
+ <td style="text-align:center; padding: 8px;">80.8</td>
1290
+ </tr>
1291
+ <tr>
1292
+ <td style="text-align:left; padding: 8px;">AdvBench</td>
1293
+ <td style="text-align:center; padding: 8px;">98.7</td>
1294
+ <td style="text-align:center; padding: 8px;">98.9</td>
1295
+ <td style="text-align:center; padding: 8px;">98.1</td>
1296
+ <td style="text-align:center; padding: 8px;">99.2</td>
1297
+ <td style="text-align:center; padding: 8px;"><strong>99.3</strong></td>
1298
+ <td style="text-align:center; padding: 8px;">97.2</td>
1299
+ <td style="text-align:center; padding: 8px;"><strong>99.4</strong></td>
1300
+ <td style="text-align:center; padding: 8px;">98.9</td>
1301
+ </tr>
1302
+ <tr>
1303
+ <td style="text-align:left; padding: 8px;">Overall</td>
1304
+ <td style="text-align:center; padding: 8px;">86.8</td>
1305
+ <td style="text-align:center; padding: 8px;">83.4</td>
1306
+ <td style="text-align:center; padding: 8px;"><strong>89.6</strong></td>
1307
+ <td style="text-align:center; padding: 8px;">73.6</td>
1308
+ <td style="text-align:center; padding: 8px;">85.5</td>
1309
+ <td style="text-align:center; padding: 8px;">88.8</td>
1310
+ <td style="text-align:center; padding: 8px;">85.6</td>
1311
+ <td style="text-align:center; padding: 8px;">89.5</td>
1312
+ </tr>
1313
+ <tr>
1314
+ <td colspan="9" align="center" style="padding: 8px; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;"><strong>Audio Reasoning</strong></td>
1315
+ </tr>
1316
+ <tr>
1317
+ <td style="text-align:left; padding: 8px;">MMAU-v05.15.25</td>
1318
+ <td style="text-align:center; padding: 8px;">62.5</td>
1319
+ <td style="text-align:center; padding: 8px;">71.8</td>
1320
+ <td style="text-align:center; padding: 8px;">77.4</td>
1321
+ <td style="text-align:center; padding: 8px;">65.5</td>
1322
+ <td style="text-align:center; padding: 8px;">77.5</td>
1323
+ <td style="text-align:center; padding: 8px;">75.4</td>
1324
+ <td style="text-align:center; padding: 8px;"><strong>77.6</strong></td>
1325
+ <td style="text-align:center; padding: 8px;">76.5</td>
1326
+ </tr>
1327
+ <tr">
1328
+ <td style="text-align:left; padding: 8px;">MMSU</td>
1329
+ <td style="text-align:center; padding: 8px;">56.4</td>
1330
+ <td style="text-align:center; padding: 8px;">70.2</td>
1331
+ <td style="text-align:center; padding: 8px;"><strong>77.7</strong></td>
1332
+ <td style="text-align:center; padding: 8px;">62.6</td>
1333
+ <td style="text-align:center; padding: 8px;">69.0</td>
1334
+ <td style="text-align:center; padding: 8px;">70.2</td>
1335
+ <td style="text-align:center; padding: 8px;">69.1</td>
1336
+ <td style="text-align:center; padding: 8px;">71.3</td>
1337
+ </tr>
1338
+ </tbody>
1339
+ </table>
1340
+
1341
+ <table>
1342
+ <thead>
1343
+ <tr style="border-bottom: 1px solid black;">
1344
+ <th style="text-align: left;"></th>
1345
+ <th style="text-align: center;">Best Specialist<br>Models</th>
1346
+ <th style="text-align: center;">GPT-4o-Audio</th>
1347
+ <th style="text-align: center;">Gemini-2.5-Pro</th>
1348
+ <th style="text-align: center;">Qwen2.5-Omni</th>
1349
+ <th style="text-align: center;">Qwen3-Omni-30B-A3B-Instruct</th>
1350
+ <th style="text-align: center;">Qwen3-Omni-Flash-Instruct</th>
1351
+ </tr>
1352
+ </thead>
1353
+ <tbody>
1354
+ <tr>
1355
+ <td style="text-align: left;">RUL-MuchoMusic</td>
1356
+ <td style="text-align: center;">47.6 (Audio Flamingo 3)</td>
1357
+ <td style="text-align: center;">36.1</td>
1358
+ <td style="text-align: center;">49.4</td>
1359
+ <td style="text-align: center;">47.3</td>
1360
+ <td style="text-align: center;">52.0</td>
1361
+ <td style="text-align: center;"><strong>52.1</strong></td>
1362
+ </tr>
1363
+ <tr>
1364
+ <td style="text-align: left;">GTZAN<br><em>Acc.</em></td>
1365
+ <td style="text-align: center;">87.9 (CLaMP 3)</td>
1366
+ <td style="text-align: center;">76.5</td>
1367
+ <td style="text-align: center;">81.0</td>
1368
+ <td style="text-align: center;">81.7</td>
1369
+ <td style="text-align: center;">93.0</td>
1370
+ <td style="text-align: center;"><strong>93.1</strong></td>
1371
+ </tr>
1372
+ <tr>
1373
+ <td style="text-align: left;">MTG Genre<br><em>Micro F1</em></td>
1374
+ <td style="text-align: center;">35.8 (MuQ-MuLan)</td>
1375
+ <td style="text-align: center;">25.3</td>
1376
+ <td style="text-align: center;">32.6</td>
1377
+ <td style="text-align: center;">32.5</td>
1378
+ <td style="text-align: center;">39.0</td>
1379
+ <td style="text-align: center;"><strong>39.5</strong></td>
1380
+ </tr>
1381
+ <tr>
1382
+ <td style="text-align: left;">MTG Mood/Theme<br><em>Micro F1</em></td>
1383
+ <td style="text-align: center;">10.9 (MuQ-MuLan)</td>
1384
+ <td style="text-align: center;">11.3</td>
1385
+ <td style="text-align: center;">14.1</td>
1386
+ <td style="text-align: center;">8.9</td>
1387
+ <td style="text-align: center;">21.0</td>
1388
+ <td style="text-align: center;"><strong>21.7</strong></td>
1389
+ </tr>
1390
+ <tr>
1391
+ <td style="text-align: left;">MTG Instrument<br><em>Micro F1</em></td>
1392
+ <td style="text-align: center;">39.8 (MuQ-MuLan)</td>
1393
+ <td style="text-align: center;">34.2</td>
1394
+ <td style="text-align: center;">33.0</td>
1395
+ <td style="text-align: center;">22.6</td>
1396
+ <td style="text-align: center;">40.5</td>
1397
+ <td style="text-align: center;"><strong>40.7</strong></td>
1398
+ </tr>
1399
+ <tr>
1400
+ <td style="text-align: left;">MTG Top50<br><em>Micro F1</em></td>
1401
+ <td style="text-align: center;">33.2 (MuQ-MuLan)</td>
1402
+ <td style="text-align: center;">25.0</td>
1403
+ <td style="text-align: center;">26.1</td>
1404
+ <td style="text-align: center;">21.6</td>
1405
+ <td style="text-align: center;">36.7</td>
1406
+ <td style="text-align: center;"><strong>36.9</strong></td>
1407
+ </tr>
1408
+ <tr>
1409
+ <td style="text-align: left;">MagnaTagATune<br><em>Micro F1</em></td>
1410
+ <td style="text-align: center;">41.6 (MuQ)</td>
1411
+ <td style="text-align: center;">29.2</td>
1412
+ <td style="text-align: center;">28.1</td>
1413
+ <td style="text-align: center;">30.1</td>
1414
+ <td style="text-align: center;">44.3</td>
1415
+ <td style="text-align: center;"><strong>46.8</strong></td>
1416
+ </tr>
1417
+ </tbody>
1418
+ </table>
1419
+
1420
+ </details>
1421
+
1422
+ <details>
1423
+ <summary>Vision -> Text</summary>
1424
+
1425
+ <table style="width:100%; border-collapse: collapse;">
1426
+ <thead>
1427
+ <tr style="border-bottom: 1px solid black;">
1428
+ <th style="text-align: left;">Datasets</th>
1429
+ <th style="text-align: center;">GPT4-o</th>
1430
+ <th style="text-align: center;">Gemini-2.0-Flash</th>
1431
+ <th style="text-align: center;">Qwen2.5-VL<br>72B</th>
1432
+ <th style="text-align: center;">Qwen3-Omni-30B-A3B<br>-Instruct</th>
1433
+ <th style="text-align: center;">Qwen3-Omni-Flash<br>-Instruct</th>
1434
+ </tr>
1435
+ </thead>
1436
+ <tbody>
1437
+ <tr>
1438
+ <td colspan="6" align="center" style="font-weight: bold; border-top: 1px solid #ddd; border-bottom: 1px solid black;">General Visual Question Answering</td>
1439
+ </tr>
1440
+ <tr>
1441
+ <td style="text-align: left;">MMStar</td>
1442
+ <td style="text-align: center;">64.7</td>
1443
+ <td style="text-align: center;"><strong>71.4</strong></td>
1444
+ <td style="text-align: center;">70.8</td>
1445
+ <td style="text-align: center;">68.5</td>
1446
+ <td style="text-align: center;">69.3</td>
1447
+ </tr>
1448
+ <tr>
1449
+ <td style="text-align: left;">HallusionBench</td>
1450
+ <td style="text-align: center;">55.0</td>
1451
+ <td style="text-align: center;">56.3</td>
1452
+ <td style="text-align: center;">55.2</td>
1453
+ <td style="text-align: center;"><strong>59.7</strong></td>
1454
+ <td style="text-align: center;">58.5</td>
1455
+ </tr>
1456
+ <tr>
1457
+ <td style="text-align: left;">MM-MT-Bench</td>
1458
+ <td style="text-align: center;"><strong>7.7</strong></td>
1459
+ <td style="text-align: center;">6.7</td>
1460
+ <td style="text-align: center;">7.6</td>
1461
+ <td style="text-align: center;">7.4</td>
1462
+ <td style="text-align: center;">7.6</td>
1463
+ </tr>
1464
+ <tr>
1465
+ <td colspan="6" align="center" style="font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Math & STEM</td>
1466
+ </tr>
1467
+ <tr>
1468
+ <td style="text-align: left;">MMMU_val</td>
1469
+ <td style="text-align: center;">69.1</td>
1470
+ <td style="text-align: center;"><strong>71.3</strong></td>
1471
+ <td style="text-align: center;">70.2</td>
1472
+ <td style="text-align: center;">69.1</td>
1473
+ <td style="text-align: center;">69.8</td>
1474
+ </tr>
1475
+ <tr>
1476
+ <td style="text-align: left;">MMMU_pro</td>
1477
+ <td style="text-align: center;">51.9</td>
1478
+ <td style="text-align: center;">56.1</td>
1479
+ <td style="text-align: center;">51.1</td>
1480
+ <td style="text-align: center;">57.0</td>
1481
+ <td style="text-align: center;"><strong>57.6</strong></td>
1482
+ </tr>
1483
+ <tr>
1484
+ <td style="text-align: left;">MathVista_mini</td>
1485
+ <td style="text-align: center;">63.8</td>
1486
+ <td style="text-align: center;">71.4</td>
1487
+ <td style="text-align: center;">74.8</td>
1488
+ <td style="text-align: center;">75.9</td>
1489
+ <td style="text-align: center;"><strong>77.4</strong></td>
1490
+ </tr>
1491
+ <tr>
1492
+ <td style="text-align: left;">MathVision_full</td>
1493
+ <td style="text-align: center;">30.4</td>
1494
+ <td style="text-align: center;">48.6</td>
1495
+ <td style="text-align: center;">38.1</td>
1496
+ <td style="text-align: center;">56.3</td>
1497
+ <td style="text-align: center;"><strong>58.3</strong></td>
1498
+ </tr>
1499
+ <tr>
1500
+ <td colspan="6" align="center" style="font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Documentation Understanding</td>
1501
+ </tr>
1502
+ <tr>
1503
+ <td style="text-align: left;">AI2D</td>
1504
+ <td style="text-align: center;">84.6</td>
1505
+ <td style="text-align: center;">86.7</td>
1506
+ <td style="text-align: center;"><strong>88.7</strong></td>
1507
+ <td style="text-align: center;">85.2</td>
1508
+ <td style="text-align: center;">86.4</td>
1509
+ </tr>
1510
+ <tr>
1511
+ <td style="text-align: left;">ChartQA_test</td>
1512
+ <td style="text-align: center;">86.7</td>
1513
+ <td style="text-align: center;">64.6</td>
1514
+ <td style="text-align: center;"><strong>89.5</strong></td>
1515
+ <td style="text-align: center;">86.8</td>
1516
+ <td style="text-align: center;">87.1</td>
1517
+ </tr>
1518
+ <tr>
1519
+ <td colspan="6" align="center" style="font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Counting</td>
1520
+ </tr>
1521
+ <tr>
1522
+ <td style="text-align: left;">CountBench</td>
1523
+ <td style="text-align: center;">87.9</td>
1524
+ <td style="text-align: center;">91.2</td>
1525
+ <td style="text-align: center;"><strong>93.6</strong></td>
1526
+ <td style="text-align: center;">90.0</td>
1527
+ <td style="text-align: center;">90.0</td>
1528
+ </tr>
1529
+ <tr>
1530
+ <td colspan="6" align="center" style="font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Video Understanding</td>
1531
+ </tr>
1532
+ <tr>
1533
+ <td style="text-align: left;">Video-MME</td>
1534
+ <td style="text-align: center;">71.9</td>
1535
+ <td style="text-align: center;">72.4</td>
1536
+ <td style="text-align: center;"><strong>73.3</strong></td>
1537
+ <td style="text-align: center;">70.5</td>
1538
+ <td style="text-align: center;">71.4</td>
1539
+ </tr>
1540
+ <tr>
1541
+ <td style="text-align: left;">LVBench</td>
1542
+ <td style="text-align: center;">30.8</td>
1543
+ <td style="text-align: center;"><strong>57.9</strong></td>
1544
+ <td style="text-align: center;">47.3</td>
1545
+ <td style="text-align: center;">50.2</td>
1546
+ <td style="text-align: center;">51.1</td>
1547
+ </tr>
1548
+ <tr>
1549
+ <td style="text-align: left;">MLVU</td>
1550
+ <td style="text-align: center;">64.6</td>
1551
+ <td style="text-align: center;">71.0</td>
1552
+ <td style="text-align: center;">74.6</td>
1553
+ <td style="text-align: center;">75.2</td>
1554
+ <td style="text-align: center;"><strong>75.7</strong></td>
1555
+ </tr>
1556
+ </tbody>
1557
+ </table>
1558
+
1559
+ <table style="width: 100%; border-collapse: collapse;">
1560
+ <thead style="border-bottom: 1px solid black;">
1561
+ <tr>
1562
+ <th align="left" style="padding: 6px;">Datasets</th>
1563
+ <th align="center" style="padding: 6px;">Gemini-2.5-flash-thinking</th>
1564
+ <th align="center" style="padding: 6px;">InternVL-3.5-241B-A28B</th>
1565
+ <th align="center" style="padding: 6px;">Qwen3-Omni-30B-A3B-Thinking</th>
1566
+ <th align="center" style="padding: 6px;">Qwen3-Omni-Flash-Thinking</th>
1567
+ </tr>
1568
+ </thead>
1569
+ <tbody>
1570
+ <tr style="border-top: 2px solid black; border-bottom: 1px solid #ccc;">
1571
+ <td colspan="5" align="center" style="padding: 6px 0; font-weight: bold; border-bottom: 1px solid black;">General Visual Question Answering</td>
1572
+ </tr>
1573
+ <tr>
1574
+ <td style="padding: 6px;">MMStar</td>
1575
+ <td align="center" style="padding: 6px;">75.5</td>
1576
+ <td align="center" style="padding: 6px;"><b>77.9</b></td>
1577
+ <td align="center" style="padding: 6px;">74.9</td>
1578
+ <td align="center" style="padding: 6px;">75.5</td>
1579
+ </tr>
1580
+ <tr>
1581
+ <td style="padding: 6px;">HallusionBench</td>
1582
+ <td align="center" style="padding: 6px;">61.1</td>
1583
+ <td align="center" style="padding: 6px;">57.3</td>
1584
+ <td align="center" style="padding: 6px;">62.8</td>
1585
+ <td align="center" style="padding: 6px;"><b>63.4</b></td>
1586
+ </tr>
1587
+ <tr>
1588
+ <td style="padding: 6px;">MM-MT-Bench</td>
1589
+ <td align="center" style="padding: 6px;">7.8</td>
1590
+ <td align="center" style="padding: 6px;">–</td>
1591
+ <td align="center" style="padding: 6px;"><b>8.0</b></td>
1592
+ <td align="center" style="padding: 6px;"><b>8.0</b></td>
1593
+ </tr>
1594
+ <tr style="border-top: 1px solid black; border-bottom: 1px solid #ccc;">
1595
+ <td colspan="5" align="center" style="padding: 6px 0; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Math & STEM</td>
1596
+ </tr>
1597
+ <tr>
1598
+ <td style="padding: 6px;">MMMU_val</td>
1599
+ <td align="center" style="padding: 6px;">76.9</td>
1600
+ <td align="center" style="padding: 6px;"><b>77.7</b></td>
1601
+ <td align="center" style="padding: 6px;">75.6</td>
1602
+ <td align="center" style="padding: 6px;">75.0</td>
1603
+ </tr>
1604
+ <tr>
1605
+ <td style="padding: 6px;">MMMU_pro</td>
1606
+ <td align="center" style="padding: 6px;"><b>65.8</b></td>
1607
+ <td align="center" style="padding: 6px;">–</td>
1608
+ <td align="center" style="padding: 6px;">60.5</td>
1609
+ <td align="center" style="padding: 6px;">60.8</td>
1610
+ </tr>
1611
+ <tr>
1612
+ <td style="padding: 6px;">MathVista_mini</td>
1613
+ <td align="center" style="padding: 6px;">77.6</td>
1614
+ <td align="center" style="padding: 6px;"><b>82.7</b></td>
1615
+ <td align="center" style="padding: 6px;">80.0</td>
1616
+ <td align="center" style="padding: 6px;">81.2</td>
1617
+ </tr>
1618
+ <tr>
1619
+ <td style="padding: 6px;">MathVision_full</td>
1620
+ <td align="center" style="padding: 6px;">62.3</td>
1621
+ <td align="center" style="padding: 6px;"><b>63.9</b></td>
1622
+ <td align="center" style="padding: 6px;">62.9</td>
1623
+ <td align="center" style="padding: 6px;">63.8</td>
1624
+ </tr>
1625
+ <tr style="border-top: 1px solid black; border-bottom: 1px solid #ccc;">
1626
+ <td colspan="5" align="center" style="padding: 6px 0; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Documentation Understanding</td>
1627
+ </tr>
1628
+ <tr>
1629
+ <td style="padding: 6px;">AI2D_test</td>
1630
+ <td align="center" style="padding: 6px;"><b>88.6</b></td>
1631
+ <td align="center" style="padding: 6px;">87.3</td>
1632
+ <td align="center" style="padding: 6px;">86.1</td>
1633
+ <td align="center" style="padding: 6px;">86.8</td>
1634
+ </tr>
1635
+ <tr>
1636
+ <td style="padding: 6px;">ChartQA_test</td>
1637
+ <td align="center" style="padding: 6px;">–</td>
1638
+ <td align="center" style="padding: 6px;">88.0</td>
1639
+ <td align="center" style="padding: 6px;"><b>89.5</b></td>
1640
+ <td align="center" style="padding: 6px;">89.3</td>
1641
+ </tr>
1642
+ <tr style="border-top: 1px solid black; border-bottom: 1px solid #ccc;">
1643
+ <td colspan="5" align="center" style="padding: 6px 0; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Counting</td>
1644
+ </tr>
1645
+ <tr>
1646
+ <td style="padding: 6px;">CountBench</td>
1647
+ <td align="center" style="padding: 6px;">88.6</td>
1648
+ <td align="center" style="padding: 6px;">–</td>
1649
+ <td align="center" style="padding: 6px;">88.6</td>
1650
+ <td align="center" style="padding: 6px;"><b>92.5</b></td>
1651
+ </tr>
1652
+ <tr style="border-top: 1px solid black; border-bottom: 1px solid #ccc;">
1653
+ <td colspan="5" align="center" style="padding: 6px 0; font-weight: bold; border-top: 1px solid black; border-bottom: 1px solid black;">Video Understanding</td>
1654
+ </tr>
1655
+ <tr>
1656
+ <td style="padding: 6px;">Video-MME</td>
1657
+ <td align="center" style="padding: 6px;"><b>79.6</b></td>
1658
+ <td align="center" style="padding: 6px;">72.9</td>
1659
+ <td align="center" style="padding: 6px;">69.7</td>
1660
+ <td align="center" style="padding: 6px;">69.8</td>
1661
+ </tr>
1662
+ <tr>
1663
+ <td style="padding: 6px;">LVBench</td>
1664
+ <td align="center" style="padding: 6px;"><b>64.5</b></td>
1665
+ <td align="center" style="padding: 6px;">–</td>
1666
+ <td align="center" style="padding: 6px;">49.0</td>
1667
+ <td align="center" style="padding: 6px;">49.5</td>
1668
+ </tr>
1669
+ <tr>
1670
+ <td style="padding: 6px;">MLVU</td>
1671
+ <td align="center" style="padding: 6px;"><b>82.1</b></td>
1672
+ <td align="center" style="padding: 6px;">78.2</td>
1673
+ <td align="center" style="padding: 6px;">72.9</td>
1674
+ <td align="center" style="padding: 6px;">73.9</td>
1675
+ </tr>
1676
+ </tbody>
1677
+ </table>
1678
+
1679
+ </details>
1680
+
1681
+ <details>
1682
+ <summary>AudioVisual -> Text</summary>
1683
+
1684
+ <table>
1685
+ <thead>
1686
+ <tr>
1687
+ <th>Datasets</th>
1688
+ <th>Previous Open-source SoTA</th>
1689
+ <th>Gemini-2.5-Flash</th>
1690
+ <th>Qwen2.5-Omni</th>
1691
+ <th>Qwen3-Omni-30B-A3B-Instruct</th>
1692
+ <th>Qwen3-Omni-Flash-Instruct</th>
1693
+ </tr>
1694
+ </thead>
1695
+ <tbody>
1696
+ <tr>
1697
+ <td>WorldSense</td>
1698
+ <td>47.1</td>
1699
+ <td>50.9</td>
1700
+ <td>45.4</td>
1701
+ <td>54.0</td>
1702
+ <td><strong>54.1</strong></td>
1703
+ </tr>
1704
+ </tbody>
1705
+ </table>
1706
+
1707
+ <table>
1708
+ <thead>
1709
+ <tr>
1710
+ <th>Datasets</th>
1711
+ <th>Previous Open-source SoTA</th>
1712
+ <th>Gemini-2.5-Flash-Thinking</th>
1713
+ <th>Qwen3-Omni-30B-A3B-Thinking</th>
1714
+ <th>Qwen3-Omni-Flash-Thinking</th>
1715
+ </tr>
1716
+ </thead>
1717
+ <tbody>
1718
+ <tr>
1719
+ <td>DailyOmni</td>
1720
+ <td>69.8</td>
1721
+ <td>72.7</td>
1722
+ <td>75.8</td>
1723
+ <td><b>76.2</b></td>
1724
+ </tr>
1725
+ <tr>
1726
+ <td>VideoHolmes</td>
1727
+ <td>55.6</td>
1728
+ <td>49.5</td>
1729
+ <td><b>57.3</b></td>
1730
+ <td><b>57.3</b></td>
1731
+ </tr>
1732
+ </tbody>
1733
+ </table>
1734
+
1735
+ </details>
1736
+
1737
+
1738
+ <details>
1739
+ <summary>Zero-shot Speech Generation</summary>
1740
+
1741
+ <table>
1742
+ <thead>
1743
+ <tr>
1744
+ <th align="left">Datasets</th>
1745
+ <th align="left">Model</th>
1746
+ <th align="left">Performance</th>
1747
+ </tr>
1748
+ </thead>
1749
+ <tbody>
1750
+ <tr>
1751
+ <td>&nbsp;</td>
1752
+ <td colspan="2" align="center"><em>Content Consistency</em></td>
1753
+ </tr>
1754
+ </tbody>
1755
+ <tbody>
1756
+ <tr>
1757
+ <td rowspan="10" align="center" valign="middle"><strong>SEED</strong><br><em>test-zh</em> | <em>test-en</em></td>
1758
+ <td align="left">Seed-TTS<sub>ICL</sub></td>
1759
+ <td align="left">1.11 | 2.24</td>
1760
+ </tr>
1761
+ <tr>
1762
+ <td align="left">Seed-TTS<sub>RL</sub></td>
1763
+ <td align="left">1.00 | 1.94</td>
1764
+ </tr>
1765
+ <tr>
1766
+ <td align="left">MaskGCT</td>
1767
+ <td align="left">2.27 | 2.62</td>
1768
+ </tr>
1769
+ <tr>
1770
+ <td align="left">E2 TTS</td>
1771
+ <td align="left">1.97 | 2.19</td>
1772
+ </tr>
1773
+ <tr>
1774
+ <td align="left">F5-TTS</td>
1775
+ <td align="left">1.56 | 1.83</td>
1776
+ </tr>
1777
+ <tr>
1778
+ <td align="left">Spark TTS</td>
1779
+ <td align="left">1.20 | 1.98</td>
1780
+ </tr>
1781
+ <tr>
1782
+ <td align="left">CosyVoice 2</td>
1783
+ <td align="left">1.45 | 2.57</td>
1784
+ </tr>
1785
+ <tr>
1786
+ <td align="left">CosyVoice 3</td>
1787
+ <td align="left"><strong>0.71</strong> | 1.45</td>
1788
+ </tr>
1789
+ <tr>
1790
+ <td align="left">Qwen2.5-Omni-7B</td>
1791
+ <td align="left">1.42 | 2.33</td>
1792
+ </tr>
1793
+ <tr>
1794
+ <td align="left">Qwen3-Omni-30B-A3B</td>
1795
+ <td align="left">1.07 | <strong>1.39</strong></td>
1796
+ </tr>
1797
+ </tbody>
1798
+ </table>
1799
+
1800
+ </details>
1801
+
1802
+ <details>
1803
+ <summary>Multilingual Speech Generation </summary>
1804
+
1805
+ <table>
1806
+ <thead>
1807
+ <tr>
1808
+ <th rowspan="2" align="left">Language</th>
1809
+ <th colspan="3" style="text-align:center; padding: 8px; font-weight: bold; border-bottom: 1px solid #ddd;">Content Consistency</th>
1810
+ <th colspan="3" style="text-align:center; padding: 8px; font-weight: bold; border-bottom: 1px solid #ddd;">Speaker Similarity</th>
1811
+ </tr>
1812
+ <tr>
1813
+ <th align="center">Qwen3-Omni-30B-A3B</th>
1814
+ <th align="center">MiniMax</th>
1815
+ <th align="center">ElevenLabs</th>
1816
+ <th align="center">Qwen3-Omni-30B-A3B</th>
1817
+ <th align="center">MiniMax</th>
1818
+ <th align="center">ElevenLabs</th>
1819
+ </tr>
1820
+ </thead>
1821
+ <tbody>
1822
+ <tr>
1823
+ <td align="left">Chinese</td>
1824
+ <td align="center"><strong>0.716</strong></td>
1825
+ <td align="center">2.252</td>
1826
+ <td align="center">16.026</td>
1827
+ <td align="center">0.772</td>
1828
+ <td align="center"><strong>0.780</strong></td>
1829
+ <td align="center">0.677</td>
1830
+ </tr>
1831
+ <tr>
1832
+ <td align="left">English</td>
1833
+ <td align="center"><strong>1.069</strong></td>
1834
+ <td align="center">2.164</td>
1835
+ <td align="center">2.339</td>
1836
+ <td align="center"><strong>0.773</strong></td>
1837
+ <td align="center">0.756</td>
1838
+ <td align="center">0.613</td>
1839
+ </tr>
1840
+ <tr>
1841
+ <td align="left">German</td>
1842
+ <td align="center">0.777</td>
1843
+ <td align="center">1.906</td>
1844
+ <td align="center"><strong>0.572</strong></td>
1845
+ <td align="center"><strong>0.738</strong></td>
1846
+ <td align="center">0.733</td>
1847
+ <td align="center">0.614</td>
1848
+ </tr>
1849
+ <tr>
1850
+ <td align="left">Italian</td>
1851
+ <td align="center"><strong>1.067</strong></td>
1852
+ <td align="center">1.543</td>
1853
+ <td align="center">1.743</td>
1854
+ <td align="center"><strong>0.742</strong></td>
1855
+ <td align="center">0.699</td>
1856
+ <td align="center">0.579</td>
1857
+ </tr>
1858
+ <tr>
1859
+ <td align="left">Portuguese</td>
1860
+ <td align="center">1.872</td>
1861
+ <td align="center">1.877</td>
1862
+ <td align="center"><strong>1.331</strong></td>
1863
+ <td align="center">0.770</td>
1864
+ <td align="center"><strong>0.805</strong></td>
1865
+ <td align="center">0.711</td>
1866
+ </tr>
1867
+ <tr>
1868
+ <td align="left">Spanish</td>
1869
+ <td align="center">1.765</td>
1870
+ <td align="center"><strong>1.029</strong></td>
1871
+ <td align="center">1.084</td>
1872
+ <td align="center">0.744</td>
1873
+ <td align="center"><strong>0.762</strong></td>
1874
+ <td align="center">0.615</td>
1875
+ </tr>
1876
+ <tr>
1877
+ <td align="left">Japanese</td>
1878
+ <td align="center">3.631</td>
1879
+ <td align="center"><strong>3.519</strong></td>
1880
+ <td align="center">10.646</td>
1881
+ <td align="center">0.763</td>
1882
+ <td align="center"><strong>0.776</strong></td>
1883
+ <td align="center">0.738</td>
1884
+ </tr>
1885
+ <tr>
1886
+ <td align="left">Korean</td>
1887
+ <td align="center"><strong>1.670</strong></td>
1888
+ <td align="center">1.747</td>
1889
+ <td align="center">1.865</td>
1890
+ <td align="center"><strong>0.778</strong></td>
1891
+ <td align="center">0.776</td>
1892
+ <td align="center">0.700</td>
1893
+ </tr>
1894
+ <tr>
1895
+ <td align="left">French</td>
1896
+ <td align="center"><strong>2.505</strong></td>
1897
+ <td align="center">4.099</td>
1898
+ <td align="center">5.216</td>
1899
+ <td align="center"><strong>0.689</strong></td>
1900
+ <td align="center">0.628</td>
1901
+ <td align="center">0.535</td>
1902
+ </tr>
1903
+ <tr>
1904
+ <td align="left">Russian</td>
1905
+ <td align="center">3.986</td>
1906
+ <td align="center">4.281</td>
1907
+ <td align="center"><strong>3.878</strong></td>
1908
+ <td align="center">0.759</td>
1909
+ <td align="center"><strong>0.761</strong></td>
1910
+ <td align="center">0.676</td>
1911
+ </tr>
1912
+ </tbody>
1913
+ </table>
1914
+
1915
+ </details>
1916
+
1917
+ <details>
1918
+ <summary>Cross-Lingual Speech Generation </summary>
1919
+
1920
+ <table>
1921
+ <thead>
1922
+ <tr>
1923
+ <th style="text-align: left;">Language</th>
1924
+ <th style="text-align: left;">Qwen3-Omni-30B-A3B</th>
1925
+ <th style="text-align: left;">CosyVoice3</th>
1926
+ <th style="text-align: left;">CosyVoice2</th>
1927
+ </tr>
1928
+ </thead>
1929
+ <tbody>
1930
+ <tr>
1931
+ <td style="text-align: left;">en-to-zh</td>
1932
+ <td style="text-align: left;">5.37</td>
1933
+ <td style="text-align: left;"><strong>5.09</strong></td>
1934
+ <td style="text-align: left;">13.5</td>
1935
+ </tr>
1936
+ <tr>
1937
+ <td style="text-align: left;">ja-to-zh</td>
1938
+ <td style="text-align: left;">3.32</td>
1939
+ <td style="text-align: left;"><strong>3.05</strong></td>
1940
+ <td style="text-align: left;">48.1</td>
1941
+ </tr>
1942
+ <tr>
1943
+ <td style="text-align: left;">ko-to-zh</td>
1944
+ <td style="text-align: left;"><strong>0.99</strong></td>
1945
+ <td style="text-align: left;">1.06</td>
1946
+ <td style="text-align: left;">7.70</td>
1947
+ </tr>
1948
+ <tr>
1949
+ <td style="text-align: left;">zh-to-en</td>
1950
+ <td style="text-align: left;"><strong>2.76</strong></td>
1951
+ <td style="text-align: left;">2.98</td>
1952
+ <td style="text-align: left;">6.47</td>
1953
+ </tr>
1954
+ <tr>
1955
+ <td style="text-align: left;">ja-to-en</td>
1956
+ <td style="text-align: left;"><strong>3.31</strong></td>
1957
+ <td style="text-align: left;">4.20</td>
1958
+ <td style="text-align: left;">17.1</td>
1959
+ </tr>
1960
+ <tr>
1961
+ <td style="text-align: left;">ko-to-en</td>
1962
+ <td style="text-align: left;"><strong>3.34</strong></td>
1963
+ <td style="text-align: left;">4.19</td>
1964
+ <td style="text-align: left;">11.2</td>
1965
+ </tr>
1966
+ <tr>
1967
+ <td style="text-align: left;">zh-to-ja</td>
1968
+ <td style="text-align: left;">8.29</td>
1969
+ <td style="text-align: left;"><strong>7.08</strong></td>
1970
+ <td style="text-align: left;">13.1</td>
1971
+ </tr>
1972
+ <tr>
1973
+ <td style="text-align: left;">en-to-ja</td>
1974
+ <td style="text-align: left;">7.53</td>
1975
+ <td style="text-align: left;"><strong>6.80</strong></td>
1976
+ <td style="text-align: left;">14.9</td>
1977
+ </tr>
1978
+ <tr>
1979
+ <td style="text-align: left;">ko-to-ja</td>
1980
+ <td style="text-align: left;">4.24</td>
1981
+ <td style="text-align: left;"><strong>3.93</strong></td>
1982
+ <td style="text-align: left;">5.86</td>
1983
+ </tr>
1984
+ <tr>
1985
+ <td style="text-align: left;">zh-to-ko</td>
1986
+ <td style="text-align: left;"><strong>5.13</strong></td>
1987
+ <td style="text-align: left;">14.4</td>
1988
+ <td style="text-align: left;">24.8</td>
1989
+ </tr>
1990
+ <tr>
1991
+ <td style="text-align: left;">en-to-ko</td>
1992
+ <td style="text-align: left;"><strong>4.96</strong></td>
1993
+ <td style="text-align: left;">5.87</td>
1994
+ <td style="text-align: left;">21.9</td>
1995
+ </tr>
1996
+ <tr>
1997
+ <td style="text-align: left;">ja-to-ko</td>
1998
+ <td style="text-align: left;"><strong>6.23</strong></td>
1999
+ <td style="text-align: left;">7.92</td>
2000
+ <td style="text-align: left;">21.5</td>
2001
+ </tr>
2002
+ </tbody>
2003
+ </table>
2004
+
2005
+ </details>
2006
+
2007
+
2008
+ ### Setting for Evaluation
2009
+
2010
+ * **Decoding Strategy**: Across all evaluation benchmarks, Qwen3-Omni `Instruct` models use greedy decoding (no sampling). For `Thinking` models, take the decoding parameters from the `generation_config.json` file in the checkpoint (a sketch follows the input example below).
2011
+ * **Benchmark-Specific Formatting**: Most evaluation benchmarks define their own ChatML formatting for embedding the question or prompt; use it where provided. Note that all video inputs are sampled at `fps=2` during evaluation.
2012
+ * **Default Prompts**: For benchmark tasks that do not include a prompt, we use the following defaults:
2013
+
2014
+ | Task Type | Prompt |
2015
+ | :--- | :--- |
2016
+ | Automatic Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 |
2017
+ | Automatic Speech Recognition (ASR) for other languages | Transcribe the <language> audio into text. |
2018
+ | Speech-to-Text Translation (S2TT) | Listen to the provided <source_language> speech and produce a translation in <target_language> text. |
2019
+ | Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. |
2020
+
2021
+ * **System Prompt**: Do not set a system prompt for any evaluation benchmark.
2022
+ * **Input Sequence**: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come **after** multimodal data in the sequence. For example:
2023
+
2024
+ ```python
2025
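+ # Multimodal items come first; the text prompt goes last unless the benchmark specifies otherwise.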
+ messages = [
2026
+ {
2027
+ "role": "user",
2028
+ "content": [
2029
+ {"type": "audio", "audio": "/path/to/audio.wav"},
2030
+ {"type": "image", "image": "/path/to/image.png"},
2031
+ {"type": "video", "video": "/path/to/video.mp4"},
2032
+ {"type": "text", "text": "Describe the audio, image and video."},
2033
+ ],
2034
+ },
2035
+ ]
2036
+ ```
2037
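+
+ As a minimal sketch of the decoding setup above (assuming the standard Hugging Face Transformers `generate` API; the model ID and `max_new_tokens` value here are illustrative, not prescribed by this repo):
+
+ ```python
+ from transformers import GenerationConfig
+
+ # Instruct models: greedy decoding, no sampling.
+ instruct_kwargs = dict(do_sample=False, max_new_tokens=1024)  # token budget is an assumption
+
+ # Thinking models: reuse the decoding parameters shipped with the checkpoint.
+ thinking_config = GenerationConfig.from_pretrained("Qwen/Qwen3-Omni-30B-A3B-Thinking")
+
+ # outputs = model.generate(**inputs, **instruct_kwargs)
+ # outputs = model.generate(**inputs, generation_config=thinking_config)
+ ```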
+
2038
+
2039
+ <!-- ## Citation
2040
+
2041
+ If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
2042
+
2043
+
2044
+ ```BibTeX
2045
+ @article{Qwen3-Omni,
2046
+ title={Qwen3-Omni Technical Report},
2047
+ author={Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin},
2048
+ journal={arXiv preprint arXiv},
2049
+ year={2025}
2050
+ }
2051
+ ``` -->
2052
+
2053
+ <br>
added_tokens.json ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<tts_pad>": 151671,
9
+ "<tts_text_bos>": 151672,
10
+ "<tts_text_bos_single>": 151674,
11
+ "<tts_text_eod>": 151673,
12
+ "<|audio_end|>": 151670,
13
+ "<|audio_pad|>": 151675,
14
+ "<|audio_start|>": 151669,
15
+ "<|box_end|>": 151649,
16
+ "<|box_start|>": 151648,
17
+ "<|endoftext|>": 151643,
18
+ "<|file_sep|>": 151664,
19
+ "<|fim_middle|>": 151660,
20
+ "<|fim_pad|>": 151662,
21
+ "<|fim_prefix|>": 151659,
22
+ "<|fim_suffix|>": 151661,
23
+ "<|im_end|>": 151645,
24
+ "<|im_start|>": 151644,
25
+ "<|image_pad|>": 151655,
26
+ "<|object_ref_end|>": 151647,
27
+ "<|object_ref_start|>": 151646,
28
+ "<|quad_end|>": 151651,
29
+ "<|quad_start|>": 151650,
30
+ "<|repo_name|>": 151663,
31
+ "<|video_pad|>": 151656,
32
+ "<|vision_end|>": 151653,
33
+ "<|vision_pad|>": 151654,
34
+ "<|vision_start|>": 151652
35
+ }
chat_template.jinja ADDED
@@ -0,0 +1,122 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {%- if messages[0].content is string %}
5
+ {{- messages[0].content }}
6
+ {%- else %}
7
+ {%- for content in messages[0].content %}
8
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
9
+ {{- "<|vision_start|><|image_pad|><|vision_end|>" }}
10
+ {%- elif content.type == 'audio' or 'audio' in content or 'audio_url' in content %}
11
+ {{- "<|audio_start|><|audio_pad|><|audio_end|>" }}
12
+ {%- elif content.type == 'video' or 'video' in content %}
13
+ {{- "<|vision_start|><|video_pad|><|vision_end|>" }}
14
+ {%- elif content.type == 'text' %}
15
+ {{- content.text }}
16
+ {%- endif %}
17
+ {%- endfor %}
18
+ {%- endif %}
19
+ {%- endif %}
20
+ {{- '\n\n' }}
21
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
22
+ {%- for tool in tools %}
23
+ {{- "\n" }}
24
+ {{- tool | tojson }}
25
+ {%- endfor %}
26
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
27
+ {%- else %}
28
+ {%- if messages[0].role == 'system' %}
29
+ {%- if messages[0].content is string %}
30
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
31
+ {%- else %}
32
+ {%- for content in messages[0].content %}
33
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
34
+ {{- '<|im_start|>system\n' +"<|vision_start|><|image_pad|><|vision_end|>"+ '<|im_end|>\n' }}
35
+ {%- elif content.type == 'audio' or 'audio' in content or 'audio_url' in content %}
36
+ {{- '<|im_start|>system\n' +"<|audio_start|><|audio_pad|><|audio_end|>"+ '<|im_end|>\n' }}
37
+ {%- elif content.type == 'video' or 'video' in content %}
38
+ {{- '<|im_start|>system\n' +"<|vision_start|><|video_pad|><|vision_end|>"+ '<|im_end|>\n' }}
39
+ {%- elif content.type == 'text' %}
40
+ {{- '<|im_start|>system\n' +content.text+ '<|im_end|>\n' }}
41
+ {%- endif %}
42
+ {%- endfor %}
43
+ {%- endif %}
44
+ {%- endif %}
45
+ {%- endif %}
46
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
47
+ {%- for message in messages[::-1] %}
48
+ {%- set index = (messages|length - 1) - loop.index0 %}
49
+ {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
50
+ {%- set ns.multi_step_tool = false %}
51
+ {%- set ns.last_query_index = index %}
52
+ {%- endif %}
53
+ {%- endfor %}
54
+ {%- for message in messages %}
55
+ {%- if message.content is string %}
56
+ {%- set content = message.content %}
57
+ {%- else %}
58
+ {%- set content = namespace(text="") %}
59
+ {%- for mcontent in message.content %}
60
+ {%- if mcontent.type == 'image' or 'image' in mcontent or 'image_url' in mcontent %}
61
+ {%- set content.text = content.text~"<|vision_start|><|image_pad|><|vision_end|>" %}
62
+ {%- elif mcontent.type == 'audio' or 'audio' in mcontent or 'audio_url' in mcontent %}
63
+ {%- set content.text = content.text~"<|audio_start|><|audio_pad|><|audio_end|>" %}
64
+ {%- elif mcontent.type == 'video' or 'video' in mcontent %}
65
+ {%- set content.text = content.text~"<|vision_start|><|video_pad|><|vision_end|>" %}
66
+ {%- elif mcontent.type == 'text' %}
67
+ {%- set content.text = content.text~mcontent.text %}
68
+ {%- endif %}
69
+ {%- endfor %}
70
+ {%- set content = content.text %}
71
+ {%- endif %}
72
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
73
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
74
+ {%- elif message.role == "assistant" %}
75
+ {%- set reasoning_content = "" %}
76
+ {%- if message.reasoning_content is string %}
77
+ {%- set reasoning_content = message.reasoning_content %}
78
+ {%- else %}
79
+ {%- if '</think>' in content %}
80
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
81
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
82
+ {%- endif %}
83
+ {%- endif %}
84
+ {%- if loop.index0 > ns.last_query_index %}
85
+ {%- if loop.last or (not loop.last and reasoning_content) %}
86
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip("\n") + '\n</think>\n\n' + content.lstrip('\n') }}
87
+ {%- else %}
88
+ {{- '<|im_start|>' + message.role + '\n' + content }}
89
+ {%- endif %}
90
+ {%- else %}
91
+ {{- '<|im_start|>' + message.role + '\n' + content }}
92
+ {%- endif %}
93
+ {%- if message.tool_calls %}
94
+ {%- for tool_call in message.tool_calls %}
95
+ {%- if (loop.first and content) or (not loop.first) %}{{- '\n' }}{%- endif %}
96
+ {%- if tool_call.function %}
97
+ {%- set tool_call = tool_call.function %}
98
+ {%- endif %}
99
+ {{- '<tool_call>\n{"name": "' }}
100
+ {{- tool_call.name }}
101
+ {{- '", "arguments": ' }}
102
+ {%- if tool_call.arguments is string %}
103
+ {{- tool_call.arguments }}
104
+ {%- else %}
105
+ {{- tool_call.arguments | tojson }}
106
+ {%- endif %}
107
+ {{- '}\n</tool_call>' }}
108
+ {%- endfor %}
109
+ {%- endif %}
110
+ {{- '<|im_end|>\n' }}
111
+ {%- elif message.role == "tool" %}
112
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}
113
+ {{- '\n<tool_response>\n' }}
114
+ {{- content }}
115
+ {{- '\n</tool_response>' }}
116
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}
117
+ {%- endif %}
118
+ {%- endfor %}
119
+ {%- if add_generation_prompt %}
120
+ {{- '<|im_start|>assistant\n' }}
121
+ {%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}
122
+ {%- endif %}
config.json ADDED
The diff for this file is too large to render. See raw diff
 
generation_config.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "talker_max_new_tokens": 4096,
3
+ "talker_repetition_penalty": 1.05,
4
+ "talker_temperature": 0.9,
5
+ "talker_top_k": 50,
6
+ "talker_top_p": 1.0
7
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6c2ca5e4c2806f54813e18493b9d603ce5bcb00bd76a7259e230be40a3bf00db
3
+ size 5000521000
model-00002-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:154a2415f7e3e4b87cf39102a3ceec640b810979ce1fbf3a1ee5210b8a9bd4fd
3
+ size 4454993674
model-00003-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5b34562d08a7d2117dcfee5d54f7188ad71c937b072a0de954867fd21d52202
3
+ size 4999493480
model-00004-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d9c8ea4e1f765f4edf42231b73e97e5670c20bf8ec5653afafeb7a8ebf2f0832
3
+ size 5000582072
model-00005-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:914ac00960d66093d2ecc1fdef5962955a80bce9a9e3b341dfb361ad38707fce
3
+ size 5000055640
model-00006-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ae02a230d146a34a22dcad1504b4fef576d578c2e6632efa555c4f55f022495
3
+ size 5000586048
model-00007-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ecf7d4b31510f5f587f117241b3bc4aa7a5eb59dc1a036721f0fc50a142195f
3
+ size 5000055648
model-00008-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5f74b908592181d03914ce79d061ab401ec03f37964eab3d3bb81c448d3d2894
3
+ size 5000586032
model-00009-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:16a6d6be3194030ef94c70686bc5aff54525d7d583529731475f3c0a3994f540
3
+ size 3043130000
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "chunk_length": 30,
3
+ "dither": 0.0,
4
+ "feature_extractor_type": "WhisperFeatureExtractor",
5
+ "feature_size": 128,
6
+ "hop_length": 160,
7
+ "image_mean": [
8
+ 0.5,
9
+ 0.5,
10
+ 0.5
11
+ ],
12
+ "image_processor_type": "Qwen2VLImageProcessor",
13
+ "image_std": [
14
+ 0.5,
15
+ 0.5,
16
+ 0.5
17
+ ],
18
+ "max_pixels": 12845056,
19
+ "merge_size": 2,
20
+ "min_pixels": 3136,
21
+ "n_fft": 400,
22
+ "n_samples": 480000,
23
+ "nb_max_frames": 3000,
24
+ "padding_side": "right",
25
+ "padding_value": 0.0,
26
+ "patch_size": 16,
27
+ "processor_class": "Qwen3OmniMoeProcessor",
28
+ "return_attention_mask": true,
29
+ "sampling_rate": 16000,
30
+ "temporal_patch_size": 2
31
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>",
16
+ "<|audio_start|>",
17
+ "<|audio_end|>",
18
+ "<tts_pad>",
19
+ "<tts_text_bos>",
20
+ "<tts_text_bos_single>",
21
+ "<|audio_pad|>"
22
+ ],
23
+ "audio_bos_token": "<|audio_start|>",
24
+ "audio_eos_token": "<|audio_end|>",
25
+ "audio_token": "<|audio_pad|>",
26
+ "eos_token": {
27
+ "content": "<|im_end|>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ },
33
+ "image_token": "<|image_pad|>",
34
+ "pad_token": {
35
+ "content": "<|endoftext|>",
36
+ "lstrip": false,
37
+ "normalized": false,
38
+ "rstrip": false,
39
+ "single_word": false
40
+ },
41
+ "video_token": "<|video_pad|>",
42
+ "vision_bos_token": "<|vision_start|>",
43
+ "vision_eos_token": "<|vision_end|>"
44
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:09267689b8362020b9763b65dd5be7e086b31e28d72e02837a9e781de9a91bc7
3
+ size 11423986
tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151666": {
+       "content": "</tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151667": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151668": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151669": {
+       "content": "<|audio_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151670": {
+       "content": "<|audio_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151671": {
+       "content": "<tts_pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151672": {
+       "content": "<tts_text_bos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151673": {
+       "content": "<tts_text_eod>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151674": {
+       "content": "<tts_text_bos_single>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151675": {
+       "content": "<|audio_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<|audio_start|>",
+     "<|audio_end|>",
+     "<tts_pad>",
+     "<tts_text_bos>",
+     "<tts_text_bos_single>",
+     "<|audio_pad|>"
+   ],
+   "audio_bos_token": "<|audio_start|>",
+   "audio_eos_token": "<|audio_end|>",
+   "audio_token": "<|audio_pad|>",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {
+     "audio_bos_token": "<|audio_start|>",
+     "audio_eos_token": "<|audio_end|>",
+     "audio_token": "<|audio_pad|>",
+     "image_token": "<|image_pad|>",
+     "video_token": "<|video_pad|>",
+     "vision_bos_token": "<|vision_start|>",
+     "vision_eos_token": "<|vision_end|>"
+   },
+   "image_token": "<|image_pad|>",
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "processor_class": "Qwen3OmniMoeProcessor",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null,
+   "video_token": "<|video_pad|>",
+   "vision_bos_token": "<|vision_start|>",
+   "vision_eos_token": "<|vision_end|>"
+ }
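A minimal sketch (again with a placeholder repo id) of what the config above implies at load time: the added tokens occupy ids 151643-151675, <|im_end|> serves as EOS, and model_max_length reports a 131072-token context window. Entries flagged "special": false, such as <think>, are kept as single tokens during encoding but are not stripped by skip_special_tokens when decoding:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("cpatonn/<this-repo>")  # placeholder repo id

    assert tok.convert_tokens_to_ids("<|endoftext|>") == 151643
    assert tok.convert_tokens_to_ids("<|audio_pad|>") == 151675
    print(tok.model_max_length)  # 131072

    # "special": false added tokens survive skip_special_tokens on decode:
    ids = tok.encode("<think>ok</think>")
    print(tok.decode(ids, skip_special_tokens=True))  # "<think>ok</think>"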
video_preprocessor_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "crop_size": null,
+   "data_format": "channels_first",
+   "default_to_square": true,
+   "device": null,
+   "dither": 0.0,
+   "do_center_crop": null,
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "do_sample_frames": false,
+   "feature_extractor_type": "WhisperFeatureExtractor",
+   "feature_size": 128,
+   "fps": null,
+   "hop_length": 160,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "input_data_format": null,
+   "max_frames": 768,
+   "max_pixels": 12845056,
+   "merge_size": 2,
+   "min_frames": 4,
+   "min_pixels": 3136,
+   "n_fft": 400,
+   "n_samples": 4800000,
+   "nb_max_frames": 30000,
+   "num_frames": null,
+   "pad_size": null,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "patch_size": 16,
+   "processor_class": "Qwen3OmniMoeProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "return_attention_mask": true,
+   "return_metadata": false,
+   "sampling_rate": 16000,
+   "size": {
+     "longest_edge": 12845056,
+     "shortest_edge": 3136
+   },
+   "temporal_patch_size": 2,
+   "video_metadata": null,
+   "video_processor_type": "Qwen2VLVideoProcessor"
+ }
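A few of the constants above are easier to read once unpacked; a quick arithmetic check (not from the commit, and the 300 s reading of n_samples is an inference from the stated sampling rate):

    # rescale_factor is exactly 1/255 (uint8 -> [0, 1]).
    assert 0.00392156862745098 == 1 / 255
    # The pixel budget runs from a 56x56 minimum up to a 3584x3584 budget.
    assert 56 ** 2 == 3136 and 3584 ** 2 == 12845056
    # Whisper-style audio front end: 25 ms FFT window, 10 ms hop at 16 kHz.
    assert 400 / 16000 == 0.025 and 160 / 16000 == 0.01
    # n_samples at 16 kHz suggests a 300-second audio cap.
    assert 4_800_000 / 16_000 == 300.0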
vocab.json ADDED
The diff for this file is too large to render. See raw diff