---
license: apache-2.0
datasets:
- amaai-lab/MusicBench
base_model:
- Qwen/Qwen2.5-Omni-7B
---

# Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process
If you wish to train or run inference with the model, please visit the GitHub repository: [https://github.com/shuaijiang/Ke-Omni-R/](https://github.com/shuaijiang/Ke-Omni-R/).
If you find this model helpful, please like it and star our GitHub repository.

Ke-Omni-R is an advanced audio reasoning model built upon [Qwen2.5-Omni-7B](https://github.com/QwenLM/Qwen2.5-Omni). With only 10k post-training samples, Ke-Omni-R has achieved state-of-the-art performance on the MMAU *Test-mini* and *Test* benchmarks. Key insights from its development include:

- **GRPO Algorithm**: The GRPO algorithm significantly enhances the performance of the already strong base model (Qwen2.5-Omni-7B) and generalizes well even to unseen speech domains (an illustrative reward sketch follows this list).
- **Think Process**: Incorporating a concise think process (less than 50 words) plays a crucial role in improving reasoning capabilities.
- **KL Divergence**: Adding a KL-divergence term yielded slight improvements during GRPO training.
- **Domain Ratio vs. Data Volume**: Domain diversity outweighs data volume: we used only 10k samples, with 5k randomly selected from AVQA and another 5k from MusicBench.
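
For intuition, here is a minimal, hypothetical sketch of the kind of rule-based reward a GRPO setup with this output format could use. The function name, weights, and checks are illustrative assumptions; the actual training code lives in the GitHub repository linked above.

```python
import re

# Pattern for the required "<think> ... </think> <answer> ... </answer>" output.
THINK_ANSWER = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Score one sampled completion on format and answer accuracy.

    Hypothetical weights for illustration only; not the repository's code.
    """
    m = THINK_ANSWER.fullmatch(completion)
    if m is None:
        return 0.0                                    # wrong format: no reward
    reward = 0.0
    if len(m["think"].split()) <= 50:                 # concise think process
        reward += 0.5
    if m["answer"].strip().lower() == ground_truth.strip().lower():
        reward += 1.0                                 # correct final answer
    return reward

# Example: a well-formed, correct completion earns the full reward.
print(rule_based_reward(
    "<think>Buzzing insect, steady pitch.</think>\n<answer>honeybee</answer>",
    "honeybee",
))  # 1.5
```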

## Performance: Accuracies (%)↑ on the MMAU Test-mini and Test benchmarks
| Model                                 | Method                | Sound (Test-mini) | Sound (Test)  | Music (Test-mini) | Music (Test)  | Speech (Test-mini) | Speech (Test)  | Average (Test-mini) | Average (Test)  |
|---------------------------------------|-----------------------|-----------|-------|-----------|-------|-----------|------|------------|-------|
| -                                     | Human\*               | 86.31     | -     | 78.22     | -     | 82.17     | -     | 82.23     | -     |
| Gemini Pro 2.0 Flash                  | Direct Inference\*    | 56.46     | 61.73 | 58.68     | 56.53 | 51.65     | 61.53 | 55.60     | 59.93 |
| Audio Flamingo 2                      | Direct Inference\*    | 61.56     | 65.10 | **73.95** |**72.90**| 30.93     | 40.26 | 55.48     | 59.42 |
| GPT4o + Strong Cap.                   | Direct Inference\*    | 57.35     | 55.83 | 49.70     | 51.73 | 64.86     | **68.66** | 57.30     | 58.74 |
| Llama-3-8B-Instruct + Strong Cap.     | Direct Inference\*    | 50.75     | 49.10 | 48.93     | 48.93 | 55.25     | 62.70 | 52.10     | 53.57 |
| Qwen2-Audio-7B-Instruct               | Direct Inference\*    | 54.95     | 45.90 | 50.98     | 53.26 | 42.04     | 45.90 | 49.20     | 52.50 |
| SALAMONN                              | Direct Inference\*    | 41.00     | 40.30 | 34.80     | 33.76 | 25.50     | 24.24 | 33.70     | 32.77 |
| Audio-Reasoner(Qwen2-Audio-7B-Instruct) | \[1\]               | 60.06     | -     | 64.30     | -     | 60.70     | -     | 61.71     | -     |
| Audio-Cot(Qwen2-Audio-7B-Instruct)    | \[2\]                 | 61.86     | -     | 56.29     | -     | 55.26     | -     | 57.80     | -     |
| R1-AQA(Qwen2-Audio-7B-Instruct)       | \[3\]                 | 68.77     | 69.76 | 64.37     | 61.40 | 63.66     | 62.70 | 65.60     | 64.36 |
| Qwen2.5-Omni-3B                       | \[4\]                 | **70.27**     | -     | 60.48     | -     | 59.16     | -     | 63.30     | -     |
| Qwen2.5-Omni-7B                       | \[4\]                 | 67.87     | -     | 69.16     | -     | 59.76     | -     | 65.60     | -     |
| Ke-Omni-R(Qwen2.5-Omni-7B)            | GRPO(ours)            | 69.37 | **71.90** | 69.46 | 67.13 |**67.87**  | 67.10 | **68.90** |**68.71** |

## Performance: CER/WER (%)↓ on ASR benchmarks
| Model                 | Method        |  WenetSpeech test-net | WenetSpeech test-meeting | LibriSpeech test-clean | LibriSpeech test-other|
| ---|----| ----| ----| ---- | ----|
| Qwen2.5-Omni-3B | \[4\] |  6.3 | 8.1 | 2.2 | 4.5 |
| Qwen2.5-Omni-7B | \[4\] | 5.9 | 7.7 | 1.8 | 3.4 |
| Ke-Omni-3B | ours | 11.7 | 16.1 | 1.8 | 3.8 |
| Ke-Omni-7B | ours | 7.5 | 9.8 | **1.6** | **3.1** |

Note:

- \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
  
- \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318.  

- \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246.

- \[3\] Li, Gang, et al. "Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering." arXiv preprint arXiv:2503.11197.

- \[4\] Xu, Jin, et al. "Qwen2.5-Omni Technical Report." arXiv preprint arXiv:2503.20215.
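
For reference, the WER/CER figures above follow the standard edit-distance definition (substitutions + insertions + deletions over reference length). A minimal sketch of how such numbers are typically computed, using the third-party `jiwer` package (an assumption for illustration, not a dependency of this repository):

```python
# pip install jiwer  (third-party package, assumed here for illustration)
import jiwer

references = ["the quick brown fox", "a cat sat on the mat"]
hypotheses = ["the quick brown box", "a cat sat on mat"]

print(f"WER: {100 * jiwer.wer(references, hypotheses):.1f}%")  # word error rate
print(f"CER: {100 * jiwer.cer(references, hypotheses):.1f}%")  # character error rate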


## Usage

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# You can directly insert a local file path, a URL, or a base64-encoded audio
# at the position you want in the text. The batch below holds three
# conversations: a free-form description request and two multiple-choice
# questions that use the <think>/<answer> format.
messages = [
    ## Local audio path
    [{"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
     {"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "Please describe this audio."}]}],
    [{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "What is the main source of sound in the audio? ['aircraft', 'Car', 'Tank', 'Missile'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
    [{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBXTktoom8_000030.wav"}, {"type": "text", "text": "What animal is the main source of sound in the video? ['dog', 'wasp', 'honeybee', 'dragonfly'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
]

model_path = 'KE-Team/Ke-Omni-R'
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_path)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
audios, images, videos = process_mm_info(messages, use_audio_in_video=False)
inputs = processor(text=text, images=images, videos=videos, audio=audios, padding=True, return_tensors="pt")

# return_audio=False skips the talker's speech synthesis; only text is generated.
generation = model.generate(**inputs, return_audio=False, thinker_temperature=0, thinker_do_sample=False)
generated_ids = generation[:, inputs.input_ids.size(1):]
completions = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(completions)
```

The output should look like:
```
["Well, it sounds like there's a car accelerating. You can hear the engine revving up, and there's a bit of a thump or thud sound too. It might be the car hitting something or just a part of the acceleration process. It gives off a sense of speed and power. What do you think about it? Do you have any other audio samples you want to talk about?", '<think>The audio features a vehicle accelerating and revving, which is characteristic of a car. The sound is consistent with a car engine, not an aircraft, tank, or missile.</think>\n<answer>Car</answer>', "<think>The main source of sound is a buzzing insect, which is consistent with the size and sound of a honeybee. The other options don't match the sound or context.</think>\n<answer>honeybee</answer>"]
```
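
To consume these completions programmatically, you can split each one into its think and answer parts. `parse_completion` below is a small hypothetical helper added here for convenience, not part of the repository:

```python
import re

def parse_completion(text: str):
    """Return (think, answer); falls back to (None, full text) for untagged replies."""
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), m.group(2).strip()

for completion in completions:
    think, answer = parse_completion(completion)
    print(f"answer={answer!r}")
```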

## Acknowledgements
We express our gratitude to the following projects and teams for their contributions:
- **R1-AQA**: Referenced the GRPO-based training implementation from [R1-AQA](https://github.com/xiaomi-research/r1-aqa).
- **Qwen Team**: Special thanks to the [Qwen2.5-Omni-7B](https://github.com/QwenLM/Qwen2.5-Omni) model for providing a robust foundation.
- **Datasets**: 
  - [AVQA](https://mn.cs.tsinghua.edu.cn/avqa/)
  - [MusicBench](https://amaai-lab.github.io/mustango/)
  - [MMAU](https://github.com/Sakshi113/MMAU/)


## Citation
```bibtex
@misc{zhao2025keomnir,
  author = {Zhao, Shuaijiang and Guo, Tingwei and Wen, Cheng and Xiang, Bajian and Zou, Wei},
  title = {Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/shuaijiang/Ke-Omni-R}},
}
```