Safetensors
omni
custom_code
undobug committed on
Commit a28df62 · verified · 1 Parent(s): 7622606

Update README.md

Files changed (1)
  1. README.md +45 -1056
README.md CHANGED
@@ -3,1112 +3,101 @@ license: apache-2.0
3
  ---
4
  <div align="center">
5
 
6
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/logo.png" width="300em" ></img>
7
 
8
- <!-- <img src="https://raw.githubusercontent.com/baichuan-inc/Baichuan-Omni-1.5/refs/heads/main/assets/logo.png" width="300em" ></img>
9
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/train-pipeline.png" width="300em" ></img> -->
10
- <!-- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework-v2.png" width="300em" ></img> -->
11
- **Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs**
12
 
13
 
14
 
15
- <p align="center">
16
- Baichuan-Omni-1.5 <a href="https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5">🤗</a> | Baichuan-Omni-1.5-Base <a href="https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5-Base">🤗</a> |Github <a href="https://github.com/baichuan-inc/Baichuan-Omni-1.5/">📖 </a> | Report <a href="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/baichuan_omni_1_5.pdf">📖</a>
17
  </p>
18
  </p>
19
- <p align="center">
20
- OpenMM-Medical <a href="https://huggingface.co/datasets/baichuan-inc/OpenMM-Medical">🤗</a> | OpenAudioBench <a href="https://huggingface.co/datasets/baichuan-inc/OpenAudioBench">🤗</a>
21
  </p>
22
- </div>
23
-
24
-
25
- <!-- ## Introduction
26
- **Baichuan-Omni-1.5** is the latest multimodal large model upgraded from Baichuan-omni: trained end-to-end, it supports omni-modal input and dual-modal output. Built on Qwen2.5-7B as the language-model base, it accepts images, video, text, and audio as input in an end-to-end manner and generates high-quality text and speech in a controllable way.
27
-
28
- - **Baichuan-Omni-1.5-Base**: To advance omni-modal large models, we open-source this omni-modal foundation model trained on massive high-quality data. It has not undergone SFT instruction tuning, remains highly adaptable, and is the **industry's first** open-source **omni-modal foundation model**.
29
-
30
- - **Baichuan-Omni-1.5**: Built on the powerful Baichuan-Omni-1.5-Base and trained end-to-end on high-quality omni-modal alignment and multimodal instruction data. Its text, image, video, and audio understanding reach the level of GPT-4o-mini, and its controllable audio generation is very strong, achieving top results on the xxx and xxx benchmarks. -->
31
-
32
-
33
- ## Baichuan-Omni-1.5
34
-
35
- Baichuan-Omni-1.5 is the latest and best-performing model in the Baichuan-omni series; it is trained and performs inference in an end-to-end manner. Compared with Baichuan-omni, it delivers significant improvements in text/image/audio/video understanding and text/audio generation, and supports new features such as controllable real-time voice conversations and multimodal real-time interaction. The main features of Baichuan-Omni-1.5 include:
36
-
37
- - 🔥 **Possess Multimodal Understanding and Interaction Capabilities.**
38
- Baichuan-Omni-1.5 not only accepts images, video, text, and audio as input and generates high-quality text and speech output, but also **supports continuous video and audio streaming and real-time voice interaction with users**. On OmniBench, a comprehensive benchmark for omni-modal understanding, Baichuan-Omni-1.5 reaches the top tier of the open-source community and surpasses GPT-4o-mini.
39
-
40
- - 💪 **Strong Visual Capability.**
41
- Baichuan-Omni-1.5 scores an average of 73.3 on the OpenCompass leaderboard (averaged over 10 mainstream multimodal evaluation benchmarks). **At only 7B parameters, it surpasses mainstream proprietary multimodal models such as GPT-4o-mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding**. Its video understanding also exceeds GPT-4V, Claude 3.5 Sonnet, and open-source omni-modal models.
42
-
43
- - 🚀 **Leading Medical Image Understanding Capabilities.**
44
- Baichuan-Omni-1.5 achieves the best performance on GMAI-MMBench and OpenMM-Medical. Using only a 7B LLM, its average score on OpenMM-Medical exceeds Qwen2-VL-72B by about 3 points (83.8% vs. 80.7%).
45
-
46
- - 🎙 **Excellent Voice Capabilities.**
47
- Baichuan-Omni-1.5 **supports high-quality, controllable, bilingual (Chinese and English) real-time voice conversations**. It **outperforms GPT-4o-realtime** on speech understanding tasks (such as ASR and STT) and demonstrates **the highest speech generation performance among open-source models** in both semantic and acoustic evaluations of voice conversations.
48
-
49
- - 🎬 **Powerful Real-world Understanding and Other Features.**
50
- Baichuan-Omni-1.5 further optimizes the visual understanding capabilities of Baichuan-omni. It can process images of any aspect ratio at up to 1.8 million pixels (e.g., 1344x1344). It scores 68.8 on RealWorldQA, **surpassing proprietary models such as GPT-4o-mini** and recently open-sourced omni-modal models, and scores 85.6/83.6 on the English/Chinese subsets of MMBench, placing it in the first tier of models of its size.
51
-
52
- - 💫 **Provides [🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5-Base) and [🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5).**
53
- Baichuan-Omni-1.5-Base is a high-performance omni-modal foundation model. Building on this base, Baichuan-Omni-1.5 is trained end-to-end on high-quality omni-modal alignment and multimodal instruction data.
54
 
55
- **Model Architecture**
56
- <div align="center">
57
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/train-pipeline.png", width=80%></img>
58
-
59
  </div>
60
 
61
- <br>
62
-
63
- - **End-to-end Omni-modal Architecture.** We carefully design **multi-stage, end-to-end** progressive training of the different modal encoding/decoding modules to make full use of the rich knowledge in each modality, so that knowledge from different modalities can complement one another.
64
- Notably, the model is trained fully end-to-end with a next-token-prediction (NTP) loss throughout the pre-training stage.
65
- - **High-quality Controllable Audio Solution.** Multimodal system prompts have been redesigned to include a traditional text system prompt and a **speech system prompt** that specifies the model's voice. This provides the flexibility to control the voice style through text or speech samples at inference time, and supports advanced capabilities such as end-to-end voice cloning and timbre creation (an illustrative prompt layout is sketched below).
66
-
67
-
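The speech system prompt mentioned above is passed alongside the ordinary text system prompt. The exact request schema is defined by the repository's custom code, so the layout below is only an illustrative sketch with hypothetical field names (`system_audio`, `audio`, `type`), not the project's actual API.

```python
# Illustrative sketch only: a hypothetical message layout combining a text
# system prompt with a speech system prompt (a reference voice sample).
# All field names are assumptions, not Baichuan-Omni-1.5's real schema.
messages = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "system_audio", "audio": "reference_voice.wav"},  # controls timbre/style
    {"role": "user", "content": [
        {"type": "image", "path": "chart.png"},
        {"type": "text", "text": "Describe this chart out loud."},
    ]},
]

for message in messages:
    print(message["role"], "->", message.get("content", message.get("audio")))
```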
68
- ### Open-source Evaluation Datasets
69
-
70
- **OpenMM-Medical**
71
-
72
- To comprehensively evaluate the model's multimodal medical capabilities, we constructed OpenMM-Medical, which draws on 42 publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images, each paired with a multiple-choice question (a minimal scoring sketch follows).
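Since every OpenMM-Medical image is paired with a multiple-choice question, scoring reduces to option-level accuracy. The snippet below is a minimal, self-contained sketch of that scoring loop; the field names (`question`, `options`, `image`, `answer`) and the `predict` stub are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal multiple-choice accuracy scoring, as used for benchmarks like
# OpenMM-Medical. Field names and predict() are illustrative placeholders.
def predict(question: str, options: list[str], image_path: str) -> str:
    """Stand-in for a model call; returns an option letter."""
    return "A"

def accuracy(samples: list[dict]) -> float:
    correct = sum(
        predict(s["question"], s["options"], s["image"]) == s["answer"]
        for s in samples
    )
    return correct / max(len(samples), 1)

demo = [{
    "question": "What imaging modality is shown?",
    "options": ["A. X-ray", "B. MRI", "C. Fundus photo", "D. Microscopy"],
    "image": "img_0001.png",
    "answer": "A",
}]
print(f"accuracy: {accuracy(demo):.2%}")
```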
73
-
74
- **OpenAudioBench**
75
-
76
- To efficiently assess the model's "IQ", we developed OpenAudioBench, comprising five end-to-end audio-understanding sub-datasets: four public benchmarks (Llama Questions, Web Questions, TriviaQA, AlpacaEval) and a speech logical-reasoning dataset created internally by the Baichuan team, totaling 2,701 entries. Together they reflect the model's comprehensive "IQ" level.
77
-
78
- <!-- **High-quality Medical Image Evaluation Dataset--Openmm-Medical**
79
-
80
- - We have built a more diverse medical evaluation dataset named **Openmm-Medical** to evaluate large models in medical scenarios.
81
- - The images in Openmm-Medical come from **42 public medical image datasets**, such as ACRIMA (fundus images), BioMediTech (microscope images), and CoronaHack (X-rays).
82
- - **Openmm-Medical contains a total of 88,996 images**, and each image is designed as a **multiple-choice question to facilitate the evaluation of different large models.**
83
- - To promote the development of omnimodal large models in the medical field, we will soon **open** this evaluation dataset.
84
- -->
85
-
86
- ### Evaluation
87
 
88
- We suggest that readers refer to our [**Github**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/) for more details.
89
 
90
  <div align="center">
91
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/performance.png" width="80%">
92
  </div>
93
-
94
  <br>
95
 
96
- <details>
97
-
98
- <summary>click to view</summary>
99
 
100
- #### Pure Text Understanding
101
  <div align="center">
102
- <table style="margin: 0 auto; text-align: center;">
103
- <thead>
104
- <tr>
105
- <th class="tg-c3ow" colspan="7">Comprehensive Tasks</th>
106
- </tr>
107
- </thead>
108
- <tbody>
109
- <tr>
110
- <td>Model</td>
111
- <td>Size</td>
112
- <td>MMLU (Acc.)</td>
113
- <td>CMMLU (Acc.)</td>
114
- <td>AGIEval (Acc.)</td>
115
- <td>C-Eval (Acc.)</td>
116
- <td>GAOKAO (Acc.)</td>
117
- </tr>
118
- <tr>
119
- <td colspan="7">Proprietary Models</td>
120
- </tr>
121
- <tr>
122
- <td>GPT 4o</td>
123
- <td>-</td>
124
- <td><b>88.0♢<br></td>
125
- <td><b>78.3♢<br></td>
126
- <td><b>62.3♢<br></td>
127
- <td><b>86.0♢<br></td>
128
- <td>-</td>
129
- </tr>
130
- <tr>
131
- <td>GPT 4o mini</td>
132
- <td>-</td>
133
- <td>82.0</td>
134
- <td>67.6</td>
135
- <td>52.2</td>
136
- <td>63.6</td>
137
- <td>70.8</td>
138
- </tr>
139
- <tr>
140
- <td colspan="7">Open-source Models (Pure text)</td>
141
- </tr>
142
- <tr>
143
- <td>MAP-Neo</td>
144
- <td>7B</td>
145
- <td>58.2</td>
146
- <td>55.1</td>
147
- <td>33.9</td>
148
- <td>57.5</td>
149
- <td>-</td>
150
- </tr>
151
- <tr>
152
- <td>Qwen1.5-Chat</td>
153
- <td>7B</td>
154
- <td>61.5</td>
155
- <td>68.0</td>
156
- <td>39.3</td>
157
- <td>68.8</td>
158
- <td>-</td>
159
- </tr>
160
- <tr>
161
- <td>Llama3-Instruct</td>
162
- <td>8B</td>
163
- <td>67.1</td>
164
- <td>51.7</td>
165
- <td>38.4</td>
166
- <td>50.7</td>
167
- <td>-</td>
168
- </tr>
169
- <tr>
170
- <td>OLMo</td>
171
- <td>7B</td>
172
- <td>28.4</td>
173
- <td>25.6</td>
174
- <td>19.9</td>
175
- <td>27.3</td>
176
- <td>-</td>
177
- </tr>
178
- <tr>
179
- <td colspan="7">Open-source Models (Omni-modal)</td>
180
- </tr>
181
- <tr>
182
- <td>VITA</td>
183
- <td>8x7B</td>
184
- <td>71.0*</td>
185
- <td>46.6</td>
186
- <td>46.2*</td>
187
- <td>56.7*</td>
188
- <td>-</td>
189
- </tr>
190
- <tr>
191
- <td>VITA-1.5</td>
192
- <td>7B</td>
193
- <td>71.0</td>
194
- <td>75.1</td>
195
- <td>47.9</td>
196
- <td>65.6</td>
197
- <td>57.4</td>
198
- </tr>
199
- <tr>
200
- <td>Baichuan-Omni</td>
201
- <td>7B</td>
202
- <td>65.3</td>
203
- <td>72.2</td>
204
- <td>47.7</td>
205
- <td>68.9</td>
206
- <td>-</td>
207
- </tr>
208
- <tr>
209
- <td>MiniCPM-o 2.6</td>
210
- <td>7B</td>
211
- <td>65.3</td>
212
- <td>63.3</td>
213
- <td>50.9</td>
214
- <td>61.5</td>
215
- <td>56.3</td>
216
- </tr>
217
- <tr>
218
- <td><b>Baichuan-Omni-1.5<br></td>
219
- <td>7B</td>
220
- <td>72.2</td>
221
- <td>75.5</td>
222
- <td>54.4</td>
223
- <td>73.1</td>
224
- <td><b>73.5<br></td>
225
- </tr>
226
- </tbody>
227
- </table>
228
  </div>
229
 
230
- </details>
231
-
232
-
233
- <details>
234
 
235
- <summary>click to view</summary>
236
-
237
- #### Image Understanding
238
 
239
  <div align="center">
240
- <table style="margin: 0 auto; text-align: center;">
241
- <thead>
242
- <tr>
243
- <th class="tg-c3ow" colspan="9">Multi-choice &amp; Yes-or-No Question</th>
244
- </tr>
245
- </thead>
246
- <tbody>
247
- <tr>
248
- <td>Model</td>
249
- <td>Size</td>
250
- <td>MMBench-EN (Acc.)</td>
251
- <td>MMbench-CN (Acc.)</td>
252
- <td>SEED-IMG (Acc.)</td>
253
- <td>MMMU-val (Acc.)</td>
254
- <td>HallusionBench (Acc.)</td>
255
- </tr>
256
- <tr>
257
- <td colspan="9">Proprietary Models</td>
258
- </tr>
259
- <tr>
260
- <td>GPT-4o</td>
261
- <td>-</td>
262
- <td>83.4♢</td>
263
- <td>82.1♢</td>
264
- <td>-</td>
265
- <td><b>69.1♢<br></td>
266
- <td><b>55.0♢<br></td>
267
- </tr>
268
- <tr>
269
- <td>GPT-4o-mini</td>
270
- <td>-</td>
271
- <td>77.7</td>
272
- <td>76.9</td>
273
- <td>72.3</td>
274
- <td>60.0♢</td>
275
- <td>46.1♢</td>
276
- </tr>
277
- <tr>
278
- <td colspan="9">Open Source Models (Vision-Language)</td>
279
- </tr>
280
- <tr>
281
- <td>Qwen2-VL-7B</td>
282
- <td>7B</td>
283
- <td><b>86.4<br></td>
284
- <td>81.9</td>
285
- <td><b>76.5<br></td>
286
- <td>52.7</td>
287
- <td>50.6∗</td>
288
- </tr>
289
- <tr>
290
- <td>MiniCPM-Llama3-V 2.5</td>
291
- <td>8B</td>
292
- <td>76.7</td>
293
- <td>73.3</td>
294
- <td>72.4</td>
295
- <td>45.8∗</td>
296
- <td>42.5</td>
297
- </tr>
298
- <tr>
299
- <td colspan="9">Open Source Models (Omni-modal)</td>
300
- </tr>
301
- <tr>
302
- <td>VITA</td>
303
- <td>8x7B</td>
304
- <td>74.7</td>
305
- <td>71.4</td>
306
- <td>72.6</td>
307
- <td>45.3</td>
308
- <td>39.7∗</td>
309
- </tr>
310
- <tr>
311
- <td>VITA-1.5</td>
312
- <td>7B</td>
313
- <td>80.8</td>
314
- <td>80.2</td>
315
- <td>74.2</td>
316
- <td>53.1</td>
317
- <td>44.1</td>
318
- </tr>
319
- <tr>
320
- <td>Baichuan-Omni</td>
321
- <td>7B</td>
322
- <td>76.2</td>
323
- <td>74.9</td>
324
- <td>74.1</td>
325
- <td>47.3</td>
326
- <td>47.8</td>
327
- </tr>
328
- <tr>
329
- <td>MiniCPM-o 2.6</td>
330
- <td>7B</td>
331
- <td>83.6</td>
332
- <td>81.8</td>
333
- <td>75.4</td>
334
- <td>51.1</td>
335
- <td>50.1</td>
336
- </tr>
337
- <tr>
338
- <td><b>Baichuan-Omni-1.5<br></td>
339
- <td>7B</td>
340
- <td>85.6</td>
341
- <td><b>83.6<br></td>
342
- <td>75.7</td>
343
- <td>53.9</td>
344
- <td>49.7</td>
345
- </tr>
346
- </tbody>
347
- </table>
348
  </div>
349
 
350
 
351
- <br>
352
-
 
353
  <div align="center">
354
- <table style="margin: 0 auto; text-align: center;">
355
- <thead>
356
- <tr>
357
- <th class="tg-c3ow" colspan="9">Visual Question Answering</th>
358
- </tr>
359
- </thead>
360
- <tbody>
361
- <tr>
362
- <td>Model</td>
363
- <td>Size</td>
364
- <td>RealWorldQA (Acc.)</td>
365
- <td>MathVista-mini (Acc.)</td>
366
- <td>TextVQA-val (Acc.)</td>
367
- <td>ChartQA (Acc.)</td>
368
- <td>OCRBench (Acc.)</td>
369
- </tr>
370
- <tr>
371
- <td colspan="8">Proprietary Models</td>
372
- </tr>
373
- <tr>
374
- <td>GPT-4o</td>
375
- <td>-</td>
376
- <td><b>75.4♢<br></td>
377
- <td>63.8♢</td>
378
- <td>-</td>
379
- <td>85.7♢</td>
380
- <td>73.6♢</td>
381
- </tr>
382
- <tr>
383
- <td>GPT-4o-mini</td>
384
- <td>-</td>
385
- <td>66.3</td>
386
- <td>53.4</td>
387
- <td>66.8</td>
388
- <td>-</td>
389
- <td>77.4</td>
390
- </tr>
391
- <tr>
392
- <td colspan="8">Open Source Models (Vision-Language)</td>
393
- </tr>
394
- <tr>
395
- <td>Qwen2-VL-7B</td>
396
- <td>7B</td>
397
- <td>69.7</td>
398
- <td>58.2∗</td>
399
- <td><b>84.3∗<br></td>
400
- <td>83.0∗</td>
401
- <td>84.5∗</td>
402
- </tr>
403
- <tr>
404
- <td>MiniCPM-Llama3-V 2.5</td>
405
- <td>8B</td>
406
- <td>63.5</td>
407
- <td>54.3∗</td>
408
- <td>76.6</td>
409
- <td>72.0</td>
410
- <td>72.5</td>
411
- </tr>
412
- <tr>
413
- <td colspan="8">Open Source Models (Omni-modal)</td>
414
- </tr>
415
- <tr>
416
- <td>VITA</td>
417
- <td>8x7B</td>
418
- <td>59.0</td>
419
- <td>44.9∗</td>
420
- <td>71.8</td>
421
- <td>76.6</td>
422
- <td>68.5∗</td>
423
- </tr>
424
- <tr>
425
- <td>VITA-1.5</td>
426
- <td>7B</td>
427
- <td>66.8</td>
428
- <td><b>66.5<br></td>
429
- <td>74.9</td>
430
- <td>79.6</td>
431
- <td>73.3</td>
432
- </tr>
433
- <tr>
434
- <td>Baichuan-Omni</td>
435
- <td>7B</td>
436
- <td>62.6</td>
437
- <td>51.9</td>
438
- <td>74.3</td>
439
- <td>79.6</td>
440
- <td>70.0</td>
441
- </tr>
442
- <tr>
443
- <td>MiniCPM-o 2.6</td>
444
- <td>7B</td>
445
- <td>67.7</td>
446
- <td>64.6</td>
447
- <td>80.1</td>
448
- <td><b>87.6<br></td>
449
- <td><b>89.7∗<br></td>
450
- </tr>
451
- <tr>
452
- <td>Baichuan-Omni-1.5 </td>
453
- <td>7B</td>
454
- <td>68.8</td>
455
- <td>63.6</td>
456
- <td>83.2</td>
457
- <td>84.9</td>
458
- <td>84.0</td>
459
- </tr>
460
- </tbody>
461
- </table>
462
  </div>
463
 
 
464
 
465
- </details>
466
-
467
- <details>
468
-
469
- <summary>click to view</summary>
470
-
471
- #### Video Understanding
472
  <div align="center">
473
- <table style="margin: 0 auto; text-align: center;">
474
- <thead>
475
- <tr>
476
- <th colspan="7">General VQA&nbsp;&nbsp;&nbsp;</th>
477
- </tr>
478
- </thead>
479
- <tbody>
480
- <tr>
481
- <td>Model</td>
482
- <td>Size</td>
483
- <td># Frames</td>
484
- <td>MVBench (Acc.)</td>
485
- <td>Egoschema (Acc.)</td>
486
- <td>VideoMME (Acc.)</td>
487
- <td>Perception-Test (Acc.)</td>
488
- </tr>
489
- <tr>
490
- <td colspan="7">Proprietary Models</td>
491
- </tr>
492
- <tr>
493
- <td>Gemini 1.5 Pro</td>
494
- <td>-</td>
495
- <td>-</td>
496
- <td><b>81.3♢<br></td>
497
- <td>63.2*</td>
498
- <td><b>75.0♢<br></td>
499
- <td>-</td>
500
- </tr>
501
- <tr>
502
- <td>GPT 4o mini</td>
503
- <td>-</td>
504
- <td>-</td>
505
- <td>55.2</td>
506
- <td>58.5</td>
507
- <td>63.6</td>
508
- <td>48.2</td>
509
- </tr>
510
- <tr>
511
- <td>GPT 4o</td>
512
- <td>-</td>
513
- <td>-</td>
514
- <td>-</td>
515
- <td><b>77.2*<br></td>
516
- <td>71.9♢</td>
517
- <td>-</td>
518
- </tr>
519
- <tr>
520
- <td>GPT 4V</td>
521
- <td>-</td>
522
- <td>-</td>
523
- <td>43.7♢</td>
524
- <td>55.6*</td>
525
- <td>59.9♢</td>
526
- <td>-</td>
527
- </tr>
528
- <tr>
529
- <td colspan="7">Open-source Models (Vision-language)</td>
530
- </tr>
531
- <tr>
532
- <td>Qwen2-VL-7B</td>
533
- <td>7B</td>
534
- <td>2 fps (max 768)</td>
535
- <td>67.0* | 64.4</td>
536
- <td>66.7* | 66.6</td>
537
- <td>63.3* | 59.0</td>
538
- <td>62.3* | 60.3</td>
539
- </tr>
540
- <tr>
541
- <td>AnyGPT</td>
542
- <td>8B</td>
543
- <td>48</td>
544
- <td>33.2</td>
545
- <td>32.1</td>
546
- <td>29.8</td>
547
- <td>29.1</td>
548
- </tr>
549
- <tr>
550
- <td>VideoLLaMA 2</td>
551
- <td>7B</td>
552
- <td>16</td>
553
- <td>54.6*</td>
554
- <td>51.7*</td>
555
- <td>46.6*</td>
556
- <td>51.4*</td>
557
- </tr>
558
- <tr>
559
- <td>VideoChat2</td>
560
- <td>7B</td>
561
- <td>16</td>
562
- <td>51.1*</td>
563
- <td>42.1♢</td>
564
- <td>33.7♢</td>
565
- <td>47.3♢</td>
566
- </tr>
567
- <tr>
568
- <td>LLaVA-NeXT-Video</td>
569
- <td>7B</td>
570
- <td>32</td>
571
- <td>46.5♢</td>
572
- <td>43.9♢</td>
573
- <td>33.7♢</td>
574
- <td>48.8♢</td>
575
- </tr>
576
- <tr>
577
- <td>Video-LLaVA</td>
578
- <td>7B</td>
579
- <td>8</td>
580
- <td>41.0♢</td>
581
- <td>38.4♢</td>
582
- <td>39.9♢</td>
583
- <td>44.3♢</td>
584
- </tr>
585
- <tr>
586
- <td colspan="7">Open-source Models (Omni-modal)</td>
587
- </tr>
588
- <tr>
589
- <td>VITA</td>
590
- <td>8x7B</td>
591
- <td>1 fps (max 32)</td>
592
- <td>53.4</td>
593
- <td>53.9</td>
594
- <td>56.1</td>
595
- <td>56.2</td>
596
- </tr>
597
- <tr>
598
- <td>VITA-1.5</td>
599
- <td>7B</td>
600
- <td>1 fps (max 32)</td>
601
- <td>55.5</td>
602
- <td>54.7</td>
603
- <td>57.3</td>
604
- <td>57.6</td>
605
- </tr>
606
- <tr>
607
- <td>Baichuan-Omni</td>
608
- <td>7B</td>
609
- <td>1 fps (max 32)</td>
610
- <td>60.9</td>
611
- <td>58.8</td>
612
- <td>58.2</td>
613
- <td>56.8</td>
614
- </tr>
615
- <tr>
616
- <td>MiniCPM-o 2.6</td>
617
- <td>7B</td>
618
- <td>1 fps (max 64)</td>
619
- <td>58.6</td>
620
- <td>50.7</td>
621
- <td>63.4</td>
622
- <td>66.6</td>
623
- </tr>
624
- <tr>
625
- <td>Baichuan-Omni-1.5</td>
626
- <td>7B</td>
627
- <td>1 fps (max 32)</td>
628
- <td> 63.7 </td>
629
- <td> 62.4 </td>
630
- <td> 60.1 </td>
631
- <td> <b>68.9 <br> </td>
632
- </tr>
633
- </tbody>
634
- </table>
635
  </div>
636
 
637
  <br>
638
 
639
- <div align="center">
640
- <table style="margin: 0 auto; text-align: center;">
641
- <thead>
642
- <tr>
643
- <th colspan="7">Open-ended VQA</th>
644
- </tr>
645
- </thead>
646
- <tbody>
647
- <tr>
648
- <td rowspan="2">Model</td>
649
- <td rowspan="2">Size</td>
650
- <td rowspan="2"># Frames</td>
651
- <td colspan="2">ActivityNet-QA</td>
652
- <td colspan="2">MSVD-QA</td>
653
- </tr>
654
- <tr>
655
- <td>(Acc.)</td>
656
- <td>(Score)</td>
657
- <td>(Acc.)</td>
658
- <td>(Score)</td>
659
- </tr>
660
- <tr>
661
- <td colspan="7">Proprietary Models</td>
662
- </tr>
663
- <tr>
664
- <td>Gemini 1.5 Pro</td>
665
- <td>-</td>
666
- <td>-</td>
667
- <td>56.7*</td>
668
- <td>-</td>
669
- <td>-</td>
670
- <td>-</td>
671
- </tr>
672
- <tr>
673
- <td>GPT 4o mini</td>
674
- <td>-</td>
675
- <td>1 fps (max 32)</td>
676
- <td>62.1</td>
677
- <td>3.1</td>
678
- <td>67.5</td>
679
- <td>3.3</td>
680
- </tr>
681
- <tr>
682
- <td>GPT 4o</td>
683
- <td>-</td>
684
- <td>-</td>
685
- <td>61.9*</td>
686
- <td>-</td>
687
- <td>-</td>
688
- <td>-</td>
689
- </tr>
690
- <tr>
691
- <td>GPT 4V</td>
692
- <td>-</td>
693
- <td>-</td>
694
- <td>59.5*</td>
695
- <td>-</td>
696
- <td>-</td>
697
- <td>-</td>
698
- </tr>
699
- <tr>
700
- <td colspan="7">Open-source Models (Vision-language)</td>
701
- </tr>
702
- <tr>
703
- <td>Qwen2 VL</td>
704
- <td>7B</td>
705
- <td>2 fps (max 768)</td>
706
- <td>17.4</td>
707
- <td>1.9</td>
708
- <td>61.1</td>
709
- <td>3.5</td>
710
- </tr>
711
- <tr>
712
- <td>VideoLLaMA 2</td>
713
- <td>7B</td>
714
- <td>16</td>
715
- <td>50.2*</td>
716
- <td>3.3*</td>
717
- <td>70.9*</td>
718
- <td>3.8*</td>
719
- </tr>
720
- <tr>
721
- <td>VideoChat2</td>
722
- <td>7B</td>
723
- <td>16</td>
724
- <td>49.1*</td>
725
- <td>3.3*</td>
726
- <td>70.0*</td>
727
- <td>3.9*</td>
728
- </tr>
729
- <tr>
730
- <td>LLaVA-NeXT-Video</td>
731
- <td>7B</td>
732
- <td>32</td>
733
- <td>53.5*</td>
734
- <td>3.2*</td>
735
- <td>67.4</td>
736
- <td>3.4</td>
737
- </tr>
738
- <tr>
739
- <td>Video-LLaVA</td>
740
- <td>7B</td>
741
- <td>8</td>
742
- <td>45.3*</td>
743
- <td>3.3*</td>
744
- <td>70.7*</td>
745
- <td>3.9*</td>
746
- </tr>
747
- <tr>
748
- <td colspan="7">Open-source Models (Omni-modal)</td>
749
- </tr>
750
- <tr>
751
- <td>VITA</td>
752
- <td>8x7B</td>
753
- <td>1 fps (max 32)</td>
754
- <td>55.0</td>
755
- <td>3.5</td>
756
- <td>63.9</td>
757
- <td>3.7</td>
758
- </tr>
759
- <tr>
760
- <td>VITA-1.5</td>
761
- <td>7B</td>
762
- <td>1 fps (max 32)</td>
763
- <td>59.6</td>
764
- <td>3.0</td>
765
- <td>67.6</td>
766
- <td>3.3</td>
767
- </tr>
768
- <tr>
769
- <td>Baichuan-Omni</td>
770
- <td>7B</td>
771
- <td>1 fps (max 48)</td>
772
- <td>58.6</td>
773
- <td><b>3.7<br></td>
774
- <td>72.2</td>
775
- <td> <b>4.0<br> </td>
776
- </tr>
777
- <tr>
778
- <td>MiniCPM-o 2.6</td>
779
- <td>7B</td>
780
- <td>1 fps (max 64)</td>
781
- <td><b>63.0<br></td>
782
- <td>3.1</td>
783
- <td>73.7</td>
784
- <td>3.6</td>
785
- </tr>
786
- <tr>
787
- <td>Baichuan-Omni-1.5</td>
788
- <td>7B</td>
789
- <td>1 fps (max 48)</td>
790
- <td> 62.0</td>
791
- <td> 3.1</td>
792
- <td> <b> 74.2 <br></td>
793
- <td> 3.6</td>
794
- </tr>
795
- </tbody>
796
- </table>
797
- </div>
798
-
799
- </details>
800
-
801
-
802
- <details>
803
-
804
- <summary>click to view</summary>
805
-
806
- #### Audio Comprehensive and Speech Generation
807
- <div align="center">
808
- <table style="margin: 0 auto; text-align: center;">
809
- <thead>
810
- <tr>
811
- <th colspan="12">Audio Comprehensive Capacity</th>
812
- </tr></thead>
813
- <tbody>
814
- <tr>
815
- <td rowspan="2">Model</td>
816
- <td rowspan="2">Size</td>
817
- <td colspan="2">Reasoning QA</td>
818
- <td colspan="2">Llama Questions</td>
819
- <td colspan="2">Web Questions</td>
820
- <td colspan="2">TriviaQA</td>
821
- <td colspan="2">AlpacaEval</td>
822
- </tr>
823
- <tr>
824
- <td>s→t</td>
825
- <td>s→s</td>
826
- <td>s→t</td>
827
- <td>s→s</td>
828
- <td>s→t</td>
829
- <td>s→s</td>
830
- <td>s→t</td>
831
- <td>s→s</td>
832
- <td>s→t</td>
833
- <td>s→s</td>
834
- </tr>
835
- <tr>
836
- <td colspan="12">Proprietary Models</td>
837
- </tr>
838
- <tr>
839
- <td>GPT-4o-Audio</td>
840
- <td>-</td>
841
- <td><b>55.6</td>
842
- <td>-</td>
843
- <td><b>88.4</td>
844
- <td>-</td>
845
- <td><b>8.10</td>
846
- <td>-</td>
847
- <td><b>9.06</td>
848
- <td>-</td>
849
- <td><b>8.01</td>
850
- <td>-</td>
851
- </tr>
852
- <tr>
853
- <td colspan="12">Open-source Models (Pure Audio)</td>
854
- </tr>
855
- <tr>
856
- <td>GLM-4-Voice</td>
857
- <td>9B</td>
858
- <td>-</td>
859
- <td>26.5</td>
860
- <td>-</td>
861
- <td>71.0</td>
862
- <td>-</td>
863
- <td>5.15</td>
864
- <td>-</td>
865
- <td>4.66</td>
866
- <td>-</td>
867
- <td>4.89</td>
868
- </tr>
869
- <tr>
870
- <td colspan="12">Open-source Models (Omni-modal)</td>
871
- </tr>
872
- <tr>
873
- <td>VITA-1.5</td>
874
- <td>7B</td>
875
- <td>41.0</td>
876
- <td>-</td>
877
- <td>74.2</td>
878
- <td>-</td>
879
- <td>5.73</td>
880
- <td>-</td>
881
- <td>4.68</td>
882
- <td>-</td>
883
- <td>6.82</td>
884
- <td>-</td>
885
- </tr>
886
- <tr>
887
- <td>MiniCPM-o 2.6</td>
888
- <td>7B</td>
889
- <td>38.6</td>
890
- <td>-</td>
891
- <td>77.8</td>
892
- <td>-</td>
893
- <td>6.86</td>
894
- <td>-</td>
895
- <td>6.19</td>
896
- <td>-</td>
897
- <td>5.18</td>
898
- <td>-</td>
899
- </tr>
900
- <tr>
901
- <td><b>Baichuan-Omni-1.5</td>
902
- <td>7B</td>
903
- <td>50.0</td>
904
- <td><b>40.9</td>
905
- <td>78.5</td>
906
- <td><b>75.3</td>
907
- <td>5.91</td>
908
- <td><b>5.52</td>
909
- <td>5.72</td>
910
- <td>5.31</td>
911
- <td>7.79</td>
912
- <td><b>6.94</td>
913
- </tr>
914
- </tbody>
915
- </table>
916
- </div>
917
-
918
-
919
- </details>
920
-
921
 
922
 
923
- <details>
924
 
925
- <summary>click to view</summary>
926
-
927
- #### Omni-modal Understanding
928
-
929
- <div align="center">
930
- <table style="margin: 0 auto; text-align: center;">
931
- <thead>
932
- <tr>
933
- <th colspan="7">Omni-Understanding</th>
934
- </tr>
935
- <thead>
936
- <tbody>
937
- <tr>
938
- <td>Model</td>
939
- <td>Size</td>
940
- <td>Image & Audio</td>
941
- <td>Image Caption & Audio</td>
942
- <td>Image & Audio Transcript</td>
943
- <td>Image Caption & Audio Transcript</td>
944
- </tr>
945
- </thead>
946
- <tr>
947
- <td colspan="6">Proprietary Models</td>
948
- </tr>
949
- <tr>
950
- <td>GPT4o-mini</td>
951
- <td>-</td>
952
- <td>-</td>
953
- <td>-</td>
954
- <td>37.0</td>
955
- <td>37.7</td>
956
- </tr>
957
- <tr>
958
- <td colspan="6">Open-source Models (Omni-modal)</td>
959
- </tr>
960
- <tr>
961
- <td>VITA</td>
962
- <td>8x7B</td>
963
- <td>33.1</td>
964
- <td>31.8</td>
965
- <td>42.0</td>
966
- <td>44.2</td>
967
- </tr>
968
- <tr>
969
- <td>VITA-1.5</td>
970
- <td>7B</td>
971
- <td>33.4</td>
972
- <td>29.6</td>
973
- <td>48.5</td>
974
- <td><b>47.2<br></td>
975
- </tr>
976
- <tr>
977
- <td>Baichuan-Omni</td>
978
- <td>7B</td>
979
- <td>32.2</td>
980
- <td>26.5</td>
981
- <td>42.6</td>
982
- <td>44.2</td>
983
- </tr>
984
- <tr>
985
- <td>MiniCPM-o 2.6</td>
986
- <td>7B</td>
987
- <td>40.5</td>
988
- <td>30.8</td>
989
- <td><b>53.2<br></td>
990
- <td>46.3</td>
991
- </tr>
992
- <tr>
993
- <td><b>Baichuan-Omni-1.5<br></td>
994
- <td>7B</td>
995
- <td><b>42.9<br></td>
996
- <td><b>37.7<br></td>
997
- <td>47.9</td>
998
- <td>46.9</td>
999
- </tr>
1000
- </tbody>
1001
- </table>
1002
- </div>
1003
-
1004
- </details>
1005
-
1006
- <details>
1007
-
1008
- <summary>click to view</summary>
1009
 
1010
- #### Medical Image Understanding Capabilities
1011
 
 
1012
  <div align="center">
1013
- <table style="margin: 0 auto; text-align: center;">
1014
- <thead>
1015
- <tr>
1016
- <th colspan="7">Medical Understanding&nbsp;&nbsp;&nbsp;</th>
1017
- </tr>
1018
- </thead>
1019
- <tbody>
1020
- <tr>
1021
- <td>Model</td>
1022
- <td>Size</td>
1023
- <td>GMAI-MMB-VAL (Acc.)</td>
1024
- <td>OpenMM-Medical (Acc.)</td>
1025
- </tr>
1026
- </thead>
1027
- <tr>
1028
- <td colspan="4">Proprietary Models</td>
1029
- </tr>
1030
- <tr>
1031
- <td>GPT4o-mini</td>
1032
- <td>-</td>
1033
- <td>46.4</td>
1034
- <td>74.3</td>
1035
- </tr>
1036
- <tr>
1037
- <td colspan="4">Open-source Models (Vision-Language)</td>
1038
- </tr>
1039
- <tr>
1040
- <td>Qwen2 VL</td>
1041
- <td>7B</td>
1042
- <td>46.3</td>
1043
- <td>76.9</td>
1044
- </tr>
1045
- <tr>
1046
- <td>Qwen2 VL</td>
1047
- <td>72B</td>
1048
- <td><b>50.7<br></td>
1049
- <td>80.7</td>
1050
- </tr>
1051
- <tr>
1052
- <td colspan="4">Open-source Models (Omni-modal)</td>
1053
- </tr>
1054
- <tr>
1055
- <td>VITA-1.5</td>
1056
- <td>7B</td>
1057
- <td>36.7</td>
1058
- <td>67.1</td>
1059
- </tr>
1060
- <tr>
1061
- <td>MiniCPM-o 2.6</td>
1062
- <td>7B</td>
1063
- <td>41.5</td>
1064
- <td>73.6</td>
1065
- </tr>
1066
- <tr>
1067
- <td><b>Baichuan-Omni-1.5<br></td>
1068
- <td>7B</td>
1069
- <td>49.9</td>
1070
- <td><b>83.8<br></td>
1071
- </tr>
1072
- </tbody>
1073
- </table>
1074
- </div>
1075
-
1076
- </details>
1077
-
1078
- ## Examples
1079
- <br>
1080
-
1081
- <div style="display: flex; flex-direction: column; align-items: center;">
1082
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/pipeline.png" alt="pipeline" style="margin-bottom: 5px;">
1083
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/math.png" alt="math" style="margin-bottom: 5px;">
1084
- <img src="https://github.com/baichuan-inc/Baichuan-Omni-1.5/raw/main/assets/fly_bill.png" alt="fly_bill" style="margin-bottom: 5px;">
1085
  </div>
1086
 
1087
 
1088
- ## 🚀 Quick Start
1089
- We recommend that interested readers visit our [**Github**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/) repo for more details; a hedged loading sketch is shown below.
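Pending the full quick start in the GitHub repo, here is a minimal, hedged loading sketch. Loading with `trust_remote_code=True` follows standard Hugging Face practice for repos that ship custom code; the exact auto class, generation call, and especially the image/audio input interface are assumptions to verify against the repository.

```python
# Hedged sketch: load the checkpoint with Hugging Face Transformers.
# The repo ships custom code, so trust_remote_code=True is required.
# The multimodal (image/video/audio) input pipeline is repo-specific and is
# not shown here -- see the GitHub repo for the actual interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan-Omni-1d5"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Plain-text generation through the standard API as a smoke test.
inputs = tokenizer("Briefly introduce Baichuan-Omni-1.5.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```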
1090
-
1091
-
1092
- ### Statement
1093
- We hereby declare that our team has not developed any applications based on the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models, whether on iOS, Android, the web, or any other platform. We strongly call on all users not to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models for any activities that harm national or social security or violate the law. We also ask users not to use these models for Internet services that have not undergone appropriate security review and filing. We hope that all users will abide by these principles and ensure that technology develops in a regulated and legal environment.
1094
-
1095
- We have done our best to ensure the compliance of the data used during model training. However, despite these considerable efforts, unforeseen issues may still arise given the complexity of the model and data. Therefore, we will not assume any responsibility for problems caused by the use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base open-source models, including but not limited to data security issues, public-opinion risks, or any risks and problems arising from the models being misled, abused, disseminated, or improperly exploited.
1096
 
 
 
 
 
1097
 
1098
 
1099
  ### License
1100
- Community use of Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base requires adherence to [Apache 2.0](https://github.com/baichuan-inc/Baichuan-Omni-1.5/blob/main/LICENSE) and the [Community License for Baichuan-Omni-1.5 Models](https://github.com/baichuan-inc/Baichuan-Omni-1.5/blob/main/LICENSE). The Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models support commercial use. If you plan to use these models or their derivatives for commercial purposes, please ensure that your entity meets the following conditions:
1101
-
1102
- 1. The Daily Active Users (DAU) of your or your affiliate's service or product is less than 1 million.
1103
- 2. Neither you nor your affiliates are software service providers or cloud service providers.
1104
- 3. You and your affiliates may not sublicense or otherwise re-grant the commercial license given to you to any third party without Baichuan's permission.
1105
-
1106
- Upon meeting the above conditions, you need to submit the application materials required by the Baichuan-Omni-1.5 Model Community License Agreement via the following contact email: [email protected]. Once approved, Baichuan will hereby grant you a non-exclusive, global, non-transferable, non-sublicensable, revocable commercial copyright license.
1107
 
1108
  <!-- ### Citation
 
1109
 
1110
- If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
1111
  ```bib
1112
- @article{
1113
- } -->
 
 
 
 
 
1114
  ```
 
3
  ---
4
  <div align="center">
5
 
6
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/logo.png" width="300em" ></img>
7
 
8
+ <!-- <img src="https://raw.githubusercontent.com/baichuan-inc/Baichuan-Audio/refs/heads/main/assets/logo.png" width="300em" ></img>
9
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/audiollm.png" width="300em" ></img> -->
10
+
11
+ **Open-Source End-to-End Speech Interaction Foundation Model**
12
 
13
 
14
 
15
+ <p align="center">
16
+ Baichuan-Audio <a href="https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct">🤗</a> | Baichuan-Audio-Base <a href="https://huggingface.co/baichuan-inc/Baichuan-Audio-Base">🤗</a> | Technical Report <a href="https://arxiv.org/abs/2501.15368">📖</a>
17
  </p>
18
  </p>
19
+ <p align="center">
20
+ OpenAudioBench <a href="https://huggingface.co/datasets/baichuan-inc/openAudioBench">🤗</a> | Training Data <a href="#">🤗</a> <small>(Coming Soon)</small>
21
  </p>
22
 
 
 
 
 
23
  </div>
24
 
25
 
26
+ ### Model Architecture
27
 
28
  <div align="center">
29
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/audiollm.png" width="85%">
30
  </div>
 
31
  <br>
32
 
33
+ **Baichuan-Audio** mainly consists of the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching based audio decoder. First, speech is converted into discrete audio tokens by the Baichuan-Audio Tokenizer. The Audio LLM then generates aligned text and audio tokens in an interleaved manner, switching seamlessly between the text and audio modalities through special tokens. The audio tokens are processed by an independent audio head and reconstructed into high-quality Mel spectrograms by the flow-matching based audio decoder, which are finally converted into audio waveforms by a vocoder.
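To make the data flow above concrete, here is a stub-only sketch in Python: every function is a placeholder for the component named in the paragraph (tokenizer, audio LLM, flow-matching decoder, vocoder), and the token strings are invented for illustration.

```python
# Illustrative data flow only -- all functions are stubs, not real components.
def audio_tokenizer_encode(wav_path):      # Baichuan-Audio Tokenizer: speech -> discrete tokens
    return ["<a_17>", "<a_903>", "<a_42>"]

def audio_llm_generate(prompt_tokens):     # interleaved text/audio output, switched by special tokens
    return ["Sure", ",", "<audio_start>", "<a_5>", "<a_88>", "<audio_end>"]

def flow_matching_decode(audio_tokens):    # audio tokens -> Mel spectrogram frames (dummy values)
    return [[0.0] * 80 for _ in audio_tokens]

def vocoder(mel_frames):                   # Mel spectrogram -> 24 kHz waveform samples
    return [0.0] * (len(mel_frames) * 1920)  # 24000 / 12.5 = 1920 samples per token

generated = audio_llm_generate(audio_tokenizer_encode("user_speech.wav"))
audio_tokens = [t for t in generated if t.startswith("<a_")]
text_tokens = [t for t in generated if not t.startswith("<a")]
print("text part:", "".join(text_tokens))
print(len(vocoder(flow_matching_decode(audio_tokens))), "waveform samples synthesized")
```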
 
 
34
 
35
+ - The Baichuan-Audio Tokenizer uses a 12.5 Hz frame rate. It employs the Whisper Large encoder to extract high-level audio features from Mel spectrograms, then applies an 8-layer residual vector quantizer (RVQ) to minimize information loss during quantization. To capture both semantic and acoustic information, we use Mel-spectrogram reconstruction for acoustic supervision and a pre-trained LLM for semantic supervision (a minimal RVQ sketch follows the figure below).
36
  <div align="center">
37
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/vq.png" width="30%">
38
  </div>
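The residual idea behind the 8-layer RVQ can be shown in a few lines of NumPy: each stage quantizes whatever residual the previous stage left behind, so the token for one frame is a tuple of 8 codebook indices. Codebook size and feature dimension below are made up for illustration; the real tokenizer quantizes Whisper-encoder features at 12.5 Hz.

```python
# Minimal residual vector quantization (RVQ) sketch with 8 stages.
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 8, 256, 64          # illustrative sizes only
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(x):
    """Return one code index per stage for a single feature vector."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]                             # quantize the leftover next
    return codes

def rvq_decode(codes):
    return sum(codebooks[stage][idx] for stage, idx in enumerate(codes))

frame = rng.normal(size=dim)     # stand-in for one 12.5 Hz Whisper-encoder frame
codes = rvq_encode(frame)
rel_err = np.linalg.norm(frame - rvq_decode(codes)) / np.linalg.norm(frame)
print(codes, f"relative reconstruction error: {rel_err:.2f}")
```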
39
 
40
+ - The Audio LLM generates aligned text and audio tokens in an interleaved manner, switching seamlessly between the text and audio modalities through special tokens; the audio tokens are then processed by an independent audio head.
 
 
 
41
 
42
+ - The flow-matching based audio decoder reconstructs high-quality Mel spectrograms. It is trained on 24 kHz audio to generate the target Mel spectrograms, which are then converted into audio waveforms by a vocoder (a minimal flow-matching training step is sketched after the figure below).
 
 
43
 
44
  <div align="center">
45
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/decoder.png" width="24%">
46
  </div>
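For reference, a generic conditional flow-matching training step looks like the toy PyTorch snippet below: sample a point on the straight-line path between noise and the target Mel frame, and regress the constant velocity of that path. The tiny MLP and the tensor sizes are placeholders, not the actual decoder described above.

```python
# Toy conditional flow-matching step for Mel-spectrogram generation (PyTorch).
import torch
import torch.nn as nn

mel_dim, cond_dim = 80, 32                          # illustrative sizes
net = nn.Sequential(nn.Linear(mel_dim + cond_dim + 1, 256), nn.SiLU(),
                    nn.Linear(256, mel_dim))        # stand-in for the decoder

def flow_matching_loss(mel_target, cond):
    noise = torch.randn_like(mel_target)
    t = torch.rand(mel_target.shape[0], 1)
    x_t = (1 - t) * noise + t * mel_target          # straight-line path at time t
    velocity = mel_target - noise                   # target vector field
    pred = net(torch.cat([x_t, cond, t], dim=-1))
    return ((pred - velocity) ** 2).mean()

mel = torch.randn(16, mel_dim)                      # target Mel frames (from 24 kHz audio)
cond = torch.randn(16, cond_dim)                    # conditioning from audio tokens (placeholder)
loss = flow_matching_loss(mel, cond)
loss.backward()
print(f"flow-matching loss: {loss.item():.3f}")
```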
47
 
48
 
49
+ ### Pre-training details
50
+ - #### Pre-training data
51
+ Audio training data can be broadly divided into two main types: audio understanding data and audio generation data.
52
  <div align="center">
53
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/table.png" width="80%">
54
  </div>
55
 
56
+ Audio-text paired data (e.g., ASR and TTS data) improves performance on basic speech tasks, while pure audio data enhances the ability to handle the audio modality on its own. Audio-text interleaved (INTLV) data consists of alternating text and audio segments, split at punctuation boundaries, to facilitate cross-modal knowledge transfer. Interleaved text-to-speech (ITTS) data consists of fully aligned text and audio content, aimed at strengthening the model's ability to generate audio tokens under text supervision (illustrative samples of both formats are sketched below).
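As referenced above, the two interleaved formats can be pictured as follows; the special-token names and segment boundaries are invented for illustration and do not reflect the real training schema.

```python
# Illustration only: plausible shapes of the two interleaved data formats.

# INTLV (audio-text interleaved): alternating text and audio segments,
# split at punctuation boundaries.
intlv_sample = [
    "The weather today is terrible,", "<audio_start>", "<a_12>", "<a_87>", "<audio_end>",
    "so bring an umbrella.",          "<audio_start>", "<a_3>",  "<a_54>", "<audio_end>",
]

# ITTS (interleaved text-to-speech): text fully aligned with the audio
# tokens that speak it, for text-supervised audio-token generation.
itts_sample = {
    "text":  "Bring an umbrella.",
    "audio": ["<a_41>", "<a_9>", "<a_120>", "<a_6>"],
}

print(sum(t.startswith("<a_") for t in intlv_sample), "audio tokens in the INTLV sample")
print(len(itts_sample["audio"]), "audio tokens aligned to:", itts_sample["text"])
```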
57
 
58
+ The interleaved data are collected through both crawling and synthesis, yielding a total of 142k hours of ITTS data and 393k hours of INTLV data.
59
  <div align="center">
60
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/data.png" width="80%">
61
  </div>
62
 
63
  <br>
64
 
65
+ - #### Two-stage training strategy
66
+ Conflict between the speech and text modalities can interfere with the text knowledge representations of the pre-trained LLM and degrade its intelligence. To mitigate this, we adopt a two-stage training strategy: in the first stage, the LLM parameters remain frozen and only the audio embedding layer and audio head are updated; in the second stage, all parameters except the LM embedding layer and LM head are trained (a parameter-freezing sketch follows).
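The freezing schedule can be expressed directly with `requires_grad` flags. The toy module below only illustrates the idea; the real model's module names and layout are assumptions.

```python
# Sketch of the two-stage freezing schedule with a toy model (PyTorch).
import torch.nn as nn

class ToyAudioLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lm_embedding = nn.Embedding(1000, 64)       # text embedding (frozen in both stages)
        self.lm_head = nn.Linear(64, 1000)               # text head (frozen in both stages)
        self.backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.audio_embedding = nn.Embedding(512, 64)     # trained in both stages
        self.audio_head = nn.Linear(64, 512)             # trained in both stages

def set_stage(model: nn.Module, stage: int) -> None:
    for name, param in model.named_parameters():
        if stage == 1:
            # Stage 1: LLM frozen, only audio embedding + audio head learn.
            param.requires_grad = name.startswith(("audio_embedding", "audio_head"))
        else:
            # Stage 2: train everything except the LM embedding and LM head.
            param.requires_grad = not name.startswith(("lm_embedding", "lm_head"))

model = ToyAudioLLM()
for stage in (1, 2):
    set_stage(model, stage)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {n_trainable} trainable parameters")
```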
67
 
68
 
69
+ ### Open-Source Evaluation Set
70
 
71
+ **OpenAudioBench**
72
 
73
+ To evaluate the model's "intelligence" more efficiently, we constructed OpenAudioBench, which comprises five end-to-end audio-understanding sub-sets: four public benchmarks (Llama Questions, Web Questions, TriviaQA, AlpacaEval) and a speech logical-reasoning set built by the Baichuan team, totaling 2,701 entries. Together they reflect the model's overall "intelligence" level.
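OpenAudioBench is hosted on the Hugging Face Hub at the link above, so it can be pulled with the `datasets` library. The configuration names below are guesses; list them first and adjust.

```python
# Hedged sketch: load OpenAudioBench sub-sets from the Hugging Face Hub.
# The sub-set (config) names are assumptions -- print the real ones first.
from datasets import get_dataset_config_names, load_dataset

repo_id = "baichuan-inc/OpenAudioBench"
print(get_dataset_config_names(repo_id))       # discover the five sub-sets

subset = load_dataset(repo_id, "alpaca_eval")  # hypothetical config name
print(subset)
```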
74
 
75
+ ### Model performance
76
  <div align="center">
77
+ <img src="https://github.com/baichuan-inc/Baichuan-Audio/raw/main/assets/result.png" width="90%">
78
  </div>
79
 
80
 
81
+ ### Acknowledgments
82
 
83
+ - Automatic Speech Recognition (ASR) model: [Whisper](https://github.com/openai/whisper)
84
+ - Large Language Model (LLM): [Qwen2.5-7B](https://arxiv.org/abs/2412.15115)
85
+ - Partial code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) and [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS/)
86
+ - HiFi-GAN vocoder from [CosyVoice 2.0](https://funaudiollm.github.io/cosyvoice2/)
87
 
88
 
89
  ### License
90
+ Use of the Baichuan-Audio-Base/Baichuan-Audio model weights must comply with the [License](https://huggingface.co/baichuan-inc/Baichuan-Audio/blob/main/LICENSE) and [Apache 2.0](https://github.com/baichuan-inc/Baichuan-Audio/blob/main/LICENSE).
91
 
92
  <!-- ### Citation
93
+ If you find our model/code/paper helpful, please give us a ⭐ and a citation 📝. Thank you!
94
 
 
95
  ```bib
96
+ @article{li2025baichuan,
97
+ title={Baichuan-Omni-1.5 Technical Report},
98
+ author={Li, Yadong and Liu, Jun and Zhang, Tao and Chen, Song and Li, Tianpeng and Li, Zehuan and Liu, Lijun and Ming, Lingfeng and Dong, Guosheng and Pan, Da and others},
99
+ journal={arXiv preprint arXiv:2501.15368},
100
+ year={2025}
101
+ }
102
+ ``` -->
103
  ```