Abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both the audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as OmniBench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
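As a rough, conceptual illustration of the time-aligned interleaving that the abstract describes (this is not the authors' implementation), the sketch below assigns each audio and video frame a shared time-based position and merges the two streams chronologically. The frame rates, data structure, and function names are assumptions made for the example.

```python
# Conceptual sketch of time-aligned interleaving (TMRoPE-style temporal positions).
# Frame rates and names are illustrative assumptions, not the paper's actual values.
from dataclasses import dataclass

@dataclass
class Frame:
    modality: str   # "audio" or "video"
    index: int      # frame index within its own stream
    time_s: float   # timestamp used as the shared temporal position

def interleave_by_time(num_audio, num_video, audio_fps=25.0, video_fps=2.0):
    """Merge audio and video frames into one sequence ordered by timestamp."""
    audio = [Frame("audio", i, i / audio_fps) for i in range(num_audio)]
    video = [Frame("video", i, i / video_fps) for i in range(num_video)]
    merged = sorted(audio + video, key=lambda f: f.time_s)
    # The temporal component of the position embedding would come from time_s,
    # so audio and video frames that co-occur share the same temporal index.
    return [(f.modality, f.index, round(f.time_s, 3)) for f in merged]

if __name__ == "__main__":
    for item in interleave_by_time(num_audio=10, num_video=3):
        print(item)
```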
Community
Qwen Chat: https://chat.qwenlm.ai
Blog: https://qwenlm.github.io/blog/qwen2.5-omni/
GitHub: https://github.com/QwenLM/Qwen2.5-Omni
Hugging Face: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
Hugging Face Space: https://huggingface.co/spaces/Qwen/Qwen2.5-Omni-7B-Demo
ModelScope: https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B
DashScope: https://help.aliyun.com/zh/model-studio/user-guide/qwen-omni
ModelScope Demo: https://modelscope.cn/studios/Qwen/Qwen2.5-Omni-Demo
Hi
Great model, and I hope this architecture becomes the standard for LLMs!
The problem has been the wrappers used by the transformers library:
we need to be able to stack preprocessors on top of each other, such as an image processor, a video processor, and an audio processor (a rough sketch of this idea follows below).
This is essentially what you have pioneered here.
I would like to request that you make the model more universal!
The model creation process seems to be locked to the Qwen models only, when in fact it could be unlocked for all Llama-based models, since most models (Mistral, Llama, etc.) are descendants of that architecture. These already highly trained models should be usable as the text segment of the omni model.
Likewise, Whisper/wav2vec is used as the audio component and CLIP/ViT as the imaging component; these parent architectures can be considered standardized, so they should be enabled for generating a pretrained configuration, perhaps with additional configuration for the cross-attention.
The Qwen models can be seen as part of the LLaVA model family, which did allow Llama models to be configured for training with an image processor; the LLaVA-NeXT models enabled this feature as well, as did the OneVision models. I hope the omni models can be just as universal as those.
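To make the stacked-preprocessor request concrete, here is a minimal sketch that composes existing transformers components behind one call. The `OmniProcessor` class, its interface, and the placeholder checkpoint name are hypothetical and not part of the transformers library.

```python
# Hypothetical composite preprocessor stacking existing transformers components.
# The OmniProcessor class and its interface are illustrative, not a transformers API.
from transformers import AutoTokenizer, CLIPImageProcessor, WhisperFeatureExtractor

class OmniProcessor:
    def __init__(self, text_model="meta-llama/Llama-2-7b-hf"):  # placeholder checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.image_processor = CLIPImageProcessor()        # vision-tower input
        self.audio_processor = WhisperFeatureExtractor()   # audio-tower input

    def __call__(self, text=None, images=None, audio=None, sampling_rate=16000):
        batch = {}
        if text is not None:
            batch.update(self.tokenizer(text, return_tensors="pt"))
        if images is not None:
            batch["pixel_values"] = self.image_processor(
                images=images, return_tensors="pt"
            )["pixel_values"]
        if audio is not None:
            batch["input_features"] = self.audio_processor(
                audio, sampling_rate=sampling_rate, return_tensors="pt"
            )["input_features"]
        return batch
```

The idea is that any Llama-style text backbone could be paired with the same standardized audio and vision front-ends, with only the cross-attention/projection configuration left to specify.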
We still face issues with other providers taking time to enable easy creation of GGUF files,
which is what lets external platforms run these quantized models. Will you also release a quantization library for these vision / audio / omni modalities to solve this issue?
Please explain whether you will be making the model truly universal by implementing such ideas,
as in truth you have solved what most people have been waiting for!
With a universal configuration we can expect to see more real-time applications for the AI, such that sampling webcam input and accepting microphone input, streamed to the models, would be available with the omni models!
The inputs can then be processed in a single model instead of the model stack people have been forced to use, which lets GPU providers and model providers skim the cream in sales of units and services,
when in fact a good 7B model can perform as well as a 70B model!
Another note: the models can already produce image output!
Image output is possible because the model was already trained on captions, so we should be able to produce a base64 image (as text) as output and reconvert the base64 string into the generated image. For my personal Mistral I trained this successfully, but only the training set was regenerated, so it would need mass training on base64 images to be able to regenerate such images or generalize to new ones. (A round-trip sketch of this idea follows below.)
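As a concrete round trip of the base64-image idea described above, the snippet below encodes an image file to base64 text (what the model would emit) and decodes it back; the file paths are placeholders.

```python
# Round-trip sketch of the base64-image idea; paths are placeholders.
import base64
import io
from PIL import Image

def image_to_base64(path: str) -> str:
    """Encode an image file as a base64 text string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def base64_to_image(b64_text: str) -> Image.Image:
    """Decode the base64 text back into a PIL image."""
    return Image.open(io.BytesIO(base64.b64decode(b64_text)))

if __name__ == "__main__":
    b64 = image_to_base64("example.png")   # placeholder input
    img = base64_to_image(b64)
    img.save("reconstructed.png")           # placeholder output
```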
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction (2025)
- Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision (2025)
- Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction (2025)
- Megrez-Omni Technical Report (2025)
- Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models (2025)
- DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities (2025)
- SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend