Ming-Omni: A Unified Multimodal Model for Perception and Generation
Abstract
Ming-Omni is a unified multimodal model with dedicated encoders and modality-specific routers that can process images, text, audio, and video and perform tasks such as speech and image generation, context-aware chatting, and versatile image editing.
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which together allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
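The abstract's central design idea is routing tokens from different modalities through a shared MoE backbone (Ling) with a separate router per modality. The sketch below is one plausible reading of that idea in PyTorch, not the authors' implementation; all names and hyperparameters (`ModalityRoutedMoE`, `num_experts`, `top_k`, the modality ids) are hypothetical assumptions made for illustration only.

```python
# Minimal sketch (assumption, not Ming-Omni's code): an MoE feed-forward block
# whose gating network is selected by each token's modality tag.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityRoutedMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2, num_modalities=4):
        super().__init__()
        self.top_k = top_k
        # Experts are shared across all modalities.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # One router per modality, e.g. text=0, image=1, audio=2, video=3 (hypothetical ids).
        self.routers = nn.ModuleList(
            nn.Linear(d_model, num_experts) for _ in range(num_modalities)
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer tag per token.
        gate_logits = torch.zeros(*x.shape[:-1], len(self.experts), dtype=x.dtype, device=x.device)
        for m, router in enumerate(self.routers):
            mask = modality_ids == m
            if mask.any():
                gate_logits[mask] = router(x[mask])  # each token is scored by its modality's router
        weights, expert_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; real MoE kernels dispatch tokens sparsely.
        for e, expert in enumerate(self.experts):
            selected = (expert_idx == e).to(weights.dtype)      # (batch, seq, top_k)
            w = (weights * selected).sum(dim=-1, keepdim=True)  # (batch, seq, 1)
            out = out + w * expert(x)
        return out

# Tokens from the per-modality encoders are concatenated into one sequence,
# with each token keeping a modality tag that selects its router.
tokens = torch.randn(2, 16, 512)
modality_ids = torch.randint(0, 4, (2, 16))
print(ModalityRoutedMoE()(tokens, modality_ids).shape)  # torch.Size([2, 16, 512])
```

In this sketch the experts are shared while each modality learns its own gating distribution over them, which is one way a single backbone could fuse heterogeneous token streams without separate per-task models.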
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction (2025)
- Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities (2025)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025)
- OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment (2025)
- HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation (2025)
- Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM (2025)
- HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation (2025)