Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Ming-Lite-Uni is an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and the fluid nature of its interactive process. Ming-Lite-Uni is currently in the alpha stage and will be further refined soon.
Thank you all for your continued interest and support! We truly appreciate your patience as we steadily advance our solutions and model performance. We're already making great progress and seeing promising results, and exciting updates will be coming soon—so stay tuned!
Why It Matters
Ming-Lite-Uni's unified architecture overcomes fundamental limitations of conventional approaches:
| Conventional Methods | Ming-Lite-Uni's Advantages |
| --- | --- |
| Modular Pipelines (CLIP/SigLIP + Diffusion Models) | End-to-End Unified Model: seamless understanding-generation integration |
| Discrete Token AR (limited visual grounding) | Continuous Token Space: native support for fine-grained visual concepts |
| Fixed-Resolution Processing (artifacts when upscaling) | Multi-Scale Adaptation: consistent quality across resolutions |
| Separate Editing Workflows (manual alignment required) | Dialog-Driven Control: natural-language-guided pixel-level editing |
| Understanding Bottlenecks (visual-semantic mismatch) | Joint Representation Learning: mutually enhanced comprehension and generation |
Key Enhancements
- Unified Visual Understanding & Generation Architecture. Ming-Lite-Uni achieves an average understanding score of 69.7 on the OpenCompass leaderboard, surpassing DeepSeek-VL2 (66.4). At the same time, it achieves an image generation score of 0.62 on the GenEval benchmark, outperforming SDXL (0.55).
- Multi-Scale Learnable Tokens. We employ a novel mechanism to establish feature correlations across resolutions of 4×/8×/16×. By introducing hierarchical tokens, the model captures global layout (low-res), object structures (mid-res), and fine textures (high-res), improving GenEval by 3.5%.
- Multi-Scale Representation Alignment. We introduce a novel scale-wise consistency loss that aligns hierarchical representations with the final outputs through native-resolution optimization. This strategy directly improves high-resolution reconstruction quality (>2 dB PSNR) and boosts GenEval by 1.5%. A minimal sketch of the multi-scale tokens and this alignment loss follows this list.
- AGI-Capable System. Our model supports complex chained operations, such as "generate castle → add sunset → adjust perspective", with a response time of under 1 second (benchmarked on an RTX 4090). The system is designed for instruction-driven generation and editing, in step with ChatGPT-4o's image-generation capabilities (the industry milestone of March 2025).
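As referenced in the list above, the snippet below sketches one way hierarchical query tokens and a scale-wise consistency loss could be assembled in PyTorch. It is a minimal illustration written for this README, not the released implementation: the names (`MultiScaleTokens`, `scale_wise_alignment_loss`), the initialization scale, and the mean-squared alignment against average-pooled native-resolution features are all assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleTokens(nn.Module):
    """Learnable query tokens at three grid resolutions (illustrative only).

    Low-resolution tokens capture global layout, mid-resolution tokens capture
    object structure, and high-resolution tokens capture fine textures.
    """

    def __init__(self, dim: int, grids=(4, 8, 16)):
        super().__init__()
        self.grids = grids
        self.tokens = nn.ParameterList(
            [nn.Parameter(torch.randn(g * g, dim) * 0.02) for g in grids]
        )

    def forward(self, batch_size: int) -> torch.Tensor:
        # Concatenate all scales into one query sequence: [B, 4^2 + 8^2 + 16^2, D]
        seq = torch.cat(list(self.tokens), dim=0)
        return seq.unsqueeze(0).expand(batch_size, -1, -1)


def scale_wise_alignment_loss(features_per_scale, native_features):
    """Scale-wise consistency loss (sketch): align each scale's token map with
    the native-resolution feature map, downsampled to that scale.

    features_per_scale: list of [B, g*g, C] tensors, one entry per scale.
    native_features:    [B, C, H, W] reference features at native resolution.
    """
    loss = 0.0
    for feats in features_per_scale:
        b, n, c = feats.shape
        g = int(n ** 0.5)
        fmap = feats.transpose(1, 2).reshape(b, c, g, g)      # tokens -> g x g map
        ref = F.adaptive_avg_pool2d(native_features, (g, g))  # downsample reference
        loss = loss + F.mse_loss(fmap, ref)
    return loss / len(features_per_scale)


# Toy usage: 16 + 64 + 256 = 336 query tokens per sample
tokens = MultiScaleTokens(dim=32)
queries = tokens(batch_size=2)                               # [2, 336, 32]
feats = [torch.randn(2, g * g, 32) for g in (4, 8, 16)]
target = torch.randn(2, 32, 16, 16)
print(scale_wise_alignment_loss(feats, target))
```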
Empowering Multimodal Interaction with Ming-Lite-Uni
Ming-Lite-Uni acts as a unified multimodal model that extends beyond traditional NLP tasks and multimodal comprehension to enable interactive multimodal generation, including image generation, image editing, and style transfer.
Model Structure
Ming-Lite-Uni is a unified multimodal model designed for both image understanding and high-fidelity image generation. It achieves this by compressing image representations into continuous visual tokens, which are processed alongside discrete text tokens using a scaled auto-regressive Transformer. The generation capability is powered by an externally trained diffusion model (SANA), conditioned on tokens produced by the Transformer.
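The data flow described above can be summarized in a short sketch. Everything below is a toy stand-in (nn.Linear / nn.Embedding stubs with made-up dimensions), not the released interfaces; in the actual model the MLLM backbone is fixed, the continuous visual tokens come from its vision tower, and a learnable SANA-based diffusion model replaces the linear "decoder" used here.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; all names and sizes are illustrative.
D = 64                                            # shared token width
vision_encoder = nn.Linear(3 * 16 * 16, D)        # continuous visual tokens (stub)
text_embedding = nn.Embedding(1000, D)            # discrete text tokens (stub)
backbone_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
ar_backbone = nn.TransformerEncoder(backbone_layer, num_layers=2)  # stand-in for the AR Transformer
diffusion_decoder = nn.Linear(D, 3 * 16 * 16)     # stand-in for the SANA diffusion decoder

def generate(image_patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    """Schematic flow: visual + text tokens -> AR Transformer -> conditioning
    tokens -> diffusion decoder (not the released interface)."""
    visual_tokens = vision_encoder(image_patches)           # [B, Nv, D], continuous
    text_tokens = text_embedding(text_ids)                  # [B, Nt, D], from discrete ids
    sequence = torch.cat([visual_tokens, text_tokens], 1)   # one unified token sequence
    hidden = ar_backbone(sequence)                          # fixed MLLM in the real model
    condition = hidden[:, -text_tokens.shape[1]:]           # tokens that condition generation
    return diffusion_decoder(condition)                     # learnable diffusion model in practice

# Shape check with dummy inputs
imgs = torch.randn(2, 5, 3 * 16 * 16)    # 2 samples, 5 "visual patches"
txt = torch.randint(0, 1000, (2, 7))     # 2 samples, 7 text token ids
print(generate(imgs, txt).shape)         # torch.Size([2, 7, 768])
```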
Benchmark Evaluations
We conduct separate quantitative evaluations of Ming-Lite-Uni on multimodal understanding and text-to-image generation using public benchmarks. For multimodal understanding, we compare against traditional models that take images and text as input and output text, as well as against recent models with visual generative capabilities. For multimodal generation, we evaluate text-to-image performance on GenEval. Please refer to our TechReport for details.
Multimodal Understanding
| Type | Model | Avg. | MMB | MMS | MMMU | MathV | Hall | AI2D | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only | LLaVA-72B | 68.0 | 84.5 | 65.8 | 56.6 | 68.4 | 47.9 | 86.2 | 60.6 |
| | Qwen2.5-VL-7B | 76.2 | 87.8 | 71.1 | 67.9 | 70.8 | 58.8 | 88.2 | 76.7 |
| | Emu3-Chat | - | 58.5 | - | 31.6 | - | - | - | 37.2 |
| | InternVL2.5-78B | 75.2 | 87.5 | 69.5 | 70.0 | 71.4 | 57.4 | 89.1 | 71.8 |
| | DeepSeek-VL2 | 66.4 | 81.2 | 61.0 | 50.7 | 59.4 | 51.5 | 84.5 | 60.0 |
| | GPT-4o-20241120 (closed) | 72.0 | 84.3 | 65.1 | 70.7 | 59.9 | 56.2 | 84.9 | 74.5 |
| | Step-1o (closed) | 77.7 | 87.3 | 69.3 | 69.9 | 74.7 | 55.8 | 89.1 | 82.8 |
| Und. and Gen. | TokenFlow-XL | - | 68.9 | - | 38.7 | - | - | - | 40.7 |
| | Janus-Pro-7B | - | 79.2 | - | 41.0 | - | - | - | 50.0 |
| | Ours (Ming-Lite-Uni) | 69.7 | 80.7 | 60.5 | 51.2 | 68.3 | 51.8 | 84.5 | 72.3 |
Image Generation
| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| | SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| | Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| | SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| | DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| | SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Und. and Gen. | Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| | TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
| | Ours (Ming-Lite-Uni) | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 |
Example Usage
System Requirements
- Python: >= 3.8
- PyTorch: >= 2.4.1 (CUDA 12.2 compatible build)
- flash-attn: >= 2.6.3
Installation
We recommend setting up your environment with pip, using the pinned versions in requirements.txt:

```bash
pip install -r requirements.txt
```
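After installation, a quick sanity check can confirm that a CUDA build of PyTorch and flash-attn are importable. This is a minimal sketch; the `flash_attn` module name comes from the pip package, and your exact versions may differ:

```python
import torch

print("torch:", torch.__version__)             # expect >= 2.4.1
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)     # expect a CUDA 12.x build

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expect >= 2.6.3
except ImportError:
    print("flash-attn is not installed; install it per requirements.txt")
```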
Below is an example of how to load the model and run instruction-based image editing:
```python
import os

import torch

from Ming_Uni.MingUniInference import Ming_Uni_Inference
from Ming_Uni.process import MyProcessor

# Select the current CUDA device
device = torch.device(torch.cuda.current_device())

# Load the unified model in bfloat16 on the GPU
model_path = '../Ming-Lite-Uni/'
model = Ming_Uni_Inference(model_path)
model.to(torch.bfloat16)
model.to(device)
model.eval()

# Build the processor from the bundled Qwen2.5 LLM directory
llm_model = os.path.join(model_path, 'qwen2_5_llm')
my_proc = MyProcessor(llm_model)

# Instruction-based editing: add a candle to the input image
image_file = "tests/cake.jpg"
prompt = "add a candle on top of the cake"
inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)

# Generate a 512x512 result with 30 diffusion steps and CFG scale 5.0
result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
result.save("result.png")
```
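Pure text-to-image generation follows the same pattern, continuing from the snippet above. Note that passing `image_file=None` is an assumption made for illustration; consult the repository for the exact interface:

```python
# Hypothetical text-to-image call: whether process() accepts image_file=None
# is an assumption, not a documented API; check the repository for details.
prompt = "a medieval castle at sunset, watercolor style"
inputs = my_proc.process(image_file=None, prompt=prompt, device=device)
result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
result.save("castle.png")
```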
For more advanced usage, such as fine-tuning or generating images, refer to the documentation.
Link to the code: https://github.com/inclusionAI/Ming