<div align="center">
# MammothModa2: Jointly Optimized Autoregressive-Diffusion Models for Unified Multimodal Understanding and Generation
<img src='./doc/logo.png' alt="MammothModa Logo" width="100" style="max-width: 100px; height: auto;">
[![GitHub](https://img.shields.io/badge/MammothModa2-GitHub-blue)](https://github.com/bytedance/mammothmoda)
[![Project Page](https://img.shields.io/badge/MammothModa2-Project_Page-green)](https://ali-vilab.github.io/MammothModa-Page/)
[![HuggingFace](https://img.shields.io/badge/MammothModa2-HuggingFace_Model-yellow)](https://huggingface.co/bytedance-research/MammothModa2-Preview)
</div>
## Introduction
MammothModa2 is a unified Autoregressive-Diffusion (AR-Diffusion) framework for multimodal understanding and generation. The model adopts a serial architecture: the AR backbone uses MammothTok—a unified, language-aligned visual tokenizer—to perform complex semantic planning, which then conditions a high-fidelity Diffusion Decoder. Our core technical contribution is a unified joint training strategy that, for the first time in a serial AR-Diffusion system, simultaneously optimizes the discrete Next-Token Prediction (NTP) loss and the continuous Flow Matching loss. This end-to-end alignment between the planning and generation spaces enables MammothModa2 to achieve competitive performance across complex text-to-image generation, editing, and visual understanding benchmarks.
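Conceptually, the two objectives reduce to a single training loss. The sketch below is purely illustrative (it is not the release training code), and the tensor names (`ar_logits`, `pred_velocity`, etc.) are hypothetical:

```python
import torch.nn.functional as F

def joint_loss(ar_logits, target_tokens, pred_velocity, target_velocity, lambda_fm=1.0):
    """Illustrative joint objective: discrete NTP loss + continuous Flow Matching loss."""
    # Discrete branch: next-token prediction over the AR backbone's vocabulary.
    # ar_logits: (batch, seq_len, vocab), target_tokens: (batch, seq_len).
    ntp_loss = F.cross_entropy(ar_logits.flatten(0, 1), target_tokens.flatten())
    # Continuous branch: flow-matching velocity regression in the diffusion decoder.
    fm_loss = F.mse_loss(pred_velocity, target_velocity)
    # Both terms are optimized simultaneously, aligning planning and generation spaces.
    return ntp_loss + lambda_fm * fm_loss
```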
## Showcases
<div align="center">
<img src='./doc/mammoth.png' alt="MammothModa2 Showcases" style="max-width: 80%; height: auto;">
</div>
## 🎉 News
- [x] 2025-10-01: 🔥 MammothModa2-Preview models are now available on [HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Preview)
## 🪄 Models
| Model | Download Link | License |
|-------|---------------|----------|
| MammothModa2-Preview | [🤗 HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Preview) | [Apache-2.0](https://opensource.org/licenses/Apache-2.0) |
## ⚙️ Installation
The codebase has been tested with Python 3.11.9, CUDA 12.4, and PyTorch 2.6.0. You can set up the environment with uv using the following commands:
```bash
# Clone the repository
git clone https://github.com/bytedance/mammothmoda.git
cd mammothmoda
# Install dependencies
uv sync --frozen
```
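To verify the environment roughly matches the tested setup above, a quick sanity check (run it with `uv run python`):

```python
# Quick environment check; versions should roughly match the tested setup above.
import torch

print(torch.__version__)          # tested with 2.6.0
print(torch.version.cuda)         # tested with 12.4
print(torch.cuda.is_available())  # should be True for GPU inference
```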
## 🚀 Usage
### Text-to-Image Generation
```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from mammothmoda2.model import DEFAULT_NEGATIVE_PROMPT, Mammothmoda2Model
from mammothmoda2.utils import decode_diffusion_image
# Load the MammothModa2 model and processor.
model = Mammothmoda2Model.from_pretrained(
"bytedance-research/MammothModa2-Preview",
attn_implementation="flash_attention_2",
torch_dtype="bfloat16",
t2i_generate=True,
).to("cuda")
processor = AutoProcessor.from_pretrained(
"bytedance-research/MammothModa2-Preview",
t2i_generate=True,
ar_height=32,
ar_width=32,
)
# Preprocess the MammothModa2 inputs.
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "这张图片展示了一座现代化城市的美丽景象。画面中最显眼的是一座高耸入云的摩天大楼,其外立面在夕阳余晖的映照下显得格外醒目。周围环绕着多栋风格各异的高楼大厦,这些大楼的窗户透出点点灯光,显示出城市的繁华。左侧有一座带有绿色圆顶的建筑,造型独特。在建筑物前方的水面上,有几艘白色的帆船正在航行,给城市增添了一份灵动的气息。天空呈现出浪漫的粉色,可能是日出或日落时分,整个画面色彩柔和,充满了宁静与美好的氛围。",
},
],
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
num_images_per_prompt=4,
cfg_scale=6.0,
negative_prompt=DEFAULT_NEGATIVE_PROMPT,
padding=True,
padding_side="left",
return_tensors="pt",
    return_token_type_ids=False,  # Otherwise generate() would raise an error.
).to("cuda")
# Run MammothModa2 text-to-image generation.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
generated_ids, attention_mask = model.generate(**inputs)
diff_return_info = decode_diffusion_image(
input_ids=inputs.input_ids,
generated_ids=generated_ids,
attention_mask=attention_mask,
negative_ids=inputs.get("negative_ids", None),
negative_mask=inputs.get("negative_mask", None),
model=model,
tokenizer=processor.tokenizer,
output_dir="./mammothmoda2_t2i_release",
num_images_per_prompt=4,
text_guidance_scale=9.0,
vae_scale_factor=16,
cfg_range=(0.0, 1.0),
num_inference_steps=50,
height=1024,
width=1024,
)
```
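`decode_diffusion_image` writes the sampled images under `output_dir`. Assuming PNG files (the exact naming scheme is implementation-defined), they can be inspected with PIL:

```python
from pathlib import Path
from PIL import Image

# List the generated images and their resolutions.
for path in sorted(Path("./mammothmoda2_t2i_release").glob("*.png")):
    image = Image.open(path)
    print(path.name, image.size)
```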
### Multimodal Understanding
```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from mammothmoda2.model import Mammothmoda2Model
# Load the MammothModa2 model and processor.
model = Mammothmoda2Model.from_pretrained(
"bytedance-research/MammothModa2-Preview",
attn_implementation="flash_attention_2",
torch_dtype="bfloat16",
).to("cuda")
print(f"model.device={model.device}")
processor = AutoProcessor.from_pretrained("bytedance-research/MammothModa2-Preview")
# Preprocess the MammothModa2 inputs.
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
padding_side="left",
return_tensors="pt",
return_token_type_ids=False,
).to("cuda")
# Mammothmoda2 model generation and decoding.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
generated_ids = model.generate(**inputs)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
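The `padding=True, padding_side="left"` arguments above exist precisely to support batched inference. A sketch reusing the names from the example (the duplicated conversation is illustrative; any list of conversations works):

```python
# Batched inference: one chat template per conversation, a single processor call.
conversations = [messages, messages]  # illustrative: two copies of the conversation above
texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in conversations
]
image_inputs, video_inputs = process_vision_info(conversations)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    padding_side="left",
    return_tensors="pt",
    return_token_type_ids=False,
).to("cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    generated_ids = model.generate(**inputs)
```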
## 📊 Benchmark Results
| Model | Model Size | GenEval | DPGBench |
|-------|------------|---------|----------|
| **Generation** | | | |
| SDXL | - | 0.55 | 74.65 |
| DALL-E 3 | - | 0.67 | 83.50 |
| FLUX.1-dev | - | 0.67 | 84.00 |
| SD3.5-Medium* | - | 0.65 | 83.86 |
| **Unified** | | | |
| Emu3 | 8B | 0.66 | 80.60 |
| Janus-Pro | 7B | 0.80 | 84.19 |
| MetaQuery-XL | 7B + 1.6B | 0.80 | 82.05 |
| UniWorld-V1 | 7B + 12B | 0.84 | 81.38 |
| Blip3-o-8B | 7B + 1.4B | 0.84 | 81.60 |
| OmniGen2 | 3B + 4B | 0.86 | 83.57 |
| Ovis-U1 | 2.4B + 1.2B | 0.89 | 83.72 |
| UniPic2 | 7B + 2B | 0.90 | 83.79 |
| BAGEL | 7B + 7B | 0.88 | 85.07 |
| Show-o2 | 7B | 0.76 | 86.14 |
| GPT-4o | - | 0.84 | 86.23 |
| MammothModa2-Preview | 7B + (3B + 2B) | 0.85 | 87.1 |
**Note**: Model sizes in "A + B" format indicate separate understanding (A) and generation (B) parameters; models without "+" share parameters across both tasks. For MammothModa2-Preview's 7B + (3B + 2B) architecture, the 7B parameters handle understanding, while generation splits into 3B in the AR (MLLM backbone) and 2B in the DiT component.
## Acknowledgement
We are grateful to the following open-source projects:
- [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
## Citation
```bibtex
@misc{mammothmoda2025,
title = {MammothModa2: Jointly Optimized Autoregressive-Diffusion Models for Unified Multimodal Understanding and Generation},
author = {MammothModa Team},
year = {2025},
url = {https://github.com/bytedance/mammothmoda}
}
```