Any-to-Any
Doctor-James committed on
Commit dff955e · verified · 1 Parent(s): b253ce3

Update README.md

Files changed (1)
  1. README.md +28 -3
README.md CHANGED
@@ -1,3 +1,28 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ tags:
+ - any-to-any
+ ---
+
+ ## Introduction
+ Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate the data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba delivers outstanding inference efficiency, achieving up to a 119.2× speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts.
+
+ Paper: https://arxiv.org/abs/2503.08686
+
+ Code: https://github.com/hustvl/OmniMamba
+
+ ## Citation
+ If you find OmniMamba useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:
+
+ ```bibtex
+ @misc{zou2025omnimambaefficientunifiedmultimodal,
+       title={OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models},
+       author={Jialv Zou and Bencheng Liao and Qian Zhang and Wenyu Liu and Xinggang Wang},
+       year={2025},
+       eprint={2503.08686},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2503.08686},
+ }
+ ```