Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Paper • 2506.18898 • Published 3 days ago
CoMemo: LVLMs Need Image Context with Image Memory Paper • 2506.06279 • Published 20 days ago
ZeroGUI: Automating Online GUI Learning at Zero Human Cost Paper • 2505.23762 • Published 28 days ago
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024