CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification Paper • 2508.21046 • Published 29 days ago • 9
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers Paper • 2501.16297 • Published Jan 27 • 1
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts Paper • 2506.10357 • Published Jun 12 • 21