InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy Paper • 2510.13778 • Published Oct 2025
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation Paper • 2507.17520 • Published Jul 23, 2025
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations Paper • 2406.09401 • Published Jun 13, 2024
Unified Generative and Discriminative Training for Multi-modal Large Language Models Paper • 2411.00304 • Published Nov 1, 2024