VLM4D: Towards Spatiotemporal Awareness in Vision Language Models Paper • 2508.02095 • Published Aug 4 • 8
The Unreasonable Effectiveness of Scaling Agents for Computer Use Paper • 2510.02250 • Published 17 days ago • 24
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning Paper • 2505.16186 • Published May 22 • 7
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models Paper • 2505.21523 • Published May 23 • 13
Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models Paper • 2506.00258 • Published May 30 • 3
PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models Paper • 2507.13428 • Published Jul 17 • 15
Agents of Change: Self-Evolving LLM Agents for Strategic Planning Paper • 2506.04651 • Published Jun 5 • 8
VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation Paper • 2206.08522 • Published Jun 17, 2022
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space Paper • 2505.15778 • Published May 21 • 18
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents Paper • 2504.00906 • Published Apr 1 • 26
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models Paper • 2310.03903 • Published Oct 5, 2023 • 1
Neuro-Symbolic Procedural Planning with Commonsense Prompting Paper • 2206.02928 • Published Jun 6, 2022
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models Paper • 2407.12366 • Published Jul 17, 2024 • 4
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing Paper • 2410.12836 • Published Oct 3, 2024 • 1
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research Paper • 1904.03493 • Published Apr 6, 2019