Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Paper • 2507.07990 • Published 2 days ago • 29
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers Paper • 2506.23918 • Published 12 days ago • 76
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics Paper • 2506.00070 • Published May 29 • 28
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues Paper • 2506.00958 • Published Jun 1 • 20
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Paper • 2505.18842 • Published May 24 • 37
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Paper • 2505.18842 • Published May 24 • 37
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Paper • 2505.18842 • Published May 24 • 37 • 2
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms Paper • 2503.14427 • Published Mar 18 • 19
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild Paper • 2502.14892 • Published Feb 17 • 6
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild Paper • 2502.14892 • Published Feb 17 • 6 • 2
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild Paper • 2502.14892 • Published Feb 17 • 6