Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues • Paper • 2506.00958 • Published Jun 1 • 20
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation • Paper • 2505.18842 • Published May 24 • 37
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! • Paper • 2410.01023 • Published Oct 1, 2024 • 2
G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness • Paper • 2505.05026 • Published May 8 • 16
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms • Paper • 2503.14427 • Published Mar 18 • 19
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation • Paper • 2410.13232 • Published Oct 17, 2024 • 45