arxiv:2506.04141

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Published on Jun 4
Submitted by tsq2000 on Jun 5
AI-generated summary

A new benchmark, MMR-V, is proposed to challenge multimodal large language models with long-range, multi-frame reasoning and hidden information processing in videos, revealing their limitations and inspiring further research.

Abstract

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated, with reference to extensive real-world user interpretations, to align with common perceptions. (4) Confusability: carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning-enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded by multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
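
To make the reported metric concrete, here is a minimal sketch of how multiple-choice accuracy could be scored on MMR-V-style tasks. The field names ("question", "options", "answer") and the predict callable are assumptions for illustration, not the benchmark's actual schema.

```python
# Minimal sketch: multiple-choice accuracy over MMR-V-style tasks.
# Field names ("video", "question", "options", "answer") and predict() are
# hypothetical; consult the released dataset for the real schema.
def evaluate(tasks, predict):
    """tasks: iterable of dicts, each with a video path, question, options,
    and a ground-truth option letter. predict: callable taking a task dict
    and returning an option letter such as 'B'."""
    correct = 0
    for task in tasks:
        # Model call: sample frames, build a prompt, decode an option letter.
        pred = predict(task)
        correct += int(pred.strip().upper() == task["answer"].strip().upper())
    return correct / len(tasks)

# Example: 52.5% accuracy on 1,257 tasks corresponds to roughly 660 correct answers.
```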

Community

Paper author and submitter

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to 🕵️locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. Models like o3 and o4-mini have achieved impressive results on "Think with Images" tasks, which require models to 🕵️mine evidence within an image. Similarly, tasks in MMR-V require models to perform in-depth reasoning over visual information from different frames of a video, challenging their ability to 🕵️mine evidence across long-range, multi-frame contexts ("Think with Video").
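
As a toy illustration of "Think with Video", the sketch below samples frames uniformly across the whole video (rather than only near the question frame) so an MLLM is shown long-range, multi-frame evidence. The 32-frame budget and the use of OpenCV are assumptions, not the paper's actual pipeline.

```python
import cv2  # assumption: OpenCV for decoding; the paper's pipeline may differ


def sample_frames(video_path, num_frames=32):
    """Uniformly sample frames across the full video so that evidence far
    from the question frame is still visible to the model."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(round(i * step)))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # list of RGB arrays, ready to pass as an MLLM's image inputs
```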

