VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Abstract
VideoDeepResearch, an agentic framework pairing a text-only large reasoning model with modular multimodal tools, surpasses existing MLLM baselines on long video understanding without extending context windows or enhancing visual perception capabilities.
Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
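The agentic loop described in the abstract admits a compact illustration. The Python sketch below shows one plausible control flow, assuming hypothetical `VideoRetriever`, `VisualPerceiver`, and `reason` interfaces; these names and stubs are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of an agentic LVU loop, assuming hypothetical tool
# interfaces. A text-only reasoner alternates between deciding the next
# tool call and producing a final answer from accumulated text evidence.

from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str       # which tool the reasoner wants to invoke
    argument: str   # e.g., a text query or a clip identifier

class VideoRetriever:
    """Hypothetical multimodal retriever: maps a text query to candidate clips."""
    def search(self, query: str) -> list[str]:
        return ["clip_017", "clip_042"]  # placeholder clip IDs

class VisualPerceiver:
    """Hypothetical visual perceiver: produces a text description of one clip."""
    def describe(self, clip_id: str) -> str:
        return f"[caption of {clip_id}]"  # placeholder caption

def reason(question: str, evidence: list[str]) -> ToolCall | str:
    """Stand-in for the text-only LRM: chooses the next tool call,
    or returns a final answer once the evidence suffices."""
    if not evidence:
        return ToolCall("retrieve", question)    # first, locate relevant clips
    if len(evidence) < 3:
        return ToolCall("perceive", "clip_017")  # then, inspect a candidate clip
    return f"Answer derived from {len(evidence)} pieces of evidence."

def video_deep_research(question: str, max_steps: int = 8) -> str:
    retriever, perceiver = VideoRetriever(), VisualPerceiver()
    evidence: list[str] = []
    for _ in range(max_steps):
        step = reason(question, evidence)
        if isinstance(step, str):           # the reasoner produced a final answer
            return step
        if step.name == "retrieve":
            evidence.extend(retriever.search(step.argument))
        elif step.name == "perceive":
            evidence.append(perceiver.describe(step.argument))
    return "No answer within the step budget."

print(video_deep_research("What causes the explosion at the end of the video?"))
```

The design choice mirrored here is that the reasoner itself never ingests video: it sees only text (the question plus tool outputs), so the context window holds selectively retrieved evidence rather than the full frame sequence.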
Community
The following similar papers were recommended by the Semantic Scholar API:
- VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning (2025)
- ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding (2025)
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? (2025)
- VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models (2025)
- VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? (2025)
- Unleashing Hour-Scale Video Training for Long Video-Language Understanding (2025)