arxiv:2504.09130

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Published on Apr 12
· Submitted by LibraTree on Apr 15
Authors:

Abstract

Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.

Community

Paper author · Paper submitter

šŸ“¢ VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

šŸ¤” Current LVLMs struggle with complex reasoning tasks like multi-hop geometry problems. How can AI agents construct and utilize more useful visual hints?

šŸ”‘ Key insight: When LVLMs perform reasoning, they need not only "WHAT to do" but also a mental model of "WHAT WILL HAPPEN after each action"! This gives LVLMs substantially stronger reasoning performance. #NextLevelAI šŸ¤–
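
To make the idea concrete, here is a minimal Python sketch of that propose-then-predict loop. All names here (`propose_action`, `predict_outcome`, `render_step`) are hypothetical stand-ins for LVLM and tool calls, not the paper's actual API:

```python
# Hypothetical sketch of "WHAT to do" + "WHAT WILL HAPPEN": before committing
# to a reasoning action (e.g., drawing an auxiliary line), the agent asks the
# model to predict the resulting state. None of these names come from the paper.

def reasoning_step(propose_action, predict_outcome, render_step, state, history):
    """One propose-predict-act step for a visual reasoning agent.

    propose_action(state, history) -> str        # candidate next step
    predict_outcome(state, action) -> str        # model's mental simulation
    render_step(state, action)     -> new state  # tool that applies the action
    """
    action = propose_action(state, history)      # WHAT to do
    prediction = predict_outcome(state, action)  # WHAT WILL HAPPEN
    new_state = render_step(state, action)       # commit the action
    history.append((action, prediction))
    return new_state
```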

Paper author · Paper submitter

Current methods focus on either visual-aided reasoning or test-time scaling. Our VisuoThink framework combines both and introduces a mechanism called look-ahead tree search.
ęˆŖå±2025-04-12 16.59.18.png

Paper author · Paper submitter

By exploring different trajectories and predicting what will happen next, LVLMs construct more reliable auxiliary lines when solving geometry problems and perform better on spatial reasoning tasks.
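
A rough sketch of how such a look-ahead tree search might work, assuming a small fixed expansion width and rollout depth (`propose`, `predict`, and `score` again stand in for LVLM calls; the paper's actual search procedure may differ):

```python
# Hedged sketch of look-ahead tree search over reasoning trajectories:
# expand a few candidate actions, roll each one forward with model-predicted
# outcomes, score the simulated end states, and commit only to the first
# action of the best branch. Widths, depths, and scoring are illustrative.

def lookahead_search(state, propose, predict, score, width=3, depth=2):
    """Return the first action of the highest-scoring simulated trajectory.

    propose(state, k)      -> list of k candidate actions
    predict(state, action) -> predicted next state (model's simulation)
    score(state)           -> float judging how promising the state is
    """
    best_action, best_value = None, float("-inf")
    for action in propose(state, k=width):       # branch on candidate steps
        sim = predict(state, action)             # look ahead one step
        for _ in range(depth - 1):               # continue the rollout
            sim = predict(sim, propose(sim, k=1)[0])
        value = score(sim)
        if value > best_value:
            best_action, best_value = action, value
    return best_action                           # commit only the first step
```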

ęˆŖå±2025-04-12 16.59.41.png

