arxiv:2505.13180

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Published on May 19
· Submitted by merlerm on May 20

Abstract

Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
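To make the two evaluation paradigms concrete, here is a minimal sketch (not the ViPlan code) of what VLM-grounded symbolic planning versus direct VLM planning can look like. All function names, prompt wording, and the `planner` callable are hypothetical stand-ins chosen for illustration.

```python
# Minimal sketch of the two paradigms compared in the paper.
# Everything here is a hypothetical stand-in, not the ViPlan implementation.

from typing import Callable

# Stand-in for a VLM call: takes an image and a text prompt, returns a string.
# In practice this would wrap an API or a local open-source model.
VLM = Callable[[bytes, str], str]


def ground_predicates(vlm: VLM, image: bytes, predicates: list[str]) -> dict[str, bool]:
    """VLM-as-grounder: ask the model a yes/no question for each symbolic predicate."""
    state = {}
    for pred in predicates:
        answer = vlm(image, f"Looking at the image, is the predicate '{pred}' true? Answer yes or no.")
        state[pred] = answer.strip().lower().startswith("yes")
    return state


def symbolic_planning_step(vlm: VLM, image: bytes, predicates: list[str],
                           goal: set[str], planner: Callable) -> list[str]:
    """Ground the current state with the VLM, then delegate plan search to a
    classical symbolic planner (treated as a black box here)."""
    state = ground_predicates(vlm, image, predicates)
    true_facts = {p for p, v in state.items() if v}
    return planner(true_facts, goal)


def direct_vlm_planning_step(vlm: VLM, image: bytes, goal: str, actions: list[str]) -> str:
    """VLM-as-planner: the model proposes the next action directly from the image."""
    prompt = (f"Goal: {goal}\n"
              f"Available actions: {', '.join(actions)}\n"
              "Which single action should be taken next? Answer with the action name only.")
    return vlm(image, prompt).strip()
```

In the first paradigm the VLM is only responsible for perception (mapping pixels to predicate truth values) and correctness of the plan is guaranteed by the symbolic planner; in the second, the VLM carries the full planning burden itself.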

Community

Paper author and submitter

The paper explores planning with VLMs, both using them to generate actions directly and as grounders for classical planners. We propose two tasks to benchmark state-of-the-art VLMs: a classical planning problem (Blocksworld) and a household robotics simulator. We find that VLM-as-planner methods work well in the household environment, where they generate coherent plans, but fail in Blocksworld, where the goals are more abstract. This suggests that VLMs can imitate an emergent world model on the household task, but that this ability does not generalize. We further test the impact of CoT prompting and, surprisingly, find it has little to no effect, adding evidence to the claim that VLMs cannot reason as well as LLMs.
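As a rough illustration of the prompting comparison mentioned above, the sketch below shows how a direct action prompt and a Chain-of-Thought variant might be constructed. The wording and the `build_prompt` helper are hypothetical and are not the paper's actual templates.

```python
# Hypothetical illustration of the two prompting variants compared in the paper:
# a direct action prompt vs. a Chain-of-Thought (CoT) prompt that asks the model
# to reason step by step before committing to an action.

def build_prompt(goal: str, actions: list[str], use_cot: bool) -> str:
    base = (f"Goal: {goal}\n"
            f"Available actions: {', '.join(actions)}\n")
    if use_cot:
        # CoT variant: elicit intermediate reasoning before the final answer.
        return base + ("Think step by step about the current scene and the goal, "
                       "then end your answer with 'Action: <name>'.")
    # Direct variant: ask for the action only.
    return base + "Answer with the name of the next action only."


# Example usage with Blocksworld-style actions (names chosen for illustration):
print(build_prompt("stack the red block on the blue block",
                   ["pick-up", "put-down", "stack", "unstack"], use_cot=True))
```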

