arxiv:2505.03821

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Published on May 3
· Submitted by Gracjan on May 8
Abstract

We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as the object's position relative to the minifigure and the minifigure's orientation - and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel at scene understanding, performance declines significantly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols into future VLM development.
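
To make the benchmark's structure concrete, here is a minimal Python sketch of how the 144 scene configurations and the 7-question protocol could be enumerated and scored. The factor levels (9 object positions × 8 orientations × 2 views) and the per-level question split (2 + 2 + 3) are assumptions chosen only so the counts match the abstract; the paper's actual design may differ, and `ask_vlm` is a hypothetical stand-in for a real model call.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable

# Assumed factor levels: 9 positions x 8 orientations x 2 views = 144 tasks.
POSITIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW", "center"]  # object relative to minifigure
ORIENTATIONS = list(range(0, 360, 45))                              # minifigure facing, in degrees
VIEWS = ["birds_eye", "surface_level"]

# Assumed split of the 7 diagnostic questions across the three levels.
QUESTION_LEVELS = {
    "scene_understanding": 2,
    "spatial_reasoning": 2,
    "perspective_taking": 3,
}

@dataclass
class VisualTask:
    object_position: str
    figure_orientation_deg: int
    view: str

def build_tasks() -> list[VisualTask]:
    """Enumerate the controlled scene configurations (144 in this sketch)."""
    return [VisualTask(p, o, v) for p, o, v in product(POSITIONS, ORIENTATIONS, VIEWS)]

def evaluate(tasks: list[VisualTask],
             ask_vlm: Callable[[VisualTask, str, int], bool]) -> dict[str, float]:
    """Compute per-level accuracy; `ask_vlm` should return True when the model
    answers a given question about a given task correctly."""
    correct = {level: 0 for level in QUESTION_LEVELS}
    total = {level: 0 for level in QUESTION_LEVELS}
    for task in tasks:
        for level, n_questions in QUESTION_LEVELS.items():
            for q_idx in range(n_questions):
                total[level] += 1
                correct[level] += int(ask_vlm(task, level, q_idx))
    return {level: correct[level] / total[level] for level in QUESTION_LEVELS}

if __name__ == "__main__":
    tasks = build_tasks()
    print(f"{len(tasks)} tasks x {sum(QUESTION_LEVELS.values())} questions each")
    # Dummy model that always answers correctly -- replace with a real VLM call.
    print(evaluate(tasks, lambda task, level, q_idx: True))
```

Reporting accuracy per level, rather than one aggregate score, is what lets the evaluation separate scene understanding from spatial reasoning and perspective taking, as in the abstract.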

Community

Paper author · Paper submitter

We're exploring the limits of VLM spatial reasoning with "Beyond Recognition." Our new paper introduces a benchmark using controlled humanoid-object scenes to test visual perspective taking. While models like GPT-4o and Llama-3.2 nail scene understanding, their ability to take the humanoid's perspective drops sharply. Check it out!
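
For readers who want to try a perspective-taking query themselves, here is a minimal sketch of posing one question about one scene to GPT-4o via the OpenAI Python SDK. The image filename and question wording are illustrative placeholders, not the paper's actual prompts.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder scene image; the paper's rendered scenes and prompts may differ.
with open("scene_042.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "From the minifigure's point of view, is the object on its "
                     "left or its right? Answer with 'left' or 'right'."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```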


Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 1