arxiv:2506.05412

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Published on Jun 4
· Submitted by Zory on Jun 12
Abstract

Vision Language Models struggle with gaze-referential inference compared to humans, with most showing near-random guessing behavior; top-tier models perform above chance but remain sensitive to task difficulty.

AI-generated summary

Gaze-referential inference--the ability to infer what others are looking at--is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. The VLMs even selected each response option almost equally often. Are they guessing at random? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across prompts and scene objects. These behavioral features cannot be explained by treating them as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze-inference capability, have yet to become technologies that can interact naturally with humans, but the potential remains.
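As a rough illustration of the above-chance comparison the study relies on, the sketch below runs an exact one-sided binomial test of a model's accuracy against a chance baseline. The trial count, chance level, and observed accuracy here are invented for illustration, not taken from the paper.

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact one-sided tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical setup: a 4-choice task (chance = 0.25) with 400 trials
# and an observed accuracy of 0.31 -- all numbers are illustrative.
n_trials = 400
chance = 0.25
accuracy = 0.31
k_correct = round(accuracy * n_trials)

# p-value for "accuracy is no better than chance" (one-sided).
p_value = binom_sf(k_correct, n_trials, chance)
above_chance = p_value < 0.05
```

A model whose choices are spread uniformly across options would fail this test in expectation, which is the sense in which most of the 111 VLMs are indistinguishable from random guessers.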

Community

Paper author and submitter:

Knowing where someone looks is key to a theory of mind. We test 111 VLMs and 65 humans and compare their inferences. Our controlled study reveals a substantial performance gap between top-tier Vision Language Models (VLMs) and humans, as well as behavioral patterns in VLMs' responses suggesting they are not simply guessing. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variations. VLMs may rely on head orientation rather than eye gaze direction, making them less sensitive to side views, which increase the geometric ambiguity of eye direction; however, this heuristic also makes them much less accurate.

