Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
Abstract
Vision Language Models struggle with gaze-referential inference compared to humans, with most performing near chance; top-tier models exceed chance but remain sensitive to task difficulty.
Gaze-referential inference, the ability to infer what others are looking at, is a critical component of theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photographs with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed response behavior with mixed-effects models. We found that 94 of the 111 VLMs failed to perform better than random guessing, while humans achieved near-ceiling accuracy; these VLMs even selected each response option with nearly equal frequency. Are they simply guessing at random? Although most VLMs struggle, when we zoom in on five top-tier VLMs with above-chance performance, we find that their accuracy declines as task difficulty increases but varies only slightly across prompts and scene objects. These behavioral signatures cannot be explained by treating the models as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty yet robust to perceptual variation. This suggests that VLMs, still lacking genuine gaze-inference capability, are not yet technologies that can interact naturally with humans, but the potential remains.
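The two analyses described above, comparing each model's accuracy against chance and relating trial-level correctness to task difficulty with a mixed-effects model, can be sketched as follows. This is a minimal illustration rather than the paper's actual pipeline: the CSV file, column names, placeholder model names, and the 0.25 chance level (assuming a four-alternative choice) are all assumptions.

```python
import pandas as pd
from scipy import stats
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical trial-level results: one row per (model, trial) with a binary
# "correct" flag, a numeric "difficulty" level, the prompt variant, and image id.
df = pd.read_csv("gaze_eval_trials.csv")

# 1) Per-model exact binomial test against random guessing.
CHANCE = 0.25  # assumed chance level for a four-alternative forced choice
for model, g in df.groupby("model"):
    k, n = int(g["correct"].sum()), len(g)
    p = stats.binomtest(k, n, CHANCE, alternative="greater").pvalue
    print(f"{model}: accuracy={k / n:.3f}, p(above chance)={p:.4f}")

# 2) Mixed-effects logistic regression on the above-chance models only:
#    difficulty and prompt as fixed effects, image as a random intercept.
top_models = ["model_a", "model_b"]  # placeholder names for top-tier VLMs
top = df[df["model"].isin(top_models)]
fit = BinomialBayesMixedGLM.from_formula(
    "correct ~ difficulty + C(prompt)",
    {"image": "0 + C(image_id)"},
    data=top,
).fit_vb()
print(fit.summary())
```

A negative fixed-effect coefficient on difficulty, together with small prompt effects and a modest image variance component, would match the pattern the abstract reports for the top-tier models.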
Community
Knowing where someone is looking is key to a theory of mind. We test 111 VLMs and 65 humans and compare their inferences. Our controlled study reveals a substantial performance gap between top-tier Vision Language Models (VLMs) and humans, as well as behavioral patterns in the VLMs' responses suggesting they are not simply guessing. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty yet robust to perceptual variation. VLMs may rely on head orientation rather than eye-gaze direction, which would make them less sensitive to side views that increase the geometric ambiguity of eye direction; however, this heuristic also makes them substantially less accurate.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models (2025)
- Caption This, Reason That: VLMs Caught in the Middle (2025)
- Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models (2025)
- Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs (2025)
- IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests (2025)
- CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting (2025)
- Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? (2025)