arXiv:2507.13405

COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

Published on Jul 17, 2025
Abstract

AI-generated summary: The COREVQA benchmark evaluates VLMs on visual entailment in crowded images, revealing significant performance gaps.

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment, i.e., to accept or refute a hypothesis based on an image. To address this, we propose COREVQA (Crowd Observation and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements, with images derived from the CrowdHuman dataset, designed to probe visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%–69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-question pairs in crowded scenes.
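The evaluation protocol the abstract describes, scoring a VLM on image/true-false statement pairs, can be sketched in a few lines. This is a minimal illustration only: the dataset identifier, split name, and field names (image, statement, label) are assumptions, not taken from the paper, and the ViLT model is a stand-in for whichever VLM is being evaluated.

```python
# Hedged sketch of a COREVQA-style entailment evaluation.
# Dataset id, split, and field names below are hypothetical placeholders.
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("username/COREVQA", split="test")  # hypothetical dataset id/split

# Any VQA-capable VLM could be plugged in here; this model is illustrative.
vlm = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

correct = 0
for ex in ds:
    # Assumed fields: "image" (PIL image), "statement" (str), "label" (bool).
    question = f'True or false: {ex["statement"]}'
    answer = vlm(image=ex["image"], question=question, top_k=1)[0]["answer"]
    predicted_true = answer.strip().lower() in {"true", "yes"}
    correct += int(predicted_true == bool(ex["label"]))

print(f"accuracy = {correct / len(ds):.2%}")
```

Because each statement is binary, overall accuracy is the natural headline metric, which is how the abstract reports the 39.98%–69.95% range for weaker models.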
