arXiv:2501.04568

Feedback-Driven Vision-Language Alignment with Minimal Human Supervision

Published on Jan 8, 2025

Abstract

AI-generated summary

SVP, a novel framework using self-captioning and a pre-trained grounding model, enhances vision-language alignment with minimal manually curated data, improving captioning, object recall, and hallucination control.

Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curating these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Sampling-based Visual Projection), a novel framework that enhances vision-language alignment without relying on manually curated image-text pairs or preference annotations. SVP leverages a small set of manually selected images, self-captioning, and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to a 12% increase in object recall, and significantly reduced hallucinations, while maintaining question-answering capabilities. Using SVP, a small VLM achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
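
The abstract describes the feedback mechanism only at a high level. The sketch below illustrates one plausible reading of such a loop: the VLM samples several candidate captions for an image, a grounding model checks how many of the mentioned objects it can actually localize, and the best-grounded caption is kept as a self-training pair. This is a minimal illustration, not the paper's implementation; the names sample_captions, extract_objects, and grounder_verifies are hypothetical placeholders for components the paper does not specify in the abstract.

```python
# Illustrative sketch of a grounding-feedback captioning loop in the
# spirit of SVP. All callables here are hypothetical placeholders,
# not the paper's actual API.

from typing import Callable, List, Tuple


def select_caption_with_grounding_feedback(
    image: object,
    sample_captions: Callable[[object, int], List[str]],  # VLM self-captioning
    extract_objects: Callable[[str], List[str]],          # noun-phrase extraction
    grounder_verifies: Callable[[object, str], bool],     # pre-trained grounding model
    num_samples: int = 8,
) -> Tuple[str, float]:
    """Sample candidate captions, score each by the fraction of mentioned
    objects the grounding model can localize in the image, and return the
    best-scoring caption together with its score."""
    best_caption, best_score = "", -1.0
    for caption in sample_captions(image, num_samples):
        objects = extract_objects(caption)
        if not objects:
            continue  # a caption naming no objects gives no grounding signal
        verified = sum(grounder_verifies(image, obj) for obj in objects)
        score = verified / len(objects)  # grounded-object precision
        if score > best_score:
            best_caption, best_score = caption, score
    return best_caption, best_score
```

Under this reading, the selected (image, caption) pairs would then serve as alignment training data for the VLM, replacing manually curated captions or human preference annotations with feedback from the grounding model.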
