arXiv:2411.12287

CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

Published on Nov 19, 2024
Abstract

The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, which limits their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline that combines image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
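
To make the five-stage pipeline concrete, the sketch below shows one plausible way such an orchestration could be wired up. This is a minimal illustration inferred from the abstract only: every name here (`SearchContext`, `mllm.describe`, `api.search`, `clf.allows`, the policy object, and so on) is a hypothetical placeholder, not the authors' implementation or a published API.

```python
# Hypothetical sketch of the CUE-M multi-stage pipeline described in the
# abstract. All classes, methods, and helpers are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class SearchContext:
    """State passed between pipeline stages."""
    image: bytes
    user_query: str
    image_context: str = ""                       # stage 1 output
    refined_intent: str = ""                      # stage 2 output
    search_queries: list[str] = field(default_factory=list)   # stage 3
    retrieved: list[str] = field(default_factory=list)        # stage 4


def enrich_image_context(ctx, mllm):
    # Stage 1: image context enrichment — describe the image with the MLLM
    # so later stages can reason over text.
    ctx.image_context = mllm.describe(ctx.image)
    return ctx


def refine_intent(ctx, mllm):
    # Stage 2: intent refinement — combine the user's text and the image
    # description into an explicit, disambiguated intent.
    ctx.refined_intent = mllm.refine(ctx.user_query, ctx.image_context)
    return ctx


def generate_queries(ctx, mllm):
    # Stage 3: contextual query generation from the refined intent.
    ctx.search_queries = mllm.generate_queries(ctx.refined_intent)
    return ctx


def call_external_apis(ctx, apis):
    # Stage 4: external API integration — fan queries out to retrieval APIs.
    for query in ctx.search_queries:
        for api in apis:
            ctx.retrieved.extend(api.search(query))
    return ctx


def filter_results(ctx, classifiers, policy):
    # Stage 5: relevance-based filtering — keep a result only if every
    # classifier (image-based, text-based, multimodal) allows it under the
    # instance- and category-level organizational policy.
    ctx.retrieved = [
        result for result in ctx.retrieved
        if all(clf.allows(result, ctx, policy) for clf in classifiers)
    ]
    return ctx


def run_cue_m(image, query, mllm, apis, classifiers, policy):
    ctx = SearchContext(image=image, user_query=query)
    for stage in (enrich_image_context, refine_intent, generate_queries):
        ctx = stage(ctx, mllm)
    ctx = call_external_apis(ctx, apis)
    ctx = filter_results(ctx, classifiers, policy)
    # Final answer grounded in the filtered retrievals.
    return mllm.answer(ctx.refined_intent, ctx.retrieved)
```

One design point worth noting in this reading of the abstract: the safety/relevance filter sits after retrieval rather than after generation, so the ensemble of classifiers gates what evidence the MLLM sees, and the policy object lets an organization tighten or relax filtering per instance or per category without retraining.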
