CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

19 November 2024

Dongyoung Go

Abstract

The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to interpret user intent accurately, employ diverse retrieval strategies, and filter unintended or inappropriate responses effectively, which limits their usefulness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline that combines image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and on public benchmarks for knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.
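The five stages named in the abstract can be read as a chain of functions over a shared state. The following is only a minimal sketch of that control flow, not the authors' implementation: every name in it (SearchState, the stage functions, run_pipeline) is hypothetical, and each stage body is a placeholder standing in for an MLLM call, an external search API, or a learned relevance model that the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchState:
    """Hypothetical container passed from one stage to the next."""
    image_id: str
    user_query: str
    image_context: str = ""
    intent: str = ""
    search_queries: List[str] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)


Stage = Callable[[SearchState], SearchState]


def enrich_image_context(s: SearchState) -> SearchState:
    # Placeholder: a real system would derive context from the image itself.
    s.image_context = f"caption for {s.image_id}"
    return s


def refine_intent(s: SearchState) -> SearchState:
    # Placeholder: combine the user query with the image context.
    s.intent = f"{s.user_query} (about: {s.image_context})"
    return s


def generate_queries(s: SearchState) -> SearchState:
    # Placeholder: a real system might emit several contextual queries.
    s.search_queries = [s.intent]
    return s


def call_external_apis(s: SearchState) -> SearchState:
    # Placeholder: fake retrieval results plus one off-topic document.
    s.documents = [f"result for '{q}'" for q in s.search_queries]
    s.documents.append("unrelated page")
    return s


def filter_by_relevance(s: SearchState) -> SearchState:
    # Placeholder relevance test: keep documents mentioning the user query.
    s.documents = [d for d in s.documents if s.user_query in d]
    return s


# Stage order follows the order listed in the abstract.
PIPELINE: List[Stage] = [
    enrich_image_context,
    refine_intent,
    generate_queries,
    call_external_apis,
    filter_by_relevance,
]


def run_pipeline(image_id: str, user_query: str) -> SearchState:
    state = SearchState(image_id=image_id, user_query=user_query)
    for stage in PIPELINE:
        state = stage(state)
    return state


if __name__ == "__main__":
    final = run_pipeline("img-001", "what is this plant")
    print(final.documents)
```

Keeping each stage as a plain function over one state object makes it straightforward to swap a placeholder for a real component, or to insert a safety classifier between stages, without touching the rest of the chain.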


@article{go2025_2411.12287,
  title={CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model},
  author={Dongyoung Go and Taesun Whang and Chanhee Lee and Hwa-Yeon Kim and Sunghoon Park and Seunghwan Ji and Jinho Kim and Dongchan Kim and Young-Bum Kim},
  journal={arXiv preprint arXiv:2411.12287},
  year={2025}
}
