Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

29 May 2025
Mingyang Mao
Mariela M. Perez-Cabarcas
Utteja Kallakuri
Nicholas R. Waytowich
Xiaomin Lin
Tinoosh Mohsenin
Abstract

To engage effectively in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload cognitive burden from humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while using fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
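
To make the pattern concrete, below is a minimal Python sketch of the general multimodal RAG pipeline the abstract describes: audio transcripts, video-frame captions, and text are embedded into a shared index, the chunks most relevant to a question are retrieved, and a language model answers over that grounded context. Everything here is an illustrative assumption, not the paper's actual implementation: the embedder model, the Chunk layout, and the llm callable are stand-ins.

# Sketch of a multimodal RAG loop: embed, retrieve, then generate.
# Assumes `sentence-transformers` is installed; `llm` is any
# text-in/text-out callable (hypothetical placeholder).
from dataclasses import dataclass
from typing import Callable, List, Optional

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedder

@dataclass
class Chunk:
    modality: str                 # e.g. "audio_transcript", "video_caption", "text"
    content: str                  # textual form of the chunk (transcript, caption, ...)
    embedding: Optional[np.ndarray] = None

def build_index(chunks: List[Chunk], embedder: SentenceTransformer) -> List[Chunk]:
    # Embed each chunk once; normalized vectors make dot product = cosine similarity.
    for c in chunks:
        c.embedding = embedder.encode(c.content, normalize_embeddings=True)
    return chunks

def retrieve(query: str, index: List[Chunk],
             embedder: SentenceTransformer, k: int = 3) -> List[Chunk]:
    # Rank chunks by cosine similarity to the query and keep the top k.
    q = embedder.encode(query, normalize_embeddings=True)
    return sorted(index, key=lambda c: -float(np.dot(q, c.embedding)))[:k]

def answer(query: str, index: List[Chunk],
           embedder: SentenceTransformer, llm: Callable[[str], str]) -> str:
    # Assemble the retrieved multimodal evidence into one grounded prompt.
    hits = retrieve(query, index, embedder)
    ctx = "\n".join(f"[{c.modality}] {c.content}" for c in hits)
    return llm(f"Context:\n{ctx}\n\nQuestion: {query}\nAnswer:")

# Usage: index a few chunks derived from a video's audio and frames.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = build_index([
    Chunk("audio_transcript", "Speaker: please hand me the wrench now."),
    Chunk("video_caption", "A person reaches toward a table with a wrench on it."),
], embedder)

Retrieving only the chunks relevant to the current question, rather than feeding the full video stream to the model, is what lets this pattern use fewer resources and less input data than end-to-end Video-LLMs.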

View on arXiv
@article{mao2025_2505.23990,
  title={Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding},
  author={Mingyang Mao and Mariela M. Perez-Cabarcas and Utteja Kallakuri and Nicholas R. Waytowich and Xiaomin Lin and Tinoosh Mohsenin},
  journal={arXiv preprint arXiv:2505.23990},
  year={2025}
}