Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

We present a systematic investigation of Multi-modal Retrieval Augmented Multi-modal Generation (MRAG), a novel task that enables foundation models to process multi-modal web content and generate multi-modal responses, which exhibit higher information density and better readability. Despite its potential impact, MRAG remains understudied, lacking both comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and employ text-modal metrics and multi-modal metrics based on foundation models for evaluation. We further propose several strategies that enable foundation models to handle MRAG effectively, and construct a training set by filtering high-quality samples with our designed metrics. Our extensive experiments demonstrate the reliability of the proposed metrics, map the landscape of model performance across the designed strategies, and show that our fine-tuned 7B-8B models outperform the state-of-the-art GPT-4o model. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of our designs in the data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.
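To make the task concrete, below is a minimal, hypothetical sketch of an MRAG loop in Python: retrieve multi-modal documents for a query, then generate an interleaved text-and-image response conditioned on them. All names here (Document, retrieve_multimodal, generate_multimodal_answer) are illustrative placeholders, not the paper's implementation; the term-overlap retriever and templated generator merely stand in for a real multi-modal retriever and multi-modal foundation model.

# Hypothetical sketch of an MRAG pipeline; names are placeholders, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class Document:
    """A retrieved web document: text plus the images it embeds."""
    text: str
    image_urls: list[str] = field(default_factory=list)


def retrieve_multimodal(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Placeholder retriever: rank documents by naive term overlap with the query.
    A real system would use a dense (multi-modal) retriever instead."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(terms & set(d.text.lower().split())))
    return scored[:k]


def generate_multimodal_answer(query: str, docs: list[Document]) -> str:
    """Placeholder generator: a real system would prompt a multi-modal LLM with
    the retrieved text and images, producing an interleaved text-and-image
    response (e.g., markdown with image references)."""
    context = "\n\n".join(d.text for d in docs)
    images = [url for d in docs for url in d.image_urls]
    answer = f"[answer to '{query}' conditioned on retrieved context]\n{context}"
    if images:
        answer += f"\n\n![supporting figure]({images[0]})"  # interleave a retrieved image
    return answer


if __name__ == "__main__":
    corpus = [
        Document("The Eiffel Tower is 330 m tall.", ["https://example.com/eiffel.jpg"]),
        Document("Paris is the capital of France."),
    ]
    docs = retrieve_multimodal("How tall is the Eiffel Tower?", corpus)
    print(generate_multimodal_answer("How tall is the Eiffel Tower?", docs))

The key difference from text-only RAG, as the abstract notes, is that both the retrieved context and the generated response carry images as well as text.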
@article{ma2025_2411.16365,
  title   = {Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines},
  author  = {Zi-Ao Ma and Tian Lan and Rong-Cheng Tu and Yong Hu and Yu-Shi Zhu and Tong Zhang and Heyan Huang and Xian-Ling Mao},
  journal = {arXiv preprint arXiv:2411.16365},
  year    = {2025}
}