ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.08826
  4. Cited By
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
v1v2v3 (latest)

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

12 February 2025
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
    RALM
ArXiv (abs)PDFHTML

Papers citing "Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation"

37 / 187 papers shown
Title
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via
  Question Answering
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Zhou Yu
D. Xu
Jun-chen Yu
Ting Yu
Zhou Zhao
Yueting Zhuang
Dacheng Tao
112
474
0
06 Jun 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External
  Knowledge
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
117
1,090
0
31 May 2019
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language
  Feedback
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu
Yupeng Gao
Xiaoxiao Guo
Ziad Al-Halah
Steven J. Rennie
Kristen Grauman
Rogerio Feris
EgoV
121
67
0
30 May 2019
A Survey on Biomedical Image Captioning
A Survey on Biomedical Image Captioning
Vasiliki Kougia
John Pavlopoulos
Ion Androutsopoulos
MedIm
67
83
0
26 May 2019
BERTScore: Evaluating Text Generation with BERT
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
349
5,860
0
21 Apr 2019
Cross-Modal Self-Attention Network for Referring Image Segmentation
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye
Mrigank Rochan
Zhi Liu
Yang Wang
EgoV
57
478
0
09 Apr 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for
  Video-and-Language Research
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
101
555
0
06 Apr 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video
  Analysis
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang
Dajun Ding
Yongming Rao
Yu Zheng
Danyang Zhang
Lili Zhao
Jiwen Lu
Jie Zhou
132
317
0
07 Mar 2019
Deep Learning for Image Super-resolution: A Survey
Deep Learning for Image Super-resolution: A Survey
Zhihao Wang
Jian Chen
Guosheng Lin
SupR
71
1,439
0
16 Feb 2019
MIMIC-CXR-JPG, a large publicly available database of labeled chest
  radiographs
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
Alistair E. W. Johnson
Tom Pollard
Nathaniel R. Greenbaum
M. Lungren
Chih-ying Deng
Yifan Peng
Zhiyong Lu
R. Mark
Seth Berkowitz
Steven Horng
MedIm
98
820
0
21 Jan 2019
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and
  Expert Comparison
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
Jeremy Irvin
Pranav Rajpurkar
M. Ko
Yifan Yu
Silviana Ciurea-Ilcus
...
D. Larson
C. Langlotz
Bhavik Patel
M. Lungren
A. Ng
114
2,601
0
21 Jan 2019
nocaps: novel object captioning at scale
nocaps: novel object captioning at scale
Harsh Agrawal
Karan Desai
Yufei Wang
Xinlei Chen
Rishabh Jain
Mark Johnson
Dhruv Batra
Devi Parikh
Stefan Lee
Peter Anderson
VLM
131
486
0
20 Dec 2018
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement
  Algorithms
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
Kevin Kilgour
Mauricio Zuluaga
Dominik Roblek
Matthew Sharifi
81
197
0
20 Dec 2018
Textual Explanations for Self-Driving Vehicles
Textual Explanations for Self-Driving Vehicles
Jinkyu Kim
Anna Rohrbach
Trevor Darrell
John F. Canny
Zeynep Akata
58
346
0
30 Jul 2018
Representation Learning with Contrastive Predictive Coding
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRLSSL
351
10,356
0
10 Jul 2018
Fashion-Gen: The Generative Fashion Dataset and Challenge
Fashion-Gen: The Generative Fashion Dataset and Challenge
Negar Rostamzadeh
Seyedarian Hosseini
Thomas Boquet
Wojciech Stokowiec
Yanzhe Zhang
Christian Jauvin
C. Pal
39
129
0
21 Jun 2018
COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval
COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval
Xirong Li
Chaoxi Xu
Xiaoxu Wang
Weiyu Lan
Zhengxiong Jia
Gang Yang
Jieping Xu
125
153
0
22 May 2018
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
Silvio Giancola
Mohieddine Amine
Tarek Dghaily
Guohao Li
AI4TS
100
197
0
12 Apr 2018
Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition
  Errors on Listening Comprehension
Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension
Chia-Hsuan Lee
Szu-Lin Wu
Chi-Liang Liu
Hung-yi Lee
66
98
0
01 Apr 2018
DVQA: Understanding Data Visualizations via Question Answering
DVQA: Understanding Data Visualizations via Question Answering
Kushal Kafle
Brian L. Price
Scott D. Cohen
Christopher Kanan
AIMat
82
396
0
24 Jan 2018
Demystifying MMD GANs
Demystifying MMD GANs
Mikolaj Binkowski
Danica J. Sutherland
Michael Arbel
Arthur Gretton
EGVM
169
1,500
0
04 Jan 2018
Localizing Moments in Video with Natural Language
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
118
949
0
04 Aug 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
730
132,363
0
12 Jun 2017
Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Machine Learning: A Survey and Taxonomy
T. Baltrušaitis
Chaitanya Ahuja
Louis-Philippe Morency
111
2,937
0
26 May 2017
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for
  Reading Comprehension
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi
Eunsol Choi
Daniel S. Weld
Luke Zettlemoyer
RALM
222
2,686
0
09 May 2017
Dense-Captioning Events in Videos
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
139
1,249
0
02 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
75
830
0
28 Mar 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in
  Visual Question Answering
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
347
3,270
0
02 Dec 2016
Improved Image Captioning via Policy Gradient optimization of SPIDEr
Improved Image Captioning via Policy Gradient optimization of SPIDEr
Siqi Liu
Zhenhai Zhu
Ning Ye
S. Guadarrama
Kevin Patrick Murphy
153
446
0
01 Dec 2016
SPICE: Semantic Propositional Image Caption Evaluation
SPICE: Semantic Propositional Image Caption Evaluation
Peter Anderson
Basura Fernando
Mark Johnson
Stephen Gould
EGVM
106
1,919
0
29 Jul 2016
Multi30K: Multilingual English-German Image Descriptions
Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott
Stella Frank
K. Simaán
Lucia Specia
VLM
131
590
0
02 May 2016
ShapeNet: An Information-Rich 3D Model Repository
ShapeNet: An Information-Rich 3D Model Repository
Angel X. Chang
Thomas Funkhouser
Leonidas Guibas
Pat Hanrahan
Qi-Xing Huang
...
Shuran Song
Hao Su
Jianxiong Xiao
L. Yi
Feng Yu
3DV
172
5,538
0
09 Dec 2015
Rethinking the Inception Architecture for Computer Vision
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy
Vincent Vanhoucke
Sergey Ioffe
Jonathon Shlens
Z. Wojna
3DVBDL
886
27,412
0
02 Dec 2015
VQA: Visual Question Answering
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
217
5,503
0
03 May 2015
A Dataset for Movie Description
A Dataset for Movie Description
Anna Rohrbach
Marcus Rohrbach
Niket Tandon
Bernt Schiele
VGen
122
502
0
12 Jan 2015
CIDEr: Consensus-based Image Description Evaluation
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
297
4,508
0
20 Nov 2014
Microsoft COCO: Common Objects in Context
Microsoft COCO: Common Objects in Context
Nayeon Lee
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
424
43,814
0
01 May 2014
Previous
1234