Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.08826
Cited By
v1
v2
v3 (latest)
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
12 February 2025
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation"
37 / 187 papers shown
Title
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Zhou Yu
D. Xu
Jun-chen Yu
Ting Yu
Zhou Zhao
Yueting Zhuang
Dacheng Tao
112
474
0
06 Jun 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
117
1,090
0
31 May 2019
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu
Yupeng Gao
Xiaoxiao Guo
Ziad Al-Halah
Steven J. Rennie
Kristen Grauman
Rogerio Feris
EgoV
121
67
0
30 May 2019
A Survey on Biomedical Image Captioning
Vasiliki Kougia
John Pavlopoulos
Ion Androutsopoulos
MedIm
67
83
0
26 May 2019
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
349
5,860
0
21 Apr 2019
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye
Mrigank Rochan
Zhi Liu
Yang Wang
EgoV
57
478
0
09 Apr 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
101
555
0
06 Apr 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang
Dajun Ding
Yongming Rao
Yu Zheng
Danyang Zhang
Lili Zhao
Jiwen Lu
Jie Zhou
132
317
0
07 Mar 2019
Deep Learning for Image Super-resolution: A Survey
Zhihao Wang
Jian Chen
Guosheng Lin
SupR
71
1,439
0
16 Feb 2019
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
Alistair E. W. Johnson
Tom Pollard
Nathaniel R. Greenbaum
M. Lungren
Chih-ying Deng
Yifan Peng
Zhiyong Lu
R. Mark
Seth Berkowitz
Steven Horng
MedIm
98
820
0
21 Jan 2019
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
Jeremy Irvin
Pranav Rajpurkar
M. Ko
Yifan Yu
Silviana Ciurea-Ilcus
...
D. Larson
C. Langlotz
Bhavik Patel
M. Lungren
A. Ng
114
2,601
0
21 Jan 2019
nocaps: novel object captioning at scale
Harsh Agrawal
Karan Desai
Yufei Wang
Xinlei Chen
Rishabh Jain
Mark Johnson
Dhruv Batra
Devi Parikh
Stefan Lee
Peter Anderson
VLM
131
486
0
20 Dec 2018
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
Kevin Kilgour
Mauricio Zuluaga
Dominik Roblek
Matthew Sharifi
81
197
0
20 Dec 2018
Textual Explanations for Self-Driving Vehicles
Jinkyu Kim
Anna Rohrbach
Trevor Darrell
John F. Canny
Zeynep Akata
58
346
0
30 Jul 2018
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRL
SSL
351
10,356
0
10 Jul 2018
Fashion-Gen: The Generative Fashion Dataset and Challenge
Negar Rostamzadeh
Seyedarian Hosseini
Thomas Boquet
Wojciech Stokowiec
Yanzhe Zhang
Christian Jauvin
C. Pal
39
129
0
21 Jun 2018
COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval
Xirong Li
Chaoxi Xu
Xiaoxu Wang
Weiyu Lan
Zhengxiong Jia
Gang Yang
Jieping Xu
125
153
0
22 May 2018
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
Silvio Giancola
Mohieddine Amine
Tarek Dghaily
Guohao Li
AI4TS
100
197
0
12 Apr 2018
Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension
Chia-Hsuan Lee
Szu-Lin Wu
Chi-Liang Liu
Hung-yi Lee
66
98
0
01 Apr 2018
DVQA: Understanding Data Visualizations via Question Answering
Kushal Kafle
Brian L. Price
Scott D. Cohen
Christopher Kanan
AIMat
82
396
0
24 Jan 2018
Demystifying MMD GANs
Mikolaj Binkowski
Danica J. Sutherland
Michael Arbel
Arthur Gretton
EGVM
169
1,500
0
04 Jan 2018
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
118
949
0
04 Aug 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
730
132,363
0
12 Jun 2017
Multimodal Machine Learning: A Survey and Taxonomy
T. Baltrušaitis
Chaitanya Ahuja
Louis-Philippe Morency
111
2,937
0
26 May 2017
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi
Eunsol Choi
Daniel S. Weld
Luke Zettlemoyer
RALM
222
2,686
0
09 May 2017
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
139
1,249
0
02 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
75
830
0
28 Mar 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
347
3,270
0
02 Dec 2016
Improved Image Captioning via Policy Gradient optimization of SPIDEr
Siqi Liu
Zhenhai Zhu
Ning Ye
S. Guadarrama
Kevin Patrick Murphy
153
446
0
01 Dec 2016
SPICE: Semantic Propositional Image Caption Evaluation
Peter Anderson
Basura Fernando
Mark Johnson
Stephen Gould
EGVM
106
1,919
0
29 Jul 2016
Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott
Stella Frank
K. Simaán
Lucia Specia
VLM
131
590
0
02 May 2016
ShapeNet: An Information-Rich 3D Model Repository
Angel X. Chang
Thomas Funkhouser
Leonidas Guibas
Pat Hanrahan
Qi-Xing Huang
...
Shuran Song
Hao Su
Jianxiong Xiao
L. Yi
Feng Yu
3DV
172
5,538
0
09 Dec 2015
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy
Vincent Vanhoucke
Sergey Ioffe
Jonathon Shlens
Z. Wojna
3DV
BDL
886
27,412
0
02 Dec 2015
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
217
5,503
0
03 May 2015
A Dataset for Movie Description
Anna Rohrbach
Marcus Rohrbach
Niket Tandon
Bernt Schiele
VGen
122
502
0
12 Jan 2015
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
297
4,508
0
20 Nov 2014
Microsoft COCO: Common Objects in Context
Nayeon Lee
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
424
43,814
0
01 May 2014
Previous
1
2
3
4