ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.08826
  4. Cited By
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
v1v2v3 (latest)

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

12 February 2025
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
    RALM
ArXiv (abs)PDFHTML

Papers citing "Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation"

50 / 187 papers shown
Title
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.5K
14,699
0
15 Mar 2023
RAMM: Retrieval-augmented Biomedical Visual Question Answering with
  Multi-modal Pre-training
RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
Zheng Yuan
Qiao Jin
Chuanqi Tan
Zhengyun Zhao
Hongyi Yuan
Fei Huang
Songfang Huang
88
27
0
01 Mar 2023
Can Pre-trained Vision and Language Models Answer Visual
  Information-Seeking Questions?
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
99
94
0
23 Feb 2023
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
  Retrieval
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Kuniaki Saito
Kihyuk Sohn
Xiang Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
95
119
0
06 Feb 2023
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang
Aston Zhang
Mu Li
Hai Zhao
George Karypis
Alexander J. Smola
LRM
109
464
0
02 Feb 2023
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
Xiaojie Jin
Bowen Zhang
Weibo Gong
Kai Xu
XueQing Deng
Peng Wang
Zhaoyu Zhang
Xiaohui Shen
Jiashi Feng
25
3
0
19 Jan 2023
GeoDE: a Geographically Diverse Evaluation Dataset for Object
  Recognition
GeoDE: a Geographically Diverse Evaluation Dataset for Object Recognition
V. V. Ramaswamy
S. Lin
Dora Zhao
Aaron B. Adcock
Laurens van der Maaten
Deepti Ghadiyaram
Olga Russakovsky
88
37
0
05 Jan 2023
Enhancing Multi-modal and Multi-hop Question Answering via Structured
  Knowledge and Unified Retrieval-Generation
Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
Qian Yang
Qian Chen
Wen Wang
Baotian Hu
Min Zhang
80
27
0
16 Dec 2022
Faster Maximum Inner Product Search in High Dimensions
Faster Maximum Inner Product Search in High Dimensions
Mo Tiwari
Ryan Kang
Je-Yong Lee
Luke Lee
Chris Piech
Sebastian Thrun
Ilan Shomorony
Martin Jinye Zhang
65
6
0
14 Dec 2022
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with
  Multi-Source Multimodal Knowledge Memory
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Ziniu Hu
Ahmet Iscen
Chen Sun
Zirui Wang
Kai-Wei Chang
Yizhou Sun
Cordelia Schmid
David A. Ross
Alireza Fathi
RALMVLM
89
95
0
10 Dec 2022
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLMVGen
131
331
0
06 Dec 2022
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion
  and Keyword-to-Caption Augmentation
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
129
537
0
12 Nov 2022
LAION-5B: An open large-scale dataset for training next generation
  image-text models
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLMMLLMCLIP
200
3,493
0
16 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature
  Alignment
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
56
22
0
09 Oct 2022
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
  Answering over Images and Text
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
Wenhu Chen
Hexiang Hu
Xi Chen
Pat Verga
William W. Cohen
RALM
65
159
0
06 Oct 2022
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Wenhu Chen
Hexiang Hu
Chitwan Saharia
William W. Cohen
VLM
167
177
0
29 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science
  Question Answering
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELMReLMLRM
283
1,296
0
20 Sep 2022
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
Felix Chern
Blake A. Hechtman
Andy Davis
Ruiqi Guo
David Majnemer
Surinder Kumar
133
24
0
28 Jun 2022
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
69
551
0
03 Jun 2022
End-to-End Multimodal Fact-Checking and Explanation Generation: A
  Challenging Dataset and Models
End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
Barry Menglong Yao
Aditya Shah
Lichao Sun
Jin-Hee Cho
Lifu Huang
MLLMLRM
92
85
0
25 May 2022
Vision-and-Language Pretrained Models: A Survey
Vision-and-Language Pretrained Models: A Survey
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
68
65
0
15 Apr 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for
  Self-Supervised Video Pre-Training
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
226
1,205
0
23 Mar 2022
ChartQA: A Benchmark for Question Answering about Charts with Visual and
  Logical Reasoning
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
134
684
0
19 Mar 2022
RACE: Retrieval-Augmented Commit Message Generation
RACE: Retrieval-Augmented Commit Message Generation
Ensheng Shi
Yanlin Wang
Wei Tao
Lun Du
Hongyu Zhang
Shi Han
Dongmei Zhang
Hongbin Sun
VLM
50
42
0
05 Mar 2022
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLMALM
883
13,176
0
04 Mar 2022
Unsupervised Dense Information Retrieval with Contrastive Learning
Unsupervised Dense Information Retrieval with Contrastive Learning
Gautier Izacard
Mathilde Caron
Lucas Hosseini
Sebastian Riedel
Piotr Bojanowski
Armand Joulin
Edouard Grave
RALM
201
920
0
16 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIPVLM
104
715
0
08 Dec 2021
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Christoph Schuhmann
Richard Vencu
Romain Beaumont
R. Kaczmarczyk
Clayton Mullis
Aarush Katta
Theo Coombes
J. Jitsev
Aran Komatsuzaki
VLMMLLMCLIP
243
1,441
0
03 Nov 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
399
1,109
0
13 Oct 2021
Retrieval Augmented Code Generation and Summarization
Retrieval Augmented Code Generation and Summarization
Md. Rizwan Parvez
W. Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
59
189
0
26 Aug 2021
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language
  Models
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
Zheyuan Liu
Cristian Rodriguez-Opazo
Damien Teney
Stephen Gould
VLM
64
203
0
09 Aug 2021
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel
Ari Holtzman
Maxwell Forbes
Ronan Le Bras
Yejin Choi
CLIP
148
1,582
0
18 Apr 2021
Retrieval Augmentation Reduces Hallucination in Conversation
Retrieval Augmentation Reduces Hallucination in Conversation
Kurt Shuster
Spencer Poff
Moya Chen
Douwe Kiela
Jason Weston
HILM
95
741
0
15 Apr 2021
MultiModalQA: Complex Question Answering over Text, Tables and Images
MultiModalQA: Complex Question Answering over Text, Tables and Images
Alon Talmor
Ori Yoran
Amnon Catav
Dan Lahav
Yizhong Wang
Akari Asai
Gabriel Ilharco
Hannaneh Hajishirzi
Jonathan Berant
LMTD
83
162
0
13 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
161
1,186
0
01 Apr 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIPVLM
967
29,810
0
26 Feb 2021
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Patrick Lewis
Yuxiang Wu
Linqing Liu
Pasquale Minervini
Heinrich Küttler
Aleksandra Piktus
Pontus Stenetorp
Sebastian Riedel
RALM
112
234
0
13 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy
  Text Supervision
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLMCLIP
456
3,893
0
11 Feb 2021
On Modality Bias in the TVQA Dataset
On Modality Bias in the TVQA Dataset
T. Winterbottom
S. Xiao
A. McLean
Noura Al Moubayed
63
35
0
18 Dec 2020
DocVQA: A Dataset for VQA on Document Images
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
144
743
0
01 Jul 2020
Rescaling Egocentric Vision
Rescaling Egocentric Vision
Dima Damen
Hazel Doughty
G. Farinella
Antonino Furnari
Evangelos Kazakos
...
Davide Moltisanti
Jonathan Munro
Toby Perrett
Will Price
Michael Wray
EgoV
94
465
0
23 Jun 2020
Counterfactual VQA: A Cause-Effect Look at Language Bias
Counterfactual VQA: A Cause-Effect Look at Language Bias
Yulei Niu
Kaihua Tang
Hanwang Zhang
Zhiwu Lu
Xiansheng Hua
Ji-Rong Wen
CML
117
402
0
08 Jun 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
859
42,379
0
28 May 2020
ColBERT: Efficient and Effective Passage Search via Contextualized Late
  Interaction over BERT
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Omar Khattab
Matei A. Zaharia
138
1,376
0
27 Apr 2020
Fashionpedia: Ontology, Segmentation, and an Attribute Localization
  Dataset
Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset
Menglin Jia
Mengyun Shi
Mikhail Sirotenko
Huayu Chen
Claire Cardie
B. Hariharan
Hartwig Adam
Serge J. Belongie
74
97
0
26 Apr 2020
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
  of Pre-Trained Transformers
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Wenhui Wang
Furu Wei
Li Dong
Hangbo Bao
Nan Yang
Ming Zhou
VLM
158
1,280
0
25 Feb 2020
Clotho: An Audio Captioning Dataset
Clotho: An Audio Captioning Dataset
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen
98
394
0
21 Oct 2019
PubLayNet: largest dataset ever for document layout analysis
PubLayNet: largest dataset ever for document layout analysis
Xu Zhong
Jianbin Tang
Antonio Jimeno Yepes
47
461
0
16 Aug 2019
ELI5: Long Form Question Answering
ELI5: Long Form Question Answering
Angela Fan
Yacine Jernite
Ethan Perez
David Grangier
Jason Weston
Michael Auli
AI4MHELM
103
624
0
22 Jul 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million
  Narrated Video Clips
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
118
1,207
0
07 Jun 2019
Previous
1234
Next