Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.13549
Cited By
A Survey on Multimodal Large Language Models
23 June 2023
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"A Survey on Multimodal Large Language Models"
50 / 134 papers shown
Title
Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?
Tairan Fu
Miguel González
Javier Conde
Elena Merino-Gómez
Pedro Reviriego
12
0
0
16 May 2025
MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception
Zhengye Zhang
Sirui Zhao
Shifeng Liu
Shukang Yin
Xinglong Mao
Tong Xu
Enhong Chen
MLLM
61
0
0
11 May 2025
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo
Patrícia Izar
Irene Delval
Victor de Napole Gregolin
Nina S. T. Hirata
VGen
42
0
0
08 May 2025
VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models
Fahad Shahbaz Khan
Jun Chen
Youssef Mohamed
Chun-Mei Feng
Mohamed Elhoseiny
VLM
37
0
0
08 May 2025
CM1 - A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models
Fabian Wolf
Oliver Tüselmann
Arthur Matei
Lukas Hennies
Christoph Rass
Gernot A. Fink
53
0
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Jiahui Geng
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks
Baoxia Du
H. Du
Dusit Niyato
Ruidong Li
60
0
0
05 May 2025
DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion
Haoteng Li
Zhao Yang
Zezhong Qian
Gongpeng Zhao
Yuqi Huang
Jun-chen Yu
Huazheng Zhou
Longjun Liu
130
1
0
03 May 2025
Simple Visual Artifact Detection in Sora-Generated Videos
Misora Sugiyama
Hirokatsu Kataoka
EGVM
61
0
0
30 Apr 2025
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis
Daisuke Niizumi
Daiki Takeuchi
Masahiro Yasuda
Binh Thien Nguyen
Yasunori Ohishi
N. Harada
49
0
0
25 Apr 2025
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu
Hua Xu
Peiwu Wang
Haige Zhu
Jie Zhou
Jinchao Zhang
43
0
0
23 Apr 2025
A Survey of AI Agent Protocols
Yuqing Yang
Huacan Chai
Yangqiu Song
S. Qi
Muning Wen
...
Gaowei Chang
Wei Liu
Ying Wen
Yong Yu
Wenbo Zhang
LLMAG
72
2
0
23 Apr 2025
DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
Tianhui Song
Weixin Feng
Shuai Wang
Xinfeng Li
Tiezheng Ge
Bo Zheng
Limin Wang
MoMe
62
0
0
16 Apr 2025
Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra
Yang Bai
Priyadarshan Narayanasamy
Nakul Garg
Nirupam Roy
30
0
0
11 Apr 2025
Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang
Jiale Fu
Chenduo Hao
Xinting Hu
Yingzhe Peng
Xin Geng
Xu Yang
32
0
0
11 Apr 2025
Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation
Yong Li
Bo Liu
Sheng Huang
Zhe Zhang
Xiaotong Yuan
Richang Hong
46
0
0
31 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
77
0
0
26 Mar 2025
GOAL: Global-local Object Alignment Learning
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
186
0
0
22 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
49
1
0
17 Mar 2025
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Jonas Belouadi
Eddy Ilg
M. Keuper
Hideki Tanaka
Masao Utiyama
Raj Dabre
Steffen Eger
Simone Paolo Ponzetto
52
0
0
14 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
217
0
0
10 Mar 2025
Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Kun Xiang
Zhili Liu
Zihao Jiang
Yunshuang Nie
Kaixin Cai
...
Yu-Jie Yuan
J. Han
Lanqing Hong
Hang Xu
Xiaodan Liang
ReLM
LRM
62
6
0
08 Mar 2025
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Wei-Yao Wang
Zhao Wang
Helen Suzuki
Yoshiyuki Kobayashi
LRM
58
1
0
04 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View
Yiyang Liu
James Liang
Ruixiang Tang
Yugyung Lee
Majid Rabbani
...
Raghuveer M. Rao
Lifu Huang
Dongfang Liu
Qifan Wang
Cheng Han
186
0
0
02 Mar 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Zhongyang Li
Ziyue Li
Dinesh Manocha
MoE
53
0
0
27 Feb 2025
Tracking the Copyright of Large Vision-Language Models through Parameter Learning Adversarial Images
Yubo Wang
Jianting Tang
Chaohu Liu
Linli Xu
AAML
63
1
0
23 Feb 2025
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Lijun Li
Zhelun Shi
Xuhao Hu
Bowen Dong
Yiran Qin
Xihui Liu
Lu Sheng
Jing Shao
114
1
0
21 Feb 2025
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education
Yi-Fan Zhang
Hang Li
D. Song
Lichao Sun
Tianlong Xu
Qingsong Wen
LLMAG
LRM
93
2
0
20 Feb 2025
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Jian Jia
Jingtong Gao
Ben Xue
Junhao Wang
Qingpeng Cai
Quan Chen
Xiangyu Zhao
Peng Jiang
Kun Gai
OffRL
77
0
0
18 Feb 2025
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
Yi Fang
Bowen Jin
Jiacheng Shen
Sirui Ding
Qiaoyu Tan
J. Han
59
1
0
17 Feb 2025
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team
Leonid Karlinsky
Assaf Arbelle
Abraham Daniels
A. Nassar
...
Sriram Raghavan
T. Syeda-Mahmood
Peter W. J. Staar
Tal Drory
Rogerio Feris
VLM
AI4TS
114
0
0
14 Feb 2025
Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Shiryu Ueno
Yoshikazu Hayashi
Shunsuke Nakatsuka
Yusei Yamada
Hiroaki Aizawa
K. Kato
MLLM
VLM
105
0
0
13 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
70
3
0
11 Feb 2025
Large Multimodal Models for Low-Resource Languages: A Survey
Marian Lupascu
Ana-Cristina Rogoz
Mihai-Sorin Stupariu
Radu Tudor Ionescu
69
1
0
08 Feb 2025
Large Language Models for Multi-Robot Systems: A Survey
Peihan Li
Zijian An
Shams Abrar
Lifeng Zhou
LM&Ro
LRM
62
6
0
06 Feb 2025
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Jinyang Wu
Mingkuan Feng
Shuai Zhang
Ruihan Jin
Feihu Che
Zengqi Wen
J. Tao
LRM
68
8
0
04 Feb 2025
Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
Bin Zhu
Hui yan Qi
Yinxuan Gui
Jingjing Chen
Chong-Wah Ngo
Ee-Peng Lim
166
1
0
31 Jan 2025
Baichuan-Omni-1.5 Technical Report
Yadong Li
Qingbin Liu
Tao Zhang
Tao Zhang
Tian Jin
...
Jianhua Xu
Haoze Sun
Mingan Lin
Zenan Zhou
Xin Wu
AuLLM
75
13
0
28 Jan 2025
Addressing Bias in Generative AI: Challenges and Research Opportunities in Information Management
Xiahua Wei
Naveen Kumar
Han Zhang
68
5
0
22 Jan 2025
Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models
Tabinda Aman
Mohammad Nadeem
S. Sohail
Mohammad Anas
Min Zhang
VLM
61
1
0
21 Jan 2025
Visual RAG: Expanding MLLM visual knowledge without fine-tuning
Mirco Bonomo
Simone Bianco
VLM
75
5
0
18 Jan 2025
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol
Trilok Padhi
Agnik Saha
Ugur Kursuncu
Mehmet Emin Aktas
49
1
0
17 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
81
2
0
10 Jan 2025
Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner
Yitong Zhou
Mingyue Cheng
Qingyang Mao
Qi Liu
F. Xu
Xin Li
Enhong Chen
LMTD
44
0
0
30 Dec 2024
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
118
9
0
22 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
104
5
0
05 Dec 2024
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Zhangqi Jiang
Junkai Chen
Beier Zhu
Tingjin Luo
Yankun Shen
Xu Yang
108
4
0
23 Nov 2024
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
D. Song
Sicheng Lai
Shunian Chen
Lichao Sun
Benyou Wang
207
0
0
06 Nov 2024
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
Yangning Li
Hai-Tao Zheng
Xinyu Wang
Yong Jiang
Zhen Zhang
...
Hui Wang
Hai-Tao Zheng
Pengjun Xie
Philip S. Yu
Fei Huang
65
18
0
05 Nov 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
43
0
0
26 Oct 2024
1
2
3
Next