Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.15116
Cited By
Large Multimodal Agents: A Survey
23 February 2024
Junlin Xie
Zhihong Chen
Ruifei Zhang
Xiang Wan
Guanbin Li
LM&Ro
LLMAG
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Large Multimodal Agents: A Survey"
29 / 29 papers shown
Title
Manipulating Multimodal Agents via Cross-Modal Prompt Injection
Le Wang
Zonghao Ying
Tianyuan Zhang
Siyuan Liang
Shengshan Hu
Mingchuan Zhang
A. Liu
Xianglong Liu
AAML
118
2
0
19 Apr 2025
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension
Runwei Guan
Ruixiao Zhang
Ningwei Ouyang
Jianan Liu
Ka Lok Man
...
Ming Xu
Jeremy S. Smith
Eng Gee Lim
Yutao Yue
Hui Xiong
163
9
0
21 May 2024
MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion
Sen Li
Ruochen Wang
Cho-Jui Hsieh
Minhao Cheng
Tianyi Zhou
MLLM
LM&Ro
61
3
0
20 Feb 2024
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Xing Han Lù
Zdeněk Kasner
Siva Reddy
78
72
0
08 Feb 2024
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Jian Xie
Kai Zhang
Jiangjie Chen
Tinghui Zhu
Renze Lou
Yuandong Tian
Yanghua Xiao
Yu-Chuan Su
LLMAG
LM&Ro
112
163
0
02 Feb 2024
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Junyang Wang
Haiyang Xu
Jiabo Ye
Mingshi Yan
Weizhou Shen
Ji Zhang
Fei Huang
Jitao Sang
100
125
0
29 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLM
LLMAG
102
18
0
19 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
101
40
0
16 Jan 2024
Supervised Knowledge Makes Large Language Models Better In-context Learners
Linyi Yang
Shuibai Zhang
Zhuohao Yu
Guangsheng Bao
Yidong Wang
...
Ruochen Xu
Weirong Ye
Xing Xie
Weizhu Chen
Yue Zhang
81
19
0
26 Dec 2023
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
Difei Gao
Lei Ji
Zechen Bai
Mingyu Ouyang
Peiran Li
...
Peiyi Wang
Xiangwu Guo
Hengxu Wang
Luowei Zhou
Mike Zheng Shou
LLMAG
51
23
0
20 Dec 2023
See and Think: Embodied Agent in Virtual Environment
Zhonghan Zhao
Wenhao Chai
Xuan Wang
Li Boyi
Shengyu Hao
Shidong Cao
Tianbo Ye
Gaoang Wang
LM&Ro
LLMAG
83
37
0
26 Nov 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
Kevin Qinghong Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG
LM&Ro
116
110
0
13 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
78
119
0
09 Nov 2023
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Licheng Wen
Xuemeng Yang
Daocheng Fu
Xiaofeng Wang
Pinlong Cai
...
Xinyu Cai
Min Dou
Shuanglu Hu
Botian Shi
Yu Qiao
VLM
79
83
0
09 Nov 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Wei-Ge Chen
Irina Spiridonova
Jianwei Yang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
49
36
0
01 Nov 2023
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
Dingyao Yu
Kaitao Song
Peiling Lu
Tianyu He
Xu Tan
Wei Ye
Shikun Zhang
Jiang Bian
LLMAG
73
16
0
18 Oct 2023
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance
Jesse Zhang
Jiahui Zhang
Karl Pertsch
Ziyi Liu
Xiang Ren
Minsuk Chang
Shao-Hua Sun
Joseph J Lim
LLMAG
LM&Ro
141
63
0
16 Oct 2023
Towards Robust Multi-Modal Reasoning via Model Selection
Xiangyan Liu
Rongxue Li
Wei Ji
Tao Lin
LLMAG
LRM
60
5
0
12 Oct 2023
How Do Large Language Models Capture the Ever-changing World Knowledge? A Review of Recent Advances
Zihan Zhang
Meng Fang
Lingxi Chen
Mohammad-Reza Namazi-Rad
Jun Wang
KELM
55
24
0
11 Oct 2023
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
Lifan Yuan
Yangyi Chen
Xingyao Wang
Yi R. Fung
Hao Peng
Heng Ji
LLMAG
KELM
88
65
0
29 Sep 2023
The Rise and Potential of Large Language Model Based Agents: A Survey
Zhiheng Xi
Wenxiang Chen
Xin Guo
Wei He
Yiwen Ding
...
Wenjuan Qin
Yongyan Zheng
Xipeng Qiu
Xuanjing Huan
Tao Gui
LM&MA
LM&Ro
3DV
AI4CE
102
924
0
14 Sep 2023
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao
Lei Ji
Luowei Zhou
Kevin Lin
Joya Chen
Zihan Fan
Mike Zheng Shou
MLLM
69
76
0
14 Jun 2023
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai
Junnan Li
Dongxu Li
A. M. H. Tiong
Junqi Zhao
Weisheng Wang
Boyang Albert Li
Pascale Fung
Steven C. H. Hoi
MLLM
VLM
104
2,049
0
11 May 2023
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
E. Azarnasab
Faisal Ahmed
Zicheng Liu
Ce Liu
Michael Zeng
Lijuan Wang
ReLM
KELM
LRM
88
383
0
20 Mar 2023
Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework
Botao Ye
Hong Chang
Bingpeng Ma
Shiguang Shan
Xilin Chen
ViT
87
462
0
22 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
524
4,343
0
28 Jan 2022
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
408
15,486
0
20 Dec 2021
A Systematic Investigation of Commonsense Knowledge in Large Language Models
Xiang Lorraine Li
A. Kuncoro
Jordan Hoffmann
Cyprien de Masson dÁutume
Phil Blunsom
Aida Nematzadeh
LRM
68
58
0
31 Oct 2021
PaddleSeg: A High-Efficient Development Toolkit for Image Segmentation
Yi Liu
Lutao Chu
Guowei Chen
Zewu Wu
Zeyu Chen
Baohua Lai
Yuying Hao
VLM
SSeg
96
71
0
15 Jan 2021
1