ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.11317
  4. Cited By
GUICourse: From General Vision Language Models to Versatile GUI Agents

GUICourse: From General Vision Language Models to Versatile GUI Agents

17 June 2024
Wentong Chen
Junbo Cui
Jinyi Hu
Yujia Qin
Junjie Fang
Yue Zhao
Chongyi Wang
Jun Liu
Guirong Chen
Yupeng Huo
Yuan Yao
Yankai Lin
Zhiyuan Liu
Maosong Sun
    LLMAG
ArXivPDFHTML

Papers citing "GUICourse: From General Vision Language Models to Versatile GUI Agents"

50 / 56 papers shown
Title
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments
Yuheng Lu
Qian Yu
Hongru Wang
Zeming Liu
Wei Su
Yanping Liu
Yuhang Guo
Maocheng Liang
Yunhong Wang
Haifeng Wang
LLMAG
111
0
0
23 May 2025
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang
Zirui Shang
Zhi Gao
Wang Zhang
Rui Xie
Xiaojian Ma
Tao Yuan
Xinxiao Wu
Song-Chun Zhu
Qing Li
LLMAG
87
3
0
17 Apr 2025
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Run Luo
Lu Wang
Wanwei He
Xiaobo Xia
LLMAG
99
28
0
14 Apr 2025
SpiritSight Agent: Advanced GUI Agent with One Look
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
121
3
0
05 Mar 2025
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Vardaan Pahuja
Yadong Lu
Corby Rosset
Boyu Gou
Arindam Mitra
Spencer Whitehead
Yu Su
Ahmed Awadallah
LLMAG
LM&Ro
Presented at ResearchTrend Connect | LLMAG on 14 Mar 2025
185
5
1
17 Feb 2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Yunxing Liu
Pengxiang Li
Zishu Wei
C. Xie
Xueyu Hu
Xinchen Xu
Shengyu Zhang
Xiaotian Han
Hongxia Yang
Leilei Gan
LLMAG
LRM
91
15
0
08 Jan 2025
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
S. Yu
C. Tang
Bokai Xu
Junbo Cui
Junhao Ran
...
Zhenghao Liu
Shuo Wang
Xu Han
Zhiyuan Liu
Maosong Sun
VLM
119
30
0
14 Oct 2024
TinyClick: Single-Turn Agent for Empowering GUI Automation
TinyClick: Single-Turn Agent for Empowering GUI Automation
Pawel Pawlowski
Krystian Zawistowski
Wojciech Lapacz
Marcin Skorupa
Adam Wiacek
Sebastien Postansque
Jakub Hoscilowicz
LRM
LLMAG
MLLM
72
6
0
09 Oct 2024
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao
Tianyu Yu
Ao Zhang
Chongyi Wang
Junbo Cui
...
Xu Han
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
VLM
MLLM
72
403
0
03 Aug 2024
Exploring Perceptual Limitation of Multimodal Large Language Models
Exploring Perceptual Limitation of Multimodal Large Language Models
Jiarui Zhang
Jinyi Hu
Mahyar Khayatkhoei
Filip Ilievski
Maosong Sun
LRM
45
10
0
12 Feb 2024
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Xing Han Lù
Zdeněk Kasner
Siva Reddy
49
67
0
08 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
139
51
0
07 Feb 2024
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual
  Perception
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Junyang Wang
Haiyang Xu
Jiabo Ye
Mingshi Yan
Weizhou Shen
Ji Zhang
Fei Huang
Jitao Sang
69
119
0
29 Jan 2024
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Kanzhi Cheng
Qiushi Sun
Yougang Chu
Fangzhi Xu
Yantao Li
Jianbing Zhang
Zhiyong Wu
LLMAG
194
163
0
17 Jan 2024
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Yichen Zhu
Minjie Zhu
Ning Liu
Zhicai Ou
Xiaofeng Mou
Jian Tang
102
96
0
04 Jan 2024
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu
Saining Xie
LRM
73
143
0
21 Dec 2023
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
  Fine-grained Correctional Human Feedback
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
M. Steyvers
Yuan Yao
Haoye Zhang
Taiwen He
Yifeng Han
...
Xinyue Hu
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
Tat-Seng Chua
MLLM
VLM
163
198
0
01 Dec 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone
  GUI Navigation
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
Kevin Qinghong Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG
LM&Ro
100
105
0
13 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
76
118
0
09 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM
MLLM
59
65
0
07 Nov 2023
Qwen Technical Report
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
174
1,756
0
28 Sep 2023
Aligning Large Multimodal Models with Factually Augmented RLHF
Aligning Large Multimodal Models with Factually Augmented RLHF
Zhiqing Sun
Sheng Shen
Shengcao Cao
Haotian Liu
Chunyuan Li
...
Liangyan Gui
Yu-Xiong Wang
Yiming Yang
Kurt Keutzer
Trevor Darrell
VLM
84
351
0
25 Sep 2023
You Only Look at Screens: Multimodal Chain-of-Action Agents
You Only Look at Screens: Multimodal Chain-of-Action Agents
Zhuosheng Zhang
Aston Zhang
LLMAG
LM&Ro
40
107
0
20 Sep 2023
LASER: LLM Agent with State-Space Exploration for Web Navigation
LASER: LLM Agent with State-Space Exploration for Web Navigation
Kaixin Ma
Hongming Zhang
Hongwei Wang
Xiaoman Pan
Wenhao Yu
Dong Yu
LLMAG
46
41
0
15 Sep 2023
Nougat: Neural Optical Understanding for Academic Documents
Nougat: Neural Optical Understanding for Academic Documents
Lukas Blecher
Guillem Cucurull
Thomas Scialom
Robert Stojnic
ViT
39
114
0
25 Aug 2023
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across
  Languages
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu
Yuan Yao
Chong Wang
Shanonan Wang
Yinxu Pan
...
Yankai Lin
Jiao Xue
Dahai Li
Zhiyuan Liu
Maosong Sun
MLLM
VLM
57
53
0
23 Aug 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou
Frank F. Xu
Hao Zhu
Xuhui Zhou
Robert Lo
...
Tianyue Ou
Yonatan Bisk
Daniel Fried
Uri Alon
Graham Neubig
LLMAG
90
420
0
25 Jul 2023
A Real-World WebAgent with Planning, Long Context Understanding, and
  Program Synthesis
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Izzeddin Gur
Hiroki Furuta
Austin Huang
Mustafa Safdari
Yutaka Matsuo
Douglas Eck
Aleksandra Faust
LM&Ro
LLMAG
90
210
0
24 Jul 2023
Android in the Wild: A Large-Scale Dataset for Android Device Control
Android in the Wild: A Large-Scale Dataset for Android Device Control
Christopher Rawles
Alice Li
Daniel Rodriguez
Oriana Riva
Timothy Lillicrap
LM&Ro
56
146
0
19 Jul 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLM
ObjD
VLM
89
735
0
26 Jun 2023
AssistGPT: A General Multi-modal Assistant that can Plan, Execute,
  Inspect, and Learn
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao
Lei Ji
Luowei Zhou
Kevin Lin
Joya Chen
Zihan Fan
Mike Zheng Shou
MLLM
61
74
0
14 Jun 2023
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer
  Control
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng
Rongpin Wang
Xinrun Wang
Bo An
LLMAG
41
61
0
13 Jun 2023
Mind2Web: Towards a Generalist Agent for the Web
Mind2Web: Towards a Generalist Agent for the Web
Xiang Deng
Yu Gu
Boyuan Zheng
Shijie Chen
Samuel Stevens
Boshi Wang
Huan Sun
Yu-Chuan Su
LLMAG
63
448
0
09 Jun 2023
From Pixels to UI Actions: Learning to Follow Instructions via Graphical
  User Interfaces
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Peter Shaw
Mandar Joshi
James Cohan
Jonathan Berant
Panupong Pasupat
Hexiang Hu
Urvashi Khandelwal
Kenton Lee
Kristina Toutanova
LLMAG
LM&Ro
52
55
0
31 May 2023
AdaPlanner: Adaptive Planning from Feedback with Language Models
AdaPlanner: Adaptive Planning from Feedback with Language Models
Haotian Sun
Yuchen Zhuang
Lingkai Kong
Bo Dai
Chao Zhang
LLMAG
46
131
0
26 May 2023
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta
Kuang-Huei Lee
Ofir Nachum
Yutaka Matsuo
Aleksandra Faust
S. Gu
Izzeddin Gur
LM&Ro
81
96
0
19 May 2023
WebCPM: Interactive Web Search for Chinese Long-form Question Answering
WebCPM: Interactive Web Search for Chinese Long-form Question Answering
Yujia Qin
Zihan Cai
Di Jin
Lan Yan
Shi Liang
...
Ruobing Xie
Fanchao Qi
Zhiyuan Liu
Maosong Sun
Jie Zhou
RALM
42
92
0
11 May 2023
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large
  Language Models
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLM
MLLM
105
1,978
0
20 Apr 2023
Visual Instruction Tuning
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
370
4,607
0
17 Apr 2023
Language Models can Solve Computer Tasks
Language Models can Solve Computer Tasks
Geunwoo Kim
Pierre Baldi
Stephen Marcus McAleer
LLMAG
LM&Ro
86
350
0
30 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
938
12,840
0
27 Feb 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
391
4,465
0
30 Jan 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language
  Understanding
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
239
272
0
07 Oct 2022
WebShop: Towards Scalable Real-World Web Interaction with Grounded
  Language Agents
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
Shunyu Yao
Howard Chen
John Yang
Karthik Narasimhan
LLMAG
LM&Ro
65
472
0
04 Jul 2022
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
Liangtai Sun
Xingyu Chen
Lu Chen
Tianle Dai
Zichen Zhu
Kai Yu
LLMAG
53
54
0
23 May 2022
A data-driven approach for learning to control computers
A data-driven approach for learning to control computers
Peter C. Humphreys
David Raposo
Tobias Pohlen
Gregory Thornton
Rachita Chhaparia
...
Josh Abramson
Petko Georgiev
Alex Goldin
Adam Santoro
Timothy Lillicrap
45
99
0
16 Feb 2022
A Dataset for Interactive Vision-Language Navigation with Unknown
  Command Feasibility
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
Andrea Burns
Deniz Arsan
Sanjna Agrawal
Ranjitha Kumar
Kate Saenko
Bryan A. Plummer
65
62
0
04 Feb 2022
WebGPT: Browser-assisted question-answering with human feedback
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano
Jacob Hilton
S. Balaji
Jeff Wu
Ouyang Long
...
Gretchen Krueger
Kevin Button
Matthew Knight
B. Chess
John Schulman
ALM
RALM
169
1,241
0
17 Dec 2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
UIBert: Learning Generic Multimodal Representations for UI Understanding
Chongyang Bai
Xiaoxue Zang
Ying Xu
Srinivas Sunkara
Abhinav Rastogi
Jindong Chen
Blaise Agüera y Arcas
54
92
0
29 Jul 2021
Grounding Open-Domain Instructions to Automate Web Support Tasks
Grounding Open-Domain Instructions to Automate Web Support Tasks
N. Xu
Sam Masling
Michael Du
Giovanni Campagna
Larry Heck
James A. Landay
M. Lam
LLMAG
AI4TS
30
41
0
30 Mar 2021
12
Next