Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
arXiv:2010.04295 · 8 October 2020
Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
Papers citing "Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements" (28 papers shown)
UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
Longxi Gao, Li Zhang, Mengwei Xu · 18 May 2025

A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Ada Chen, Yongjiang Wu, Jingyang Zhang, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang · ELM · 16 May 2025

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu · 15 Apr 2025

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, ..., David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar · 19 Mar 2025

SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan · LLMAG · 05 Mar 2025

MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions
Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, Rui Yan · 24 Feb 2025

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Yunxing Liu, Pengxiang Li, Zishu Wei, C. Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu · LLMAG, LRM · 08 Jan 2025

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su · LM&Ro, LLMAG · 07 Oct 2024

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, ..., Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang · ELM · 11 Jun 2024

MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling
Sidong Feng, Suyu Ma, Han Wang, David Kong, Chunyang Chen · 11 May 2024

Benchmarking Mobile Device Control Agents across Diverse Configurations
Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, Kimin Lee · 25 Apr 2024

DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bing-Li Wang, Kai Dong, ..., Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan · VLM · 08 Mar 2024

Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto · VLM, CLIP · 05 Mar 2024

AI Assistance for UX: A Literature Review Through Human-Centered AI
Yuwen Lu, Yuewen Yang, Qinyi Zhao, Chengzhi Zhang, Toby Jia-Jun Li · 08 Feb 2024
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, ..., Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut · MLLM, VLM · 13 Oct 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, ..., Mojtaba Seyedhosseini, A. Angelova, Xiaohua Zhai, N. Houlsby, Radu Soricut · VLM · 29 May 2023

MenuCraft: Interactive Menu System Design with Large Language Models
Amir Hossein Kargaran, Nafiseh Nikeghbal, Abbas Heydarnoori, Hinrich Schütze · LLMAG · 08 Mar 2023

Screen Correspondence: Mapping Interchangeable Elements between UIs
Jason Wu, Amanda Swearngin, Xiaoyi Zhang, Jeffrey Nichols, Jeffrey P. Bigham · 20 Jan 2023

UGIF: UI Grounded Instruction Following
S. Venkatesh, Partha P. Talukdar, S. Narayanan · 14 Nov 2022

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova · CLIP, VLM · 07 Oct 2022

MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li · LLMAG · 29 Sep 2022

Enabling Conversational Interaction with Mobile UI using Large Language Models
Bryan Wang, Gang Li, Yang Li · 18 Sep 2022
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
Yu-Chung Hsiao, Fedir Zubach, Maria Wang, Jindong Chen, Victor Carbune, Jason Lin, Yun Zhu · RALM · 16 Sep 2022
Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale
Gang Li, Gilles Baechler, Manuel Tragut, Yang Li · 11 Jan 2022

VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling
Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, A. Gritsenko · MLLM · 10 Dec 2021

Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li · 07 Aug 2021

UIBert: Learning Generic Multimodal Representations for UI Understanding
Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas · 29 Jul 2021

Screen2Vec: Semantic Embedding of GUI Screens and GUI Components
Toby Jia-Jun Li, Lindsay Popowski, Tom Michael Mitchell, Brad A. Myers · 11 Jan 2021