

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

21 January 2025
Yujia Qin
Yining Ye
Junjie Fang
Han Wang
Shihao Liang
Shizuo Tian
Junda Zhang
Jiahao Li
Yuezun Li
Shijue Huang
Wanjun Zhong
Keqin Li
Jiale Yang
Yu Miao
Woyu Lin
Longxiang Liu
Xu Jiang
Qianli Ma
Junlong Li
Xiaojun Xiao
Kai Cai
Chong Li
Yaowei Zheng
Chaolin Jin
Cuiping Li
Xiao Zhou
Minchao Wang
Hao Chen
Zhiyu Li
Haihua Yang
Haifeng Liu
F. Lin
Tao Peng
Xin Liu
Guang Shi
Abstract

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA results on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, on the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). On AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making via multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
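To make the "screenshots in, human-like actions out" design and the unified action space more concrete, here is a minimal sketch of what such an agent loop could look like. This is an illustration inferred from the abstract, not code from the UI-TARS release: the ActionType values, the Action dataclass, and the model/env interfaces (predict, screenshot, execute) are all assumed names.

```python
# Hypothetical sketch of a platform-agnostic GUI action space and a
# screenshot-in / action-out agent loop, as the abstract describes.
# All names here are illustrative assumptions, not the UI-TARS API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    """Primitive GUI operations shared across desktop, web, and mobile."""
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    TYPE = "type"
    SCROLL = "scroll"
    HOTKEY = "hotkey"
    FINISH = "finish"


@dataclass
class Action:
    """One grounded action: an operation, an optional screen point, and text."""
    type: ActionType
    point: Optional[Tuple[int, int]] = None  # pixel coordinates on the screenshot
    text: Optional[str] = None               # payload for TYPE / HOTKEY


def run_episode(model, env, instruction: str, max_steps: int = 50) -> bool:
    """Perceive-act loop: the model sees only raw screenshots plus its history.

    Assumed interfaces: `model.predict` returns a reasoning string ("thought")
    and a grounded Action; `env` exposes screenshot() and execute().
    """
    history: list[tuple[str, Action]] = []
    for _ in range(max_steps):
        screenshot = env.screenshot()                        # pixels only, no DOM
        thought, action = model.predict(instruction, screenshot, history)
        if action.type is ActionType.FINISH:
            return True                                      # task declared done
        env.execute(action)                                  # keyboard/mouse op
        history.append((thought, action))                    # context for multi-step reasoning
    return False
```

Keeping the thought alongside each executed action in the history is one plausible way to support the deliberate, multi-step reasoning patterns (task decomposition, reflection, milestone recognition) that the abstract attributes to System-2 Reasoning; the step budget mirrors the 15- and 50-step OSWorld settings reported above.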
