
TinyClick: Single-Turn Agent for Empowering GUI Automation

Main: 3 pages, 2 figures, 8 tables; Appendix: 2 pages
Abstract

We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates strong performance on ScreenSpot and OmniACT while maintaining a compact size of 0.27B parameters and minimal latency. The main improvements come from multi-task training and MLLM-based data augmentation. Manually annotated corpora are scarce, but we show that MLLM-based augmentation can yield better results. On ScreenSpot and OmniACT, our model outperforms both GUI-specific models (e.g., SeeClick) and general-purpose MLLMs (e.g., GPT-4V).

@article{pawlowski2025_2410.11871,
  title={TinyClick: Single-Turn Agent for Empowering GUI Automation},
  author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Adam Wiacek and Marcin Skorupa and Sebastien Postansque and Jakub Hoscilowicz},
  journal={arXiv preprint arXiv:2410.11871},
  year={2025}
}