
CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team
Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
Main: 9 pages · Bibliography: 2 pages · Appendix: 15 pages · 16 figures · 3 tables
Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for more diverse use cases that require agents to combine them. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require the flexible composition of vision, search, and coding. Each task is specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only a 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
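To make the task format concrete, here is a minimal sketch of a task specified only by an instruction plus an automatic evaluation function over the agent's final output. All names (CocoaTask, check_final_output, the example instruction) are illustrative assumptions, not the paper's actual API.

    # Hypothetical sketch of a CocoaBench-style task specification:
    # an instruction plus an automatic check over the final output only.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class CocoaTask:
        instruction: str                 # natural-language task given to the agent
        evaluate: Callable[[str], bool]  # automatic check over the final output

    def check_final_output(final_output: str) -> bool:
        # Example evaluator: verify the agent reported a required value.
        return "42" in final_output

    task = CocoaTask(
        instruction="Find the answer discussed on the page and report it.",
        evaluate=check_final_output,
    )

    # Any scaffold can run the task end to end; the evaluator scores only
    # the final output, independent of the agent's internal trajectory.
    print(task.evaluate("The answer is 42."))  # True

Because the evaluator touches only the final output, the same task can be scored identically regardless of which scaffold, toolset, or model backbone produced it, which is what makes the evaluation scalable across diverse agent infrastructures.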
