
CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team
Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
Main: 9 pages · Bibliography: 2 pages · Appendix: 15 pages · 16 figures · 3 tables
Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for more diverse use cases that require agents to combine them. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require the flexible composition of vision, search, and coding. Each task is specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only a 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
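To make the task format concrete, here is a minimal sketch of a task specified only by an instruction plus an automatic evaluation function over the agent's final output. All names (CocoaTask, check_final_output, the example instruction) are illustrative assumptions, not the paper's actual API.

    # Hypothetical sketch of a CocoaBench-style task specification:
    # an instruction plus an automatic check over the final output only.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class CocoaTask:
        instruction: str                 # natural-language task given to the agent
        evaluate: Callable[[str], bool]  # automatic check over the final output

    def check_final_output(final_output: str) -> bool:
        # Example evaluator: verify the agent reported a required value.
        return "42" in final_output

    task = CocoaTask(
        instruction="Find the answer discussed on the page and report it.",
        evaluate=check_final_output,
    )

    # Any scaffold can run the task end to end; the evaluator scores only
    # the final output, independent of the agent's internal trajectory.
    print(task.evaluate("The answer is 42."))  # True

Because the evaluator touches only the final output, the same task can be scored identically regardless of which scaffold, toolset, or model backbone produced it, which is what makes the evaluation scalable across diverse agent infrastructures.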
