FrontierCS: Evolving Challenges for Evolving Intelligence

17 December 2025

Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Wesley Zheng, Meiyuwang Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zexing Chen, Kangyang Zhou, Zhaozi Wang, Jingbang Chen, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung

Main: 23 pages; 17 figures; bibliography: 5 pages; 2 tables

Abstract

We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.