
COMPOSITE-STEM

Kyle Waters
Lucas Nuzzi
Tadhg Looram
Alessandro Tomasiello
Ariel Ghislain Kemogne Kamdoum
Bikun Li
Damien Sileo
Egor Kretov
Francesco Fournier-Facio
Georgios Soloupis
Haile Kassahun
Hew Wolff
Jiaqi Cai
Lianghui Li
Marc Roth
Mohinder Naiya
Naixu Guo
Qicheng Tang
Richard Wheeler
Samuele Sala
Serguei Popov
Steven Dillmann
Yuqi Li
Main: 10 pages; Bibliography: 1 page; Appendix: 2 pages; 5 figures; 5 tables
Abstract

AI agents hold growing promise for accelerating scientific discovery, yet a lack of frontier evaluations hinders their adoption into real research workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most have become saturated and measure performance only on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. The benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond the reach of current agents. All tasks are open-sourced with contributor permission to support reproducibility and to promote further research toward AI-accelerated scientific progress in these domains.
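
To illustrate the kind of grading the abstract describes, the following is a minimal Python sketch of combining normalized exact-match grading with weighted rubric criteria scored by a jury of judge models. It is not the paper's implementation; the RubricCriterion structure and the query_judge callable are hypothetical placeholders standing in for whatever rubric format and judge interface the benchmark actually uses.

    # Minimal sketch (not COMPOSITE-STEM's implementation) of exact-match plus
    # rubric-based LLM-jury grading; query_judge and the rubric format are
    # hypothetical placeholders.
    from dataclasses import dataclass

    @dataclass
    class RubricCriterion:
        description: str   # what a satisfactory answer must contain
        weight: float      # relative importance of this criterion

    def grade_exact(answer: str, reference: str) -> float:
        """Return 1.0 on a normalized exact match, else 0.0."""
        return float(answer.strip().lower() == reference.strip().lower())

    def grade_with_jury(answer: str, criteria: list[RubricCriterion],
                        judges: list[str], query_judge) -> float:
        """Average the weighted rubric score over several judge models (the jury).

        query_judge(judge, answer, criterion_description) is assumed to return
        True if that judge deems the criterion satisfied.
        """
        total_weight = sum(c.weight for c in criteria)
        per_judge_scores = []
        for judge in judges:
            met = sum(c.weight for c in criteria
                      if query_judge(judge, answer, c.description))
            per_judge_scores.append(met / total_weight)
        return sum(per_judge_scores) / len(per_judge_scores)

Tasks with a single well-defined answer would be scored by grade_exact, while open-ended outputs would fall back to grade_with_jury; the split between the two modes in the actual benchmark is described in the paper, not assumed here.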
