AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan
Nicolas Baldwin
Edan Toledo
Rishi Hazra
Michael Kuchnik
Bassel Al Omari
Thomas Simon Foster
Anton Protopopov
Jean-Christophe Gagnon-Audet
Ishita Mediratta
Kelvin Niu
Michael Shvartsman
Alisia Lupidi
Alexis Audran-Reiss
Parth Pathak
Tatiana Shavrina
Despoina Magka
Hela Momand
Derek Dunfield
Nicola Cancedda
Pontus Stenetorp
Carole-Jean Wu
Jakob Nicolaus Foerster
Yoram Bachrach
Martin Josifoski
Main: 13 pages · 5 figures · 3 tables · Bibliography: 3 pages · Appendix: 2 pages
Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap, where validation-based selection causes overfitting and degrades performance over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators, which imposes a ceiling on search performance. We introduce AIRA_2, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that scales experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA_2^† achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours, outperforming the strongest baseline, which achieves 72.7%. On AIRS-Bench, AIRA_2 exceeds human state-of-the-art on 6 out of 20 diverse research tasks. Ablations confirm that each architectural component is necessary, that performance follows a predictable scaling law that transfers across LLM backbones, and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
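The first architectural choice, an asynchronous multi-GPU worker pool, can be illustrated with a minimal sketch. The paper's actual implementation is not shown here; the code below is a hypothetical stand-in in which each "GPU" is a slot id, experiments are queued, and whichever worker frees up first claims the next experiment, so throughput grows with the number of workers instead of being bound to a single synchronous device.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only: run_experiment is a hypothetical stand-in
# for training and evaluating one candidate solution on one GPU.
def run_experiment(config, gpu_id):
    return {"config": config, "gpu": gpu_id, "score": sum(config) % 7}

def async_worker_pool(experiments, num_gpus=4):
    # Free GPU slots are tracked in a thread-safe queue.
    gpu_slots = queue.Queue()
    for g in range(num_gpus):
        gpu_slots.put(g)

    def worker(config):
        gpu = gpu_slots.get()       # claim a free GPU slot (blocks if all busy)
        try:
            return run_experiment(config, gpu)
        finally:
            gpu_slots.put(gpu)      # release the slot for the next experiment

    # Experiments run concurrently, one per available slot.
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        return list(pool.map(worker, experiments))

results = async_worker_pool([(i, i + 1) for i in range(8)], num_gpus=4)
print(len(results))  # 8 experiments completed across 4 concurrent slots
```

Because completed workers immediately pick up the next queued experiment, idle time per device stays low and throughput scales roughly linearly with `num_gpus`, matching the bottleneck the abstract describes.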
