AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan
Nicolas Baldwin
Edan Toledo
Rishi Hazra
Michael Kuchnik
Bassel Al Omari
Thomas Simon Foster
Anton Protopopov
Jean-Christophe Gagnon-Audet
Ishita Mediratta
Kelvin Niu
Michael Shvartsman
Alisia Lupidi
Alexis Audran-Reiss
Parth Pathak
Tatiana Shavrina
Despoina Magka
Hela Momand
Derek Dunfield
Nicola Cancedda
Pontus Stenetorp
Carole-Jean Wu
Jakob Nicolaus Foerster
Yoram Bachrach
Martin Josifoski
Main: 13 pages · 5 figures · 3 tables · Bibliography: 3 pages · Appendix: 2 pages
Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap, where validation-based selection causes overfitting and degrades performance over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators, which imposes a ceiling on search performance. We introduce AIRA_2, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that scales experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA_2^† achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours, outperforming the strongest baseline, which achieves 72.7%. On AIRS-Bench, AIRA_2 exceeds human state-of-the-art on 6 out of 20 diverse research tasks. Ablations confirm that each architectural component is necessary, that performance follows a predictable scaling law that transfers across LLM backbones, and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
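The first architectural choice, an asynchronous multi-GPU worker pool, can be illustrated with a minimal sketch. The paper's actual implementation is not shown here; the code below is a hypothetical stand-in in which each "GPU" is a slot id, experiments are queued, and whichever worker frees up first claims the next experiment, so throughput grows with the number of workers instead of being bound to a single synchronous device.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only: run_experiment is a hypothetical stand-in
# for training and evaluating one candidate solution on one GPU.
def run_experiment(config, gpu_id):
    return {"config": config, "gpu": gpu_id, "score": sum(config) % 7}

def async_worker_pool(experiments, num_gpus=4):
    # Free GPU slots are tracked in a thread-safe queue.
    gpu_slots = queue.Queue()
    for g in range(num_gpus):
        gpu_slots.put(g)

    def worker(config):
        gpu = gpu_slots.get()       # claim a free GPU slot (blocks if all busy)
        try:
            return run_experiment(config, gpu)
        finally:
            gpu_slots.put(gpu)      # release the slot for the next experiment

    # Experiments run concurrently, one per available slot.
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        return list(pool.map(worker, experiments))

results = async_worker_pool([(i, i + 1) for i in range(8)], num_gpus=4)
print(len(results))  # 8 experiments completed across 4 concurrent slots
```

Because completed workers immediately pick up the next queued experiment, idle time per device stays low and throughput scales roughly linearly with `num_gpus`, matching the bottleneck the abstract describes.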
