[Submitted on 27 Mar 2026 (v1), last revised 13 Apr 2026 (this version, v2)]

Title: AIRA_2: Overcoming Bottlenecks in AI Research Agents

Authors: Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski

Abstract: Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap, where validation-based selection causes overfitting and degrades performance over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators, which imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly with the number of workers; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$^{\dagger}_{2}$ achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours, outperforming the strongest baseline, which achieves 72.7%. On AIRS-Bench, AIRA$_2$ exceeds the human state of the art on 6 out of 20 diverse research tasks. Ablations confirm that each architectural component is necessary, that performance follows a predictable scaling law that transfers across LLM backbones, and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
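The first architectural choice, an asynchronous multi-GPU worker pool, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the `run_experiments` helper, the thread-based dispatch, and the GPU-id queue are all assumptions. The key property it demonstrates is that experiments are dispatched to whichever GPU frees up first, so throughput scales with the number of workers rather than being serialized on a single device.

```python
import queue
import concurrent.futures

def run_experiments(experiments, num_gpus):
    """Dispatch experiments asynchronously across a pool of GPU workers.

    Hypothetical sketch: each experiment is a callable taking a gpu_id.
    """
    free_gpus = queue.Queue()
    for gpu_id in range(num_gpus):
        free_gpus.put(gpu_id)

    def worker(exp):
        gpu_id = free_gpus.get()      # block until some GPU is free
        try:
            return exp(gpu_id)        # run the experiment on that GPU
        finally:
            free_gpus.put(gpu_id)     # release the GPU immediately

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(worker, exp) for exp in experiments]
        # as_completed yields results as soon as each experiment finishes,
        # so fast runs are never blocked behind slow ones
        return [f.result() for f in concurrent.futures.as_completed(futures)]

# toy "experiments": each just records which GPU it ran on
results = run_experiments(
    [lambda g, i=i: (i, g) for i in range(8)], num_gpus=4
)
```

Because results arrive in completion order rather than submission order, a search procedure built on top of this pool can act on each finished experiment without waiting for the whole batch.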

Submission history

From: Karen Hambardzumyan [view email]
[v1] Fri, 27 Mar 2026 15:02:43 UTC (12,524 KB)
[v2] Mon, 13 Apr 2026 16:38:00 UTC (13,075 KB)