
Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

Linus Folkerts
Will Payne
Simon Inman
Philippos Giavridis
Joe Skinner
Sam Deverett
James Aung
Ekin Zorer
Michael Schmatz
Mahmoud Ghanem
John Wilkinson
Alan Steer
Vy Hong
Jessica Wang
Main: 7 pages · 7 figures · 1 table · Bibliography: 2 pages · Appendix: 4 pages
Abstract

We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges: a 32-step corporate network attack and a 7-step industrial control system attack, both of which require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing the budget from 10M to 100M tokens yields gains of up to 59%, and requires no particular technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, the average number of steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to complete steps reliably, averaging 1.2-1.4 of 7 steps (maximum 3).
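The log-linear scaling trend can be made concrete with a short sketch. This is an illustration, not the paper's code: it fits a line in log10(tokens) through two anchor values taken from the abstract (roughly 9.8 steps at 10M tokens for Opus 4.6 on the corporate range, and up to 59% more after a tenfold compute increase), so the slope per decade of compute is an assumption derived from those two numbers only.

```python
import math

# Illustrative sketch of the log-linear inference-time-compute trend:
# steps completed grow roughly linearly in log10(tokens), with no
# observed plateau. Anchor values are taken from the abstract
# (Opus 4.6, corporate range); everything else is hypothetical.
t1, s1 = 10e6, 9.8            # 10M tokens  -> 9.8 steps (from abstract)
t2, s2 = 100e6, 9.8 * 1.59    # 100M tokens -> ~15.6 steps (59% gain)

# Fit steps = slope * log10(tokens) + intercept through the two anchors.
slope = (s2 - s1) / (math.log10(t2) - math.log10(t1))
intercept = s1 - slope * math.log10(t1)

def predicted_steps(n_tokens: float) -> float:
    """Extrapolate the fitted log-linear trend (illustrative only)."""
    return slope * math.log10(n_tokens) + intercept

print(round(predicted_steps(10e6), 1))  # -> 9.8, recovering the anchor
```

Under this fit, each tenfold increase in token budget adds a fixed number of completed steps (here about 5.8), which is what "log-linear with no observed plateau" implies; extrapolating far beyond the measured 10M-100M range would, of course, be speculative.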
