Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

10 December 2025
Justin W. Lin
Eliot Krzysztof Jones
Donovan Julian Jasper
Ethan Jun-shen Ho
Anna Wu
Arnold Tianyi Yang
Neil Perry
Andy Zou
Matt Fredrikson
J. Zico Kolter
Percy Liang
Dan Boneh
Daniel E. Ho
Tags: LLMAG
arXiv: 2512.09882 (abs | PDF | HTML)
Main: 10 pages, 5 figures, 6 tables; Bibliography: 3 pages; Appendix: 16 pages
Abstract

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour, versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.
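The abstract gives no implementation details, but a minimal sketch of the three ARTEMIS ingredients it names (dynamic prompt generation, per-target sub-agents, automatic vulnerability triaging) might look like the following. Everything here is a hypothetical illustration under stated assumptions, not the paper's actual scaffold: the llm callable, the Finding fields, and the confidence-threshold triage heuristic are all invented for exposition.

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical LLM interface: prompt in, text out. Stands in for whatever
# model backend a real scaffold would call; not taken from the paper.
LLM = Callable[[str], str]


@dataclass
class Finding:
    host: str
    description: str
    confidence: float  # sub-agent's self-reported estimate, used by the triager


@dataclass
class SubAgent:
    """One agent per target host, driven by a prompt generated for that host."""
    host: str
    prompt: str
    llm: LLM

    def run(self) -> Finding:
        # A real agent would loop: run tools, observe output, re-prompt.
        report = self.llm(self.prompt)
        return Finding(self.host, report, confidence=0.5)


def make_prompt(host: str, recon_notes: str) -> str:
    """Dynamic prompt generation: specialize the instructions per host."""
    return (
        f"You are testing {host}. Prior recon:\n{recon_notes}\n"
        "Enumerate services and report any suspected vulnerability."
    )


def triage(findings: List[Finding], threshold: float = 0.4) -> List[Finding]:
    """Automatic triaging: drop low-confidence reports before submission.
    The paper reports an 82% valid-submission rate; a filter like this is
    one plausible mechanism, not a description of ARTEMIS itself."""
    return [f for f in findings if f.confidence >= threshold]


def run_scaffold(hosts: List[str], recon: Dict[str, str], llm: LLM) -> List[Finding]:
    agents = [SubAgent(h, make_prompt(h, recon.get(h, "none")), llm) for h in hosts]
    # Parallel work across hosts, one advantage the abstract attributes to agents.
    with ThreadPoolExecutor(max_workers=8) as pool:
        findings = list(pool.map(SubAgent.run, agents))
    return triage(findings)


if __name__ == "__main__":
    fake_llm: LLM = lambda prompt: "stub report for: " + prompt.splitlines()[0]
    print(run_scaffold(["10.0.0.5", "10.0.1.7"], {}, fake_llm))

The sketch uses a thread pool purely to mirror the parallel-exploitation point; a production scaffold would also need tool execution, state tracking across steps, and rate limiting, none of which the abstract specifies.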
