Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Main: 9 pages, bibliography: 2 pages, 7 tables
Abstract
Large Language Models (LLMs) have emerged as a powerful driver of offensive penetration-testing tooling. This paper analyzes the methodologies and benchmarking practices used to evaluate LLM-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds.
BibTeX
@article{happe2025_2504.10112,
  title={Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design},
  author={Andreas Happe and Jürgen Cito},
  journal={arXiv preprint arXiv:2504.10112},
  year={2025}
}