Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Main: 9 pages, bibliography: 2 pages, 7 tables
Abstract
Large Language Models (LLMs) have emerged as a powerful driver of offensive penetration-testing tooling. This paper analyzes the methodologies and benchmarking practices used to evaluate LLM-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds.
BibTeX
@article{happe2025_2504.10112,
  title={Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design},
  author={Andreas Happe and Jürgen Cito},
  journal={arXiv preprint arXiv:2504.10112},
  year={2025}
}