Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Comments: 9 pages main text, 2 pages bibliography, 7 tables
Abstract

Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. This paper analyzes the methodology and benchmarking practices used for evaluating LLM-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds.

@article{happe2025_2504.10112,
  title={Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design},
  author={Andreas Happe and Jürgen Cito},
  journal={arXiv preprint arXiv:2504.10112},
  year={2025}
}