
Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

14 April 2025

Papers citing "Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design"

22 / 22 papers shown

Title
Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements I. Isozaki Manil Shrestha Rick Console Edward Kim ELM 102 7 0 24 Feb 2025
RapidPen: Fully Automated IP-to-Shell Penetration Testing with LLM-based Agents Sho Nakatani 117 3 1 23 Feb 2025
Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks A. Happe Jürgen Cito 96 4 0 06 Feb 2025
HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing Lajos Muzsai David Imolai András Lukács LLMAG 113 12 0 02 Dec 2024
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? Benlong Wu Guoqiang Chen Kejiang Chen Xiuwei Shang Jiapeng Han Yanru He Weinan Zhang Nenghai Yu LLMAG 69 5 0 02 Nov 2024
AutoPenBench: Benchmarking Generative Agents for Penetration Testing Luca Gioacchini Marco Mellia Idilio Drago Alexander Delsanto G. Siracusano Roberto Bifulco ELM 65 6 0 04 Oct 2024
CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models Shengye Wan Cyrus Nikolaidis Daniel Song David Molnar James Crnkovich ... Spencer Whitman Stephanie Ding Vlad Ionescu Yue Li Joshua Saxe ELM 86 22 0 02 Aug 2024
PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation Junjie Huang Quanyan Zhu 61 21 0 25 Jul 2024
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities Richard Fang Antony Kellermann Akul Gupta Qiusi Zhan R. Bindu Daniel Kang LLMAG 86 35 0 02 Jun 2024
A Comprehensive Overview of Large Language Models (LLMs) for Cyber Defences: Opportunities and Directions Mohammed Hassanin Nour Moustafa 75 31 0 23 May 2024
Large Language Models for Cyber Security: A Systematic Literature Review HanXiang Xu Shenao Wang Ningke Li Kaidi Wang Yanjie Zhao Kai Chen Ting Yu Yang Liu Haoyu Wang 111 41 0 08 May 2024
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models Manish P Bhatt Sahana Chennabasappa Yue Li Cyrus Nikolaidis Daniel Song ... Yaohui Chen Dhaval Kapil David Molnar Spencer Whitman Joshua Saxe ELM 89 41 0 19 Apr 2024
LLM Agents can Autonomously Exploit One-day Vulnerabilities Richard Fang R. Bindu Akul Gupta Daniel Kang SILM LLMAG 125 66 0 11 Apr 2024
Review of Generative AI Methods in Cybersecurity Yagmur Yigit William J. Buchanan Madjid G Tehrani Leandros A. Maglaras AAML 129 23 0 13 Mar 2024
AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks Jiacen Xu Jack W. Stokes Geoff McDonald Xuesong Bai David Marshall Siyue Wang Adith Swaminathan Zhou Li 90 58 0 02 Mar 2024
An Empirical Evaluation of LLMs for Solving Offensive Security Challenges Minghao Shao Boyuan Chen Sofija Jancheska Brendan Dolan-Gavitt Siddharth Garg Ramesh Karri Mohamed Bennai 79 30 0 19 Feb 2024
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models Manish P Bhatt Sahana Chennabasappa Cyrus Nikolaidis Shengye Wan Ivan Evtimov ... Aleksandar Straumann Gabriel Synnaeve Varun Vontimitta Spencer Whitman Joshua Saxe ELM 95 80 0 07 Dec 2023
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly Yifan Yao Jinhao Duan Kaidi Xu Yuanfang Cai Eric Sun Yue Zhang PILM ELM 97 545 0 04 Dec 2023
LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks A. Happe Aaron Kaplan Jürgen Cito 73 17 0 17 Oct 2023
Understanding Hackers' Work: An Empirical Study of Offensive Security Practitioners A. Happe Jürgen Cito 49 11 0 14 Aug 2023
Getting pwn'd by AI: Penetration Testing with Large Language Models A. Happe Jürgen Cito 68 83 0 24 Jul 2023
Large Language Models Michael R Douglas LLMAG LM&MA 138 642 0 11 Jul 2023