ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.05223
41
0

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

5 February 2025
Buyun Liang
Kwan Ho Ryan Chan
D. Thaker
Jinqi Luo
René Vidal
    AAML
ArXivPDFHTML
Abstract

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.

View on arXiv
@article{liang2025_2502.05223,
  title={ KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs },
  author={ Buyun Liang and Kwan Ho Ryan Chan and Darshan Thaker and Jinqi Luo and René Vidal },
  journal={arXiv preprint arXiv:2502.05223},
  year={ 2025 }
}
Comments on this paper