Recent Advances in Attack and Defense Approaches of Large Language Models



5 September 2024


Papers citing "Recent Advances in Attack and Defense Approaches of Large Language Models"


Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
  Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
  ALM; 133 59 0; 20 Jan 2025

Tamper-Resistant Safeguards for Open-Weight LLMs
  Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
  AAML, MU; 133 63 0; 01 Aug 2024

Does Refusal Training in LLMs Generalize to the Past Tense?
  Maksym Andriushchenko, Nicolas Flammarion
  142 36 0; 16 Jul 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
  Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
  AAML; 206 222 0; 02 Apr 2024

Hallucination is Inevitable: An Innate Limitation of Large Language Models
  Ziwei Xu, Sanjay Jain, Mohan S. Kankanhalli
  HILM, LRM; 172 259 0; 22 Jan 2024

Certifying LLM Safety against Adversarial Prompting
  Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
  AAML; 155 197 0; 06 Sep 2023

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
  Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang
  CLL, KELM; 211 319 0; 17 Aug 2023

What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation
  Vitaly Feldman, Chiyuan Zhang
  TDI; 248 472 0; 09 Aug 2020