LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
arXiv:2408.15221 (v2, latest) · 27 August 2024
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
AAML, MU

Papers citing "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (31 of 31 shown)

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta
AAML · 08 Mar 2025 · 117 / 1 / 0

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xinsong Zhang
AAML · 27 Feb 2025 · 367 / 1 / 0

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
MU, AAML, ELM · 03 Feb 2025 · 126 / 7 / 0

Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb, Fabien Roger
AAML, MU · 11 Oct 2024 · 81 / 28 / 0

Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith
MU · 03 Oct 2024 · 113 / 24 / 0

Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang
AAML · 02 Oct 2024 · 119 / 8 / 0

Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU · 01 Aug 2024 · 101 / 62 / 0

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
16 Jul 2024 · 120 / 35 / 0

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
12 Jul 2024 · 91 / 29 / 0

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
SILM, AAML · 28 Jun 2024 · 88 / 34 / 0

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, ..., Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
26 Jun 2024 · 64 / 72 / 0

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich, Ahmed Salem, Ronen Eldan
02 Apr 2024 · 103 / 98 / 0

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML · 02 Apr 2024 · 155 / 220 / 0

Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
ELM, MU · 26 Feb 2024 · 98 / 81 / 0

PAL: Proxy-Guided Black-Box Attack on Large Language Models
Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
ELM · 15 Feb 2024 · 67 / 34 / 0

Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschlager, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
AAML, SILM · 14 Feb 2024 · 115 / 61 / 0

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
AAML · 25 Jan 2024 · 113 / 88 / 0

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
12 Jan 2024 · 90 / 312 / 0

Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
LLMSV · 09 Dec 2023 · 57 / 220 / 0

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM · 19 Sep 2023 · 185 / 351 / 0

LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
David Glukhov, Ilia Shumailov, Y. Gal, Nicolas Papernot, Vardan Papyan
AAML, ELM · 20 Jul 2023 · 85 / 58 / 0

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM · 29 May 2023 · 387 / 4,125 / 0

Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, ..., Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
SyDa, MoMe · 15 Dec 2022 · 208 / 1,634 / 0

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022 · 880 / 13,148 / 0

Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, G. Irving
AAML · 07 Feb 2022 · 174 / 664 / 0

Unsolved Problems in ML Safety
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
28 Sep 2021 · 242 / 292 / 0

Adversarial Examples Are Not Bugs, They Are Features
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry
SILM · 06 May 2019 · 91 / 1,843 / 0

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
Anish Athalye, Nicholas Carlini, D. Wagner
AAML · 01 Feb 2018 · 243 / 3,194 / 0

Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu
SILM, OOD · 19 Jun 2017 · 310 / 12,117 / 0

Adversarial examples in the physical world
Alexey Kurakin, Ian Goodfellow, Samy Bengio
SILM, AAML · 08 Jul 2016 · 545 / 5,909 / 0

Explaining and Harnessing Adversarial Examples
Ian Goodfellow, Jonathon Shlens, Christian Szegedy
AAML, GAN · 20 Dec 2014 · 280 / 19,107 / 0