Jailbreaking Black Box Large Language Models in Twenty Queries
arXiv:2310.08419, 12 October 2023
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
AAML

Papers citing "Jailbreaking Black Box Large Language Models in Twenty Queries" (50 of 196 papers shown)

Smoothed Embeddings for Robust Language Models
Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, T. Koike-Akino, K. Parsons, Yanjie Wang
AAML | 116 · 2 · 0 | 27 Jan 2025

HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Zihui Wu, Haichang Gao, Jiacheng Luo, Zhaoxiang Liu
155 · 0 · 0 | 23 Jan 2025

Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls
AAML | 154 · 5 · 0 | 22 Jan 2025

An Empirically-grounded tool for Automatic Prompt Linting and Repair: A Case Study on Bias, Vulnerability, and Optimization in Developer Prompts
Dhia Elhaq Rzig, Dhruba Jyoti Paul, Kaiser Pister, Jordan Henkel, Foyzul Hassan
135 · 0 · 0 | 21 Jan 2025

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
ALM | 133 · 59 · 0 | 20 Jan 2025

MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, ..., Hang Su, Jialing Tao, Hui Xue, Jun Zhu
LLMAG | 93 · 10 · 0 | 08 Jan 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
SILM | 120 · 11 · 0 | 08 Jan 2025

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, B. Kailkhura, Tianlong Chen, Kaixiong Zhou
AAML | 117 · 3 · 0 | 05 Jan 2025

LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Sihang Li, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
AAML | 137 · 1 · 0 | 03 Jan 2025

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Changzai Pan, Minlie Huang, Lei Sha
349 · 0 · 0 | 23 Dec 2024

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He
207 · 0 · 0 | 19 Dec 2024

SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation
Qinglin Qi, Yun Luo, Yijia Xu, Wenbo Guo, Yong Fang
AAML | 132 · 2 · 0 | 15 Dec 2024

Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun, Rahul Madhavan, Sravanti Addepalli, A. Suggala, Karthikeyan Shanmugam, Prateek Jain
LRM, SyDa | 107 · 0 · 0 | 03 Dec 2024

In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen
DiffM | 192 · 1 · 0 | 25 Nov 2024

Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation
Ke Zhao, Huayang Huang, Miao Li, Yu Wu
AAML | 114 · 1 · 0 | 21 Nov 2024

Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh
LLMSV | 164 · 18 · 0 | 18 Nov 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Peng Kuang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
132 · 7 · 0 | 17 Nov 2024

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao
AAML | 487 · 1 · 0 | 06 Nov 2024

UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar
125 · 7 · 0 | 03 Nov 2024

SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao, Kejiang Chen, Weinan Zhang, Nenghai Yu
AAML | 140 · 0 · 0 | 03 Nov 2024

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Vishal Kumar, Zeyi Liao, Jaylen Jones, Huan Sun
AAML | 121 · 3 · 0 | 29 Oct 2024

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, ..., Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
AAML | 98 · 1 · 0 | 28 Oct 2024

Vulnerability of LLMs to Vertically Aligned Text Manipulations
Zhecheng Li, Yijiao Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-Wei Chang
141 · 1 · 0 | 26 Oct 2024

An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner
95 · 0 · 0 | 25 Oct 2024

ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, J. Gao
135 · 2 · 0 | 24 Oct 2024

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Ziqiang Liu, Lirong Qiu, Xi Zhang
AAML | 66 · 3 · 0 | 18 Oct 2024

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Sihang Li, Yongbin Li
162 · 10 · 0 | 17 Oct 2024

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song
AI4CE | 102 · 2 · 0 | 16 Oct 2024

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo
AAML | 145 · 1 · 0 | 15 Oct 2024

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
173 · 1 · 0 | 09 Oct 2024

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
112 · 14 · 0 | 09 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
89 · 13 · 0 | 09 Oct 2024

Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha
AAML | 96 · 0 · 0 | 05 Oct 2024

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, Yi Zeng
AAML | 118 · 4 · 0 | 03 Oct 2024

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
231 · 6 · 0 | 02 Oct 2024

Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang
AAML | 177 · 8 · 0 | 02 Oct 2024

Robust LLM safeguarding via refusal feature adversarial training
L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
AAML | 150 · 19 · 0 | 30 Sep 2024

Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu, Zhixin Lai, Jiawen Wang, Gengyuan Zhang, Shuo Chen, Philip Torr, Vera Demberg, Volker Tresp, Jindong Gu
73 · 5 · 0 | 27 Sep 2024

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi
SILM, AAML | 167 · 11 · 0 | 23 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML | 121 · 2 · 0 | 05 Sep 2024

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
AAML | 138 · 18 · 0 | 01 Sep 2024

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, ..., Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
ELM, LM&MA | 102 · 6 · 0 | 07 Aug 2024

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models
Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li
AAML, ELM, SILM | 146 · 9 · 0 | 05 Aug 2024

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
142 · 36 · 0 | 16 Jul 2024

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
115 · 32 · 0 | 12 Jul 2024

Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Riccardo Cantini, Giada Cosenza, A. Orsino, Domenico Talia
AAML | 126 · 7 · 0 | 11 Jul 2024

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
Yibo Miao, Yifan Zhu, Yinpeng Dong, Lijia Yu, Jun Zhu, Xiao-Shan Gao
EGVM | 127 · 20 · 0 | 08 Jul 2024

Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
49 · 5 · 0 | 01 Jul 2024

Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Qi Qian, Xiaoqing Zheng, Xuanjing Huang
AAML, LRM | 106 · 4 · 0 | 01 Jul 2024

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang
LRM, HILM | 143 · 6 · 0 | 01 Jul 2024