Baseline Defenses for Adversarial Attacks Against Aligned Language Models (arXiv:2309.00614)
1 September 2023
Neel Jain
Avi Schwarzschild
Yuxin Wen
Gowthami Somepalli
John Kirchenbauer
Ping-yeh Chiang
Micah Goldblum
Aniruddha Saha
Jonas Geiping
Tom Goldstein
AAML
Papers citing "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"
Showing 50 of 269 citing papers
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
Jiawei Zhao
Kejiang Chen
Xiaojian Yuan
Weiming Zhang
AAML
31
2
0
15 Aug 2024
Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
Xiongtao Sun
Deyue Zhang
Dongdong Yang
Quanchen Zou
Hui Li
AAML
34
11
0
08 Aug 2024
EnJa: Ensemble Jailbreak on Large Language Models
Jiahao Zhang
Zilong Wang
Ruofan Wang
Xingjun Ma
Yu-Gang Jiang
AAML
34
1
0
07 Aug 2024
Practical Attacks against Black-box Code Completion Engines
Slobodan Jenko
Jingxuan He
Niels Mündler
Mark Vero
Martin Vechev
ELM
AAML
SILM
27
3
0
05 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su
Mingyu Lee
SangKeun Lee
43
8
0
02 Aug 2024
Can LLMs be Fooled? Investigating Vulnerabilities in LLMs
Sara Abdali
Jia He
C. Barberan
Richard Anarfi
36
7
0
30 Jul 2024
Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen
Jihan Yao
Shangbin Feng
Chenjun Xu
Yulia Tsvetkov
Bill Howe
Lucy Lu Wang
56
11
0
25 Jul 2024
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin
Rongchang Li
Xun Wang
Changting Lin
Wenpeng Xing
Meng Han
60
3
0
23 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
54
10
0
20 Jul 2024
Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models
Zhuo Chen
Jiawei Liu
Haotan Liu
Qikai Cheng
Wei Lu
Xiaozhong Liu
AAML
36
6
0
18 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
50
27
0
16 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
45
19
0
12 Jul 2024
Defending Code Language Models against Backdoor Attacks with Deceptive Cross-Entropy Loss
Guang Yang
Yu Zhou
Xiang Chen
Xiangyu Zhang
Terry Yue Zhuo
David Lo
Taolue Chen
AAML
52
4
0
12 Jul 2024
R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Bo-wen Li
LRM
40
12
0
08 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
36
80
0
05 Jul 2024
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
Hannah Brown
Leon Lin
Kenji Kawaguchi
Michael Shieh
AAML
75
6
0
03 Jul 2024
Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning
Simon Ostermann
Kevin Baum
Christoph Endres
Julia Masloh
P. Schramowski
AAML
54
1
0
03 Jul 2024
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy
Pedro M. Esperança
Silviu Vlad Oprea
Mete Ozay
KELM
36
2
0
03 Jul 2024
Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang
Xiaohua Wang
Feiran Zhang
Zhibo Xu
Cenyuan Zhang
Xiaoqing Zheng
Xuanjing Huang
AAML
LRM
37
4
0
01 Jul 2024
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou
Henry Peng Zou
Barbara Maria Di Eugenio
Yang Zhang
HILM
LRM
52
1
0
01 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
37
30
0
28 Jun 2024
Seeing Is Believing: Black-Box Membership Inference Attacks Against Retrieval Augmented Generation
Yongqian Li
Gaoyang Liu
Yang Yang
Chen Wang
AAML
33
3
0
27 Jun 2024
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang
Wanxu Zhao
Rui Zheng
Huijie Lv
Shihan Dou
...
Junjie Ye
Yuming Yang
Tao Gui
Qi Zhang
Xuanjing Huang
LLMSV
AAML
47
7
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
36
26
0
26 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
27
1
0
18 Jun 2024
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
Kenneth Li
Yiming Wang
Fernanda Viégas
Martin Wattenberg
38
6
0
17 Jun 2024
Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications
Stephen Burabari Tete
39
7
0
16 Jun 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
Pengfei He
Han Xu
Yue Xing
Makoto Yamada
Hui Liu
Jiliang Tang
34
10
0
16 Jun 2024
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
Xuan Chen
Yuzhou Nie
Lu Yan
Yunshu Mao
Wenbo Guo
Xiangyu Zhang
28
7
0
13 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
47
10
0
13 Jun 2024
Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey
Shang Wang
Tianqing Zhu
Bo Liu
Ming Ding
Xu Guo
Dayong Ye
Wanlei Zhou
Philip S. Yu
PILM
67
17
0
12 Jun 2024
Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Avital Shafran
R. Schuster
Vitaly Shmatikov
46
27
0
09 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
76
8
0
08 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
44
72
0
06 Jun 2024
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu
Hai Yan
Zenghui Yuan
Jiawen Shi
Wenqi Wei
Pin-Yu Chen
Pan Zhou
AAML
52
8
0
06 Jun 2024
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
Amelia Kawasaki
Andrew Davis
Houssam Abbas
AAML
KELM
32
2
0
05 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
30
0
0
04 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng
Yongjian Guo
Changzhou Han
Wanlun Ma
Junwu Xiong
Sheng Wen
Yang Xiang
44
23
0
04 Jun 2024
Safeguarding Large Language Models: A Survey
Yi Dong
Ronghui Mu
Yanghao Zhang
Siqi Sun
Tianle Zhang
...
Yi Qi
Jinwei Hu
Jie Meng
Saddek Bensalem
Xiaowei Huang
OffRL
KELM
AILaw
35
19
0
03 Jun 2024
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
AAML
68
29
0
03 Jun 2024
Exploring Vulnerabilities and Protections in Large Language Models: A Survey
Frank Weizhen Liu
Chenhui Hu
AAML
37
7
0
01 Jun 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
40
18
0
31 May 2024
Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
Harsh Chaudhari
Giorgio Severi
John Abascal
Matthew Jagielski
Christopher A. Choquette-Choo
Milad Nasr
Cristina Nita-Rotaru
Alina Oprea
SILM
AAML
77
28
0
30 May 2024
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong
Xiangyu Qi
Pin-Yu Chen
Tsung-Yi Ho
AAML
34
19
0
30 May 2024
AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
Jiawei Chen
Xiao Yang
Zhengwei Fang
Yu Tian
Yinpeng Dong
Zhaoxia Yin
Hang Su
27
1
0
30 May 2024
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi
Yangsibo Huang
Yi Zeng
Edoardo Debenedetti
Jonas Geiping
...
Chaowei Xiao
Bo-wen Li
Dawn Song
Peter Henderson
Prateek Mittal
AAML
51
11
0
29 May 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang
Yuyang Wu
Zeming Wei
Stefanie Jegelka
Yisen Wang
LRM
44
13
0
28 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
63
12
0
28 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
51
17
0
25 May 2024
Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux
Alessandro Sordoni
Stephan Günnemann
Gauthier Gidel
Leo Schwinn
AAML
42
45
0
24 May 2024