arXiv: 2310.03684
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
5 October 2023 | AAML
Papers citing "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" (36 of 186 papers shown)

Title | Authors | Tags | Metrics | Date
Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang | AAML, SILM | 49 / 25 / 0 | 09 Feb 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao | ELM | 30 / 87 / 0 | 07 Feb 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Andy Zhou, Bo Li, Haohan Wang | AAML | 49 / 74 / 0 | 30 Jan 2024
Security and Privacy Challenges of Large Language Models: A Survey | B. Das, M. H. Amini, Yanzhao Wu | PILM, ELM | 21 / 108 / 0 | 30 Jan 2024
Red-Teaming for Generative AI: Silver Bullet or Security Theater? | Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari | AAML | 42 / 68 / 0 | 29 Jan 2024
Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models | Yunhong He, Jianling Qiu, Wei Zhang, Zhe Yuan | - | 32 / 3 / 0 | 27 Jan 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao | LLMAG | 22 / 31 / 0 | 22 Jan 2024
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks | Kazuhiro Takemoto | - | 42 / 21 / 0 | 18 Jan 2024
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang | ELM | 47 / 12 / 0 | 17 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi | - | 18 / 257 / 0 | 12 Jan 2024
Intention Analysis Makes LLMs A Good Jailbreak Defender | Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao | LLMSV | 30 / 19 / 0 | 12 Jan 2024
Malla: Demystifying Real-world Large Language Model Integrated Malicious Services | Zilong Lin, Jian Cui, Xiaojing Liao, Xiaofeng Wang | - | 32 / 20 / 0 | 06 Jan 2024
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Renjie Pi, Tianyang Han, Jianshu Zhang, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang | AAML | 31 / 60 / 0 | 05 Jan 2024
Jatmo: Prompt Injection Defense by Task-Specific Finetuning | Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner | AAML, SyDa | 83 / 53 / 0 | 29 Dec 2023
Exploring Transferability for Randomized Smoothing | Kai Qiu, Huishuai Zhang, Zhirong Wu, Stephen Lin | AAML | 26 / 1 / 0 | 14 Dec 2023
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang | PILM, ELM | 54 / 476 / 0 | 04 Dec 2023
Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles | Sonali Singh, Faranak Abri, A. Namin | - | 40 / 15 / 0 | 24 Nov 2023
Transfer Attacks and Defenses for Large Language Models on Coding Tasks | Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina S. Pasareanu | AAML, SILM | 29 / 1 / 0 | 22 Nov 2023
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information | Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng-Chiao Huang, Vishy Swaminathan | - | 32 / 23 / 0 | 20 Nov 2023
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, Muhao Chen | - | 34 / 55 / 0 | 16 Nov 2023
Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework | Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, T. Strzalkowski, Mei Si | AAML | 30 / 24 / 0 | 16 Nov 2023
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang | AAML | 28 / 116 / 0 | 15 Nov 2023
Can LLMs Follow Simple Rules? | Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner | ALM | 31 / 27 / 0 | 06 Nov 2023
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun | SILM, AAML | 30 / 41 / 0 | 23 Oct 2023
Privacy in Large Language Models: Attacks, Defenses and Future Directions | Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song | PILM | 50 / 42 / 0 | 16 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries | Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong | AAML | 61 / 589 / 0 | 12 Oct 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing | SILM | 119 / 307 / 0 | 19 Sep 2023
Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Raz Lapid, Ron Langberg, Moshe Sipper | AAML | 27 / 106 / 0 | 04 Sep 2023
Adversarial Training Should Be Cast as a Non-Zero-Sum Game | Alexander Robey, Fabian Latorre, George J. Pappas, Hamed Hassani, V. Cevher | AAML | 66 / 12 / 0 | 19 Jun 2023
Improving alignment of dialogue agents via targeted human judgements | Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving | ALM, AAML | 239 / 506 / 0 | 28 Sep 2022
Certified Defences Against Adversarial Patch Attacks on Semantic Segmentation | Maksym Yatsura, K. Sakmann, N. G. Hua, Matthias Hein, J. H. Metzen | AAML | 52 / 17 / 0 | 13 Sep 2022
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 372 / 12,081 / 0 | 04 Mar 2022
Evaluating the Adversarial Robustness of Adaptive Test-time Defenses | Francesco Croce, Sven Gowal, T. Brunner, Evan Shelhamer, Matthias Hein, A. Cemgil | TTA, AAML | 181 / 68 / 0 | 28 Feb 2022
Adversarial Robustness with Semi-Infinite Constrained Learning | Alexander Robey, Luiz F. O. Chamon, George J. Pappas, Hamed Hassani, Alejandro Ribeiro | AAML, OOD | 118 / 43 / 0 | 29 Oct 2021
RobustBench: a standardized adversarial robustness benchmark | Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, M. Chiang, Prateek Mittal, Matthias Hein | VLM | 234 / 680 / 0 | 19 Oct 2020
Generating Natural Language Adversarial Examples | M. Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang | AAML | 258 / 916 / 0 | 21 Apr 2018