Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.17263
Cited By
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
30 January 2024
Andy Zhou
Bo Li
Haohan Wang
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
26 / 26 papers shown
Title
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
Sijia Chen
Xiaomin Li
Mengxue Zhang
Eric Hanchen Jiang
Qingcheng Zeng
Chen-Hsiang Yu
AAML
MU
ELM
27
0
0
16 May 2025
OET: Optimization-based prompt injection Evaluation Toolkit
Jinsheng Pan
Xiaogeng Liu
Chaowei Xiao
AAML
69
0
0
01 May 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models
Egor Zverev
Evgenii Kortukov
Alexander Panfilov
Soroush Tabesh
Alexandra Volkova
Sebastian Lapuschkin
Wojciech Samek
Christoph H. Lampert
AAML
54
1
0
13 Mar 2025
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng
Xiaolong Jin
Jinyuan Jia
Xiaotian Zhang
AAML
149
0
0
27 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
V. Cevher
AAML
51
0
0
24 Feb 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
67
0
0
24 Feb 2025
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang
Fenghua Weng
Songlin Yang
Zhan Qin
Minlie Huang
Wenjie Wang
KELM
AAML
53
0
0
17 Feb 2025
Jailbreaking to Jailbreak
Jeremy Kritz
Vaughn Robinson
Robert Vacareanu
Bijan Varjavand
Michael Choi
Bobby Gogov
Scale Red Team
Summer Yue
Willow Primack
Zifan Wang
209
1
0
09 Feb 2025
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong
Wenbo Hu
Wei Xu
Tianxing He
72
0
0
19 Dec 2024
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao
Xiang Zheng
Lin Luo
Yige Li
Xingjun Ma
Yu-Gang Jiang
VLM
AAML
62
3
0
28 Oct 2024
Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
Ankita Sinha
Wendi Cui
Kamalika Das
Jiaxin Zhang
AAML
28
2
0
12 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
133
2
0
09 Oct 2024
Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
81
5
0
02 Oct 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
51
41
0
01 Aug 2024
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
Federico Errica
G. Siracusano
D. Sanvito
Roberto Bifulco
80
19
0
18 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
47
10
0
13 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
76
8
0
08 Jun 2024
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev
Sahar Abdelnabi
Soroush Tabesh
Mario Fritz
Christoph H. Lampert
56
19
0
11 Mar 2024
MultiRobustBench: Benchmarking Robustness Against Multiple Attacks
Sihui Dai
Saeed Mahloujifar
Chong Xiang
Vikash Sehwag
Pin-Yu Chen
Prateek Mittal
AAML
OOD
29
7
0
21 Feb 2023
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
255
2,494
0
06 Oct 2022
Diffusion Models for Adversarial Purification
Weili Nie
Brandon Guo
Yujia Huang
Chaowei Xiao
Arash Vahdat
Anima Anandkumar
WIGM
218
418
0
16 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,815
0
14 Dec 2020
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
286
1,595
0
18 Sep 2019
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
223
616
0
03 Sep 2019
Generating Natural Language Adversarial Examples
M. Alzantot
Yash Sharma
Ahmed Elgohary
Bo-Jhang Ho
Mani B. Srivastava
Kai-Wei Chang
AAML
245
914
0
21 Apr 2018
1