Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2309.01446
Cited By
Open Sesame! Universal Black Box Jailbreaking of Large Language Models
4 September 2023
Raz Lapid
Ron Langberg
Moshe Sipper
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Open Sesame! Universal Black Box Jailbreaking of Large Language Models"
50 / 82 papers shown
Title
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
76
0
0
23 Apr 2025
A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González
Andrés Herrera-Poyatos
Cristina Zuheros
David Herrera-Poyatos
Virilo Tejedor
F. Herrera
AAML
24
0
0
07 Apr 2025
Don't Lag, RAG: Training-Free Adversarial Detection Using RAG
Roie Kazoom
Raz Lapid
Moshe Sipper
Ofer Hadar
VLM
ObjD
AAML
57
0
0
07 Apr 2025
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong
Seyun Bae
Yeonsung Jung
Jaeryong Hwang
Eunho Yang
AAML
43
1
0
26 Mar 2025
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou
Kevin E. Wu
Francesco Pinto
Z. Chen
Yi Zeng
Yu Yang
Shuang Yang
Sanmi Koyejo
James Zou
Bo Li
LLMAG
AAML
77
0
0
20 Mar 2025
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Dillon Bowen
Ann-Kathrin Dombrowski
Adam Gleave
Chris Cundy
ELM
50
0
0
17 Mar 2025
JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Shuyi Liu
Simiao Cui
Haoran Bu
Yuming Shang
Xi Zhang
ELM
67
0
0
26 Feb 2025
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang
Kwan Ho Ryan Chan
D. Thaker
Jinqi Luo
René Vidal
AAML
46
0
0
05 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
112
10
0
28 Jan 2025
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther
Adish Singla
Goran Radanović
AAML
57
0
0
14 Jan 2025
Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia
Yihao Huang
Yang Liu
Peng Yan Tan
Weng Kuan Yau
...
Yan Wang
Rick Siow Mong Goh
Liangli Zhen
Yingjie Zhang
Zhe Zhao
ELM
AILaw
74
0
0
21 Nov 2024
DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu
Boran Wang
34
0
0
14 Nov 2024
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao
Daniel Ben-Levi
Wei Hao
Junfeng Yang
Chengzhi Mao
AAML
149
0
0
06 Nov 2024
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang
Haoqin Tu
J. Mei
Bingchen Zhao
Yalin Wang
Cihang Xie
32
5
0
11 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
47
8
0
09 Oct 2024
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu
Xiaoxin He
Miao Xiong
Jinlan Fu
Shumin Deng
Bryan Hooi
AAML
34
12
0
02 Oct 2024
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
Lijia Lv
Weigang Zhang
Xuehai Tang
Jie Wen
Feng Liu
Jizhong Han
Songlin Hu
AAML
29
2
0
11 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
57
1
0
05 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An
Sicheng Zhu
Ruiyi Zhang
Michael-Andrei Panaitescu-Liess
Yuancheng Xu
Furong Huang
AAML
42
13
0
01 Sep 2024
On the Robustness of Kolmogorov-Arnold Networks: An Adversarial Perspective
Tal Alter
Raz Lapid
Moshe Sipper
AAML
62
6
0
25 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su
Mingyu Lee
SangKeun Lee
43
8
0
02 Aug 2024
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu
Wenhui Zhang
Zhibo Wang
Feng Xiao
Rui Zheng
Yunhe Feng
Zhongjie Ba
Kui Ren
AAML
LLMAG
34
11
0
23 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
50
27
0
16 Jul 2024
Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Riccardo Cantini
Giada Cosenza
A. Orsino
Domenico Talia
AAML
57
5
0
11 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
39
80
0
05 Jul 2024
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
Xiaotian Zou
Ke Li
Yongkang Chen
MLLM
42
2
0
01 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
36
26
0
26 Jun 2024
Large Language Models as Surrogate Models in Evolutionary Algorithms: A Preliminary Study
Hao Hao
Xiaoqun Zhang
Aimin Zhou
ELM
38
9
0
15 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
47
10
0
13 Jun 2024
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu
Hai Yan
Zenghui Yuan
Jiawen Shi
Wenqi Wei
Pin-Yu Chen
Pan Zhou
AAML
52
8
0
06 Jun 2024
Safeguarding Large Language Models: A Survey
Yi Dong
Ronghui Mu
Yanghao Zhang
Siqi Sun
Tianle Zhang
...
Yi Qi
Jinwei Hu
Jie Meng
Saddek Bensalem
Xiaowei Huang
OffRL
KELM
AILaw
37
19
0
03 Jun 2024
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
AAML
68
29
0
03 Jun 2024
Exploring Vulnerabilities and Protections in Large Language Models: A Survey
Frank Weizhen Liu
Chenhui Hu
AAML
37
7
0
01 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia
Tianyu Pang
Chao Du
Yihao Huang
Jindong Gu
Yang Liu
Xiaochun Cao
Min-Bin Lin
AAML
52
22
0
31 May 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
40
18
0
31 May 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang
Yuyang Wu
Zeming Wei
Stefanie Jegelka
Yisen Wang
LRM
47
13
0
28 May 2024
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang
Qizhen Zhang
Jakob N. Foerster
AAML
40
18
0
13 May 2024
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Shang Shang
Xinqiang Zhao
Zhongjiang Yao
Yepeng Yao
Liya Su
Zijing Fan
Xiaodan Zhang
Zhengwei Jiang
55
4
0
06 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang
Haoyu Bu
Hui Wen
Yu Chen
Lun Li
Hongsong Zhu
45
36
0
06 May 2024
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova
James Zou
AAML
33
4
0
26 Apr 2024
Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors
Raz Lapid
Almog Dubin
Moshe Sipper
AAML
22
4
0
18 Apr 2024
Ethical Framework for Responsible Foundational Models in Medical Imaging
Abhijit Das
Debesh Jha
Jasmer Sanjotra
Onkar Susladkar
Suramyaa Sarkar
A. Rauniyar
Nikhil Tomar
Vanshali Sharma
Ulas Bagci
MedIm
82
0
0
14 Apr 2024
Exploring the True Potential: Evaluating the Black-box Optimization Capability of Large Language Models
Beichen Huang
Xingyu Wu
Yu Zhou
Jibin Wu
Liang Feng
Ran Cheng
Kay Chen Tan
58
12
0
09 Apr 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
95
160
0
02 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao
Edoardo Debenedetti
Alexander Robey
Maksym Andriushchenko
Francesco Croce
...
Nicolas Flammarion
George J. Pappas
F. Tramèr
Hamed Hassani
Eric Wong
ALM
ELM
AAML
57
96
0
28 Mar 2024
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi
Zenghui Yuan
Yinuo Liu
Yue Huang
Pan Zhou
Lichao Sun
Neil Zhenqiang Gong
AAML
45
39
0
26 Mar 2024
XAI-Based Detection of Adversarial Attacks on Deepfake Detectors
Ben Pinhasov
Raz Lapid
Rony Ohayon
Moshe Sipper
Y. Aperstein
AAML
35
7
0
05 Mar 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan
Sharath Chandra Raparthy
Andrei Lupu
Eric Hambro
Aram H. Markosyan
...
Minqi Jiang
Jack Parker-Holder
Jakob Foerster
Tim Rocktaschel
Roberta Raileanu
SyDa
80
62
0
26 Feb 2024
Multi-Bit Distortion-Free Watermarking for Large Language Models
Massieh Kordi Boroujeny
Ya Jiang
Kai Zeng
Brian L. Mark
WaLM
VLM
43
4
0
26 Feb 2024
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji
Bairu Hou
Alexander Robey
George J. Pappas
Hamed Hassani
Yang Zhang
Eric Wong
Shiyu Chang
AAML
47
39
0
25 Feb 2024
1
2
Next