Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
arXiv: 2310.06387 · 10 October 2023
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
Papers citing "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations" (50 of 191 shown)
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Aidan Wong, He Cao, Zijing Liu, Yu Li · 21 Oct 2024

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao-Li Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu · 20 Oct 2024 · AAML

SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao · 17 Oct 2024 · AAML, SILM
Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning
Huiwen Wu, Xiaohan Li, Xiaogang Xu, Deyi Zhang, Zhe Liu · 16 Oct 2024 · MLLM, CLL, VLM

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo · 15 Oct 2024 · AAML
Cognitive Overload Attack: Prompt Injection for Long Context
Bibek Upadhayay, Vahid Behzadan, Amin Karbasi · 15 Oct 2024 · AAML

Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang · 14 Oct 2024

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, J. Mei, Bingchen Zhao, Yanjie Wang, Cihang Xie · 11 Oct 2024
On the Adversarial Transferability of Generalized "Skip Connections"
Yisen Wang, Yichuan Mo, Dongxian Wu, Mingjie Li, Xingjun Ma, Zhouchen Lin · 11 Oct 2024 · AAML

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang, Xiaogeng Liu, Chaowei Xiao · 11 Oct 2024 · AAML

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min-Bin Lin · 09 Oct 2024
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou · 09 Oct 2024

Non-Halting Queries: Exploiting Fixed Points in LLMs
Ghaith Hammouri, Kemal Derya, B. Sunar · 08 Oct 2024

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang He, Yi Zeng · 05 Oct 2024
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo · 04 Oct 2024

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, Yi Zeng · 03 Oct 2024 · AAML

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao · 03 Oct 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang · 02 Oct 2024

Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang · 02 Oct 2024 · AAML

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration
Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li · 17 Sep 2024
Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, S. Jha, Peyman Najafirad · 17 Sep 2024 · AAML

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu · 11 Sep 2024 · AAML

MILE: A Mutation Testing Framework of In-Context Learning Systems
Zeming Wei, Yihao Zhang, Meng Sun · 07 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang · 05 Sep 2024 · PILM, AAML

EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang · 21 Aug 2024 · AAML

Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao · 20 Aug 2024 · AAML, ELM, MU
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Donghai Hong, Jiaying Chen, Wenhai Wang · 18 Aug 2024

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang · 15 Aug 2024 · AAML

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee · 02 Aug 2024
Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, ..., Xifeng Yan, William Wang, Philip Torr, Dawn Song, Kai Shu · 29 Jul 2024 · KELM

Polynomial Regression as a Task for Understanding In-context Learning Through Finetuning and Alignment
Max Wilcoxson, Morten Svendgård, Ria Doshi, Dylan Davis, Reya Vir, Anant Sahai · 27 Jul 2024

Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang · 25 Jul 2024
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun · 23 Jul 2024 · SILM

PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan · 23 Jul 2024

LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han · 23 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan · 20 Jul 2024

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion · 16 Jul 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li · 05 Jul 2024 · AAML
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay · 03 Jul 2024 · KELM

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou, Henry Peng Zou, Barbara Maria Di Eugenio, Yang Zhang · 01 Jul 2024 · HILM, LRM

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang, Wanxu Zhao, Rui Zheng, Huijie Lv, Shihan Dou, ..., Junjie Ye, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang · 26 Jun 2024 · LLMSV, AAML
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang · 26 Jun 2024 · PILM

ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, Bhuwan Dhingra · 23 Jun 2024 · MIALM

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei · 21 Jun 2024
Prompt Injection Attacks in Defended Systems
Daniil Khomsky, Narek Maloyan, Bulat Nutfullin · 20 Jun 2024 · AAML, SILM

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva · 17 Jun 2024 · KELM, MU

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, ..., Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao · 17 Jun 2024 · VLM
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang · 13 Jun 2024

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang · 13 Jun 2024 · ELM

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey
Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Xu Guo, Dayong Ye, Wanlei Zhou, Philip S. Yu · 12 Jun 2024 · PILM