arXiv: 2310.06387
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
10 October 2023
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
Papers citing "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations"
50 / 80 papers shown
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models. Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li. 19 Jun 2025. [AAML]
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs. Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara. 11 Jun 2025.
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following. Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li. 11 Jun 2025. [OffRL]
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges. Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li. 09 Jun 2025. [AAML]
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts. T. Krauß, Hamid Dashtbani, Alexandra Dmitrienko. 09 Jun 2025.
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations. Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani. 08 Jun 2025. [AAML]
A Trustworthiness-based Metaphysics of Artificial Intelligence Systems. Andrea Ferrario. 03 Jun 2025.
Learning Safety Constraints for Large Language Models. Xin Chen, Yarden As, Andreas Krause. 30 May 2025.
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap. Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman. 30 May 2025. [MLLM]
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities. Sahil Verma, Keegan E. Hines, J. Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh. 29 May 2025. [AAML]
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts. H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim. 26 May 2025.
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs. Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee. 26 May 2025. [AAML]
An Embarrassingly Simple Defense Against LLM Abliteration Attacks. Harethah Shairah, Hasan Hammoud, Bernard Ghanem, G. Turkiyyah. 25 May 2025.
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang. 24 May 2025. [AAML]
Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives. Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu. 23 May 2025. [MoMe]
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models. Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang. 23 May 2025. [AAML]
SPIRIT: Patching Speech Language Models against Jailbreak Attacks. Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, Nils Lukas. 18 May 2025. [AAML]
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization. Yidan Wang, Yanan Cao, Yubing Ren, Fang Fang, Zheng Lin, Binxing Fang. 15 May 2025. [PILM]
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models. Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin. 12 May 2025.
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs. Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang. 03 May 2025. [AAML]
Transferable Adversarial Attacks on Black-Box Vision-Language Models. Kai Hu, Weichen Yu, Lefei Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson. 02 May 2025. [AAML, VLM]
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control. Hannah Cyberey, David Evans. 23 Apr 2025. [LLMSV]
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks. Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri. 22 Apr 2025. [AAML]
Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents. Juhee Kim, Woohyuk Choi, Byoungyoung Lee. 17 Mar 2025. [LLMAG]
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks. Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou. 05 Mar 2025. [AAML]
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs. Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xinsong Zhang. 27 Feb 2025. [AAML]
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen. 26 Feb 2025. [AAML]
On the Robustness of Transformers against Context Hijacking for Linear Classification. Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou. 24 Feb 2025.
GuidedBench: Equipping Jailbreak Evaluation with Guidelines. Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang. 24 Feb 2025. [ALM, ELM]
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention. Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan. 21 Feb 2025. [AAML]
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos. Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, Yun Wang. 19 Feb 2025. [LRM]
SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning. Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Qingbin Liu, Xuming Hu. 18 Feb 2025. [MU]
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models. Shehel Yoosuf, Temoor Ali, Ahmed Lekssays, Mashael Alsabah, Issa M. Khalil. 17 Feb 2025.
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation. Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang. 11 Feb 2025. [AAML]
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search. Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang. 28 Jan 2025.
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment. Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls. 22 Jan 2025. [AAML]
Episodic memory in AI agents poses risks that should be studied and mitigated. Chad DeChant. 20 Jan 2025.
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates. Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora. 20 Jan 2025. [ALM]
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang. 03 Jan 2025.
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models. Miao Yu, Sihang Li, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen. 03 Jan 2025. [AAML]
SQL Injection Jailbreak: A Structural Disaster of Large Language Models. Jiawei Zhao, Kejiang Chen, Weinan Zhang, Nenghai Yu. 03 Nov 2024. [AAML]
Defense Against Prompt Injection Attack by Leveraging Attack Techniques. Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Dekai Wu, Bryan Hooi. 01 Nov 2024. [SILM, AAML]
What is Wrong with Perplexity for Long-context Language Modeling? Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang. 31 Oct 2024.
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts. Vishal Kumar, Zeyi Liao, Jaylen Jones, Huan Sun. 29 Oct 2024. [AAML]
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring. Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, ..., Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che. 28 Oct 2024. [AAML]
LLMScan: Causal Scan for LLM Misbehavior Detection. Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang. 22 Oct 2024.
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation. Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo. 15 Oct 2024. [AAML]
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy. Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou. 09 Oct 2024.
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates. Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin. 09 Oct 2024.
Non-Halting Queries: Exploiting Fixed Points in LLMs. Ghaith Hammouri, Kemal Derya, B. Sunar. 08 Oct 2024.