Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023 · arXiv: 2307.15043
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Links: arXiv (abs) · PDF · HTML · GitHub (3,937★)
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (showing 50 of 1,101):
- Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models · Dhruv Dhamani, Mary Lou Maher · LLMAG · 14 Jan 2025
- Lessons From Red Teaming 100 Generative AI Products · Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, ..., Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Chang Kawaguchi, Mark Russinovich · AAML, VLM · 13 Jan 2025
- Safeguarding System Prompts for LLMs · Zhifeng Jiang, Zhihua Jin, Guoliang He · AAML, SILM · 10 Jan 2025
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue · Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, ..., Hang Su, Jialing Tao, Hui Xue, Jun Zhu, Hui Xue · LLMAG · 08 Jan 2025
- ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates · Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran · SILM · 08 Jan 2025
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense · Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, B. Kailkhura, Tianlong Chen, Kaixiong Zhou · AAML · 05 Jan 2025
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation · Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang · 03 Jan 2025
- LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models · Miao Yu, Sihang Li, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen · AAML · 03 Jan 2025
- Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines · Xiyang Hu · AAML · 01 Jan 2025
- B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions · Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang, Kai Zhang · VLM, AAML · 31 Dec 2024
- GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search · Matan Ben-Tov, Mahmood Sharif · RALM · 31 Dec 2024
- Enhancing AI Safety Through the Fusion of Low Rank Adapters · Satya Swaroop Gudipudi, Sreeram Vipparla, Harpreet Singh, Shashwat Goel, Ponnurangam Kumaraguru · MoMe, AAML · 30 Dec 2024
- Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning · Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng · AAML · 24 Dec 2024
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak · Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Changzai Pan, Minlie Huang, Lei Sha · 23 Dec 2024
- Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs · Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, Marvin von Hagen · 22 Dec 2024
- The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents · Feiran Jia, Tong Wu, Xin Qin, Anna Squicciarini · LLMAG, AAML · 21 Dec 2024
- Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context · Nilanjana Das, Edward Raff, Aman Chadha, Manas Gaur · AAML · 20 Dec 2024
- SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage · Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He · 19 Dec 2024
- Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation · Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, ..., Hyoje Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun · AAML · 18 Dec 2024
- Adversarial Hubness in Multi-Modal Retrieval · Tingwei Zhang, Fnu Suya, Rishi Jha, Collin Zhang, Vitaly Shmatikov · AAML · 18 Dec 2024
- Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing · Keltin Grimes, Marco Christiani, David Shriver, Marissa Connor · KELM · 17 Dec 2024
- Jailbreaking? One Step Is Enough! · Weixiong Zheng, Peijian Zeng, Yuchen Li, Hongyan Wu, Nankai Lin, Jianfei Chen, Aimin Yang, Yimiao Zhou · AAML · 17 Dec 2024
- LLMs Can Simulate Standardized Patients via Agent Coevolution · Zhuoyun Du, Lujie Zheng, Renjun Hu, Yuyang Xu, Xiaochen Li, Ying Sun, Wei Chen, Jian Wu, Haolei Cai, Haohao Ying · LM&MA · 16 Dec 2024
- The Superalignment of Superhuman Intelligence with Large Language Models · Minlie Huang, Yingkang Wang, Shiyao Cui, Pei Ke, J. Tang · 15 Dec 2024
- No Free Lunch for Defending Against Prefilling Attack by In-Context Learning · Zhiyu Xue, Guangliang Liu, Bocheng Chen, K. Johnson, Ramtin Pedarsani · AAML · 13 Dec 2024
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks · Bocheng Chen, Hanqing Guo, Qiben Yan · AAML · 10 Dec 2024
- On Evaluating the Durability of Safeguards for Open-Weight LLMs · Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson · AAML · 10 Dec 2024
- PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips · Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant J. Nair, Bo Fang, Sanghyun Hong · AAML · 10 Dec 2024
- Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation · Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian · AAML · 05 Dec 2024
- Time-Reversal Provides Unsupervised Feedback to LLMs · Yerram Varun, Rahul Madhavan, Sravanti Addepalli, A. Suggala, Karthikeyan Shanmugam, Prateek Jain · LRM, SyDa · 03 Dec 2024
- Improved Large Language Model Jailbreak Detection via Pretrained Embeddings · Erick Galinkin, Martin Sablotny · 02 Dec 2024
- Yi-Lightning Technical Report · 01.AI: Alan Wake, Albert Wang, Bei Chen, ..., Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, Zonghong Dai · OSLM · 02 Dec 2024
- Quantized Delta Weight Is Safety Keeper · Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang · 29 Nov 2024
- On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code · Md. Imran Hossen, X. Hei · AAML, ELM · 29 Nov 2024
- PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning · Shenghui Li, Edith C.H. Ngai, Fanghua Ye, Thiemo Voigt · SILM · 28 Nov 2024
- Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication · Davide Garassino, Viviana Masia, Nicola Brocca, Alice Delorme Benites · 27 Nov 2024
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models · Shuyang Hao, Bryan Hooi, Qingbin Liu, Kai-Wei Chang, Zi Huang, Yujun Cai · AAML · 27 Nov 2024
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models · Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen · DiffM · 25 Nov 2024
- RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks · Changyue Jiang, Xudong Pan, Geng Hong, Chenfu Bao, Min Yang · SILM · 21 Nov 2024
- Global Challenge for Safe and Secure LLMs Track 1 · Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, ..., Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao · ELM, AILaw · 21 Nov 2024
- Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation · Ke Zhao, Huayang Huang, Miao Li, Yu Wu · AAML · 21 Nov 2024
- CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization · Nay Myat Min, Long H. Pham, Yige Li, Jun Sun · AAML · 18 Nov 2024
- Steering Language Model Refusal with Sparse Autoencoders · Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh · LLMSV · 18 Nov 2024
- JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit · Zeqing He, Peng Kuang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen · 17 Nov 2024
- Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations · Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Michael Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti · MLLM, 3DH · 15 Nov 2024
- Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey · Xuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran He · AAML · 14 Nov 2024
- DROJ: A Prompt-Driven Attack against Large Language Models · Leyang Hu, Boran Wang · 14 Nov 2024
- New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook · Meng Yang, Tianqing Zhu, Chi Liu, Wanlei Zhou, Shui Yu, Philip S. Yu · AAML, ELM, PILM · 12 Nov 2024
- HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment · Yannis Belkhiter, Giulio Zizzo, S. Maffeis · 11 Nov 2024
- A Survey of AI-Related Cyber Security Risks and Countermeasures in Mobility-as-a-Service · Kai-Fung Chu, Haiyue Yuan, Jinsheng Yuan, Weisi Guo, Nazmiye Balta-Ozkan, Shujun Li · 08 Nov 2024