arXiv:2307.15043
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson · 27 July 2023
Links: arXiv (abs) · PDF · HTML · GitHub (3937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (showing 50 of 1,101):

A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux, David Dobre, Mehrnaz Mohfakhami, Leo Schwinn, Gauthier Gidel · 22 Feb 2025

Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Yang Liu, Baowen Xu, Zhenyu Chen · AAML · 21 Feb 2025

Robust Concept Erasure Using Task Vectors
Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen · 21 Feb 2025

Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization
Fernando Spadea, Oshani Seneviratne · 21 Feb 2025

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan · AAML · 21 Feb 2025

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi · 21 Feb 2025

Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Guangyi Liu, Xiaowen Dong, Yanjie Wang, Yanfeng Wang, Tian Jin · SyDa, LLMAG · 21 Feb 2025

UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Vaidehi Patil, Elias Stengel-Eskin, Joey Tianyi Zhou · MU, CLL · 20 Feb 2025

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Haokun Chen, Sebastian Szyller, Weilin Xu, N. Himayat · MU, AAML · 20 Feb 2025

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha · AAML, LRM · 18 Feb 2025

SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Qingbin Liu, Xuming Hu · MU · 18 Feb 2025

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao · AAML, SILM · 18 Feb 2025

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen · 18 Feb 2025

Computational Safety for Generative AI: A Signal Processing Perspective
Pin-Yu Chen · 18 Feb 2025

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang · 18 Feb 2025

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models
Yue Xu, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang · 17 Feb 2025

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, Wenjie Wang · AAML · 17 Feb 2025

Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Yingshui Tan, Yilei Jiang, Yongbin Li, Qingbin Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng · ALM · 17 Feb 2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Yue Liu, Bill Yuchen Lin, Radha Poovendran · KELM, ELM, LRM · 17 Feb 2025

DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang, Fenghua Weng, Shangshang Yang, Zhan Qin, Minlie Huang, Wenjie Wang · KELM, AAML · 17 Feb 2025

Fast Proxies for LLM Robustness Evaluation
Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann · AAML · 14 Feb 2025

Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein · 14 Feb 2025

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks
Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, Micah Goldblum · AAML · 12 Feb 2025

Trustworthy AI: Safety, Bias, and Privacy -- A Survey
Xingli Fang, Jianwei Li, Varun Mulchandani, Jung-Eun Kim · 11 Feb 2025

Universal Adversarial Attack on Aligned Multimodal LLMs
Temurbek Rahmatullaev, Polina Druzhinina, Nikita Kurdiukov, Matvey Mikhalchuk, Andrey Kuznetsov, Anton Razzhigaev · AAML · 11 Feb 2025

DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities
Chashi Mahiul Islam, Samuel Jacob Chacko, Preston Horne, Xiuwen Liu · 11 Feb 2025

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang · AAML · 11 Feb 2025

Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails
Yijun Yang, L. Wang, Xiao Yang, Lanqing Hong, Jun Zhu · AAML · 09 Feb 2025

OntoTune: Ontology-Driven Self-training for Aligning Large Language Models
Zhiqiang Liu, Chengtao Gan, Junjie Wang, Yanzhe Zhang, Zhongpu Bo, Mengshu Sun, Ningyu Zhang, Wen Zhang · 08 Feb 2025

Extracting and Understanding the Superficial Knowledge in Alignment
Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Y. Han, Nina S. T. Hirata, Junyuan Hong, B. Kailkhura · 07 Feb 2025

Safety Reasoning with Guidelines
Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, Dacheng Tao · 06 Feb 2025

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, D. Thaker, Jinqi Luo, René Vidal · AAML · 05 Feb 2025

STAIR: Improving Safety Alignment with Introspective Reasoning
Yuanhang Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu · LRM, LLMSV · 04 Feb 2025

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando, Jie Zhang, Nicholas Carlini, F. Tramèr · AAML, ELM · 04 Feb 2025

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell · MU, AAML, ELM · 03 Feb 2025

Breaking Focus: Contextual Distraction Curse in Large Language Models
Yue Huang, Yanbo Wang, Zixiang Xu, Chujie Gao, Siyuan Wu, Jiayi Ye, Preslav Nakov, Pin-Yu Chen, Wei Wei · AAML · 03 Feb 2025

Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale
Elisa Tsai, Neal Mangaokar, Boyuan Zheng, Haizhong Zheng, A. Prakash · 03 Feb 2025

Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Ziyi Yin, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma · AAML · 02 Feb 2025
"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta
David Khachaturov
Robert D. Mullins
AAML
AuLLM
117
4
0
02 Feb 2025
Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba
Evgenia Nitishinskaya
Boaz Barak
Stephanie Lin
Sam Toyer
...
Rachel Dias
Eric Wallace
Kai Y. Xiao
Johannes Heidecke
Amelia Glaese
LRM
AAML
167
26
0
31 Jan 2025
Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models
Robin Young
57
0
0
28 Jan 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
210
18
0
28 Jan 2025
Smoothed Embeddings for Robust Language Models
Ryo Hase
Md Rafi Ur Rashid
Ashley Lewis
Jing Liu
T. Koike-Akino
K. Parsons
Yanjie Wang
AAML
116
2
0
27 Jan 2025
HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Zihui Wu
Haichang Gao
Jiacheng Luo
Zhaoxiang Liu
155
0
0
23 Jan 2025
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad
Huy Nghiem
Andy Luo
Sahil Wadhwa
Mohammad Sorower
Stephen Rawls
AAML
154
5
0
22 Jan 2025
An Empirically-grounded tool for Automatic Prompt Linting and Repair: A Case Study on Bias, Vulnerability, and Optimization in Developer Prompts
Dhia Elhaq Rzig
Dhruba Jyoti Paul
Kaiser Pister
Jordan Henkel
Foyzul Hassan
135
0
0
21 Jan 2025
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
Wuyuao Mai
Geng Hong
Pei Chen
Xudong Pan
Baojun Liu
Y. Zhang
Haixin Duan
Min Yang
AAML
127
1
0
21 Jan 2025
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora
ALM
133
59
0
20 Jan 2025
Differentiable Adversarial Attacks for Marked Temporal Point Processes
Pritish Chakraborty
Vinayak Gupta
R. Raj
Srikanta J. Bedathur
A. De
AAML
512
0
0
17 Jan 2025
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther
Adish Singla
Goran Radanović
AAML
163
0
0
14 Jan 2025