Universal and Transferable Adversarial Attacks on Aligned Language Models
arXiv:2307.15043 (v2, latest) · 27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Links: arXiv abstract · PDF · HTML · GitHub (3,937★)
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (showing 50 of 1,101)
Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
Saketh Bachu, Erfan Shayegani, Trishna Chakraborty, Rohit Lal, Arindam Dutta, Chengyu Song, Yue Dong, Nael B. Abu-Ghazaleh, Amit K. Roy-Chowdhury
06 Nov 2024

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao
06 Nov 2024 · Tags: AAML

Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh
05 Nov 2024

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Aiwei Liu, Xuming Hu
05 Nov 2024

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
04 Nov 2024

Attacking Vision-Language Computer Agents via Pop-ups
Yanzhe Zhang, Tao Yu, Diyi Yang
04 Nov 2024 · Tags: AAML, VLM

UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar
03 Nov 2024

Achieving Domain-Independent Certified Robustness via Knowledge Continuity
Alan Sun, Chiyu Ma, Kenneth Ge, Soroush Vosoughi
03 Nov 2024

SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao, Kejiang Chen, Weinan Zhang, Nenghai Yu
03 Nov 2024 · Tags: AAML

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper
02 Nov 2024

Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang
01 Nov 2024 · Tags: AAML

Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
01 Nov 2024 · Tags: AAML

Defense Against Prompt Injection Attack by Leveraging Attack Techniques
Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Dekai Wu, Bryan Hooi
01 Nov 2024 · Tags: SILM, AAML

Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs
Muhammed Saeed, Elgizouli Mohamed, Mukhtar Mohamed, Shaina Raza, Muhammad Abdul-Mageed, Shady Shehata
31 Oct 2024

Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models
Yiqi Yang, Hongye Fu
31 Oct 2024 · Tags: AAML

Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization
Omar Montasser, Han Shao, Emmanuel Abbe
30 Oct 2024 · Tags: OOD

Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang, Fengbin Zhu, Jingkun Tang, Pan Zhou, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua
30 Oct 2024 · Tags: AAML

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou, Shikun Zhang, Wei Ye
29 Oct 2024 · Tags: ELM

Enhancing Adversarial Attacks through Chain of Thought
Jingbo Su
29 Oct 2024 · Tags: LRM

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
Zhihao Liu, Chenhui Hu
29 Oct 2024 · Tags: ALM, ELM

SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Wanhua Li, Zibin Meng, Jiawei Zhou, D. Wei, Chuang Gan, Hanspeter Pfister
28 Oct 2024 · Tags: LRM, VLM

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
28 Oct 2024 · Tags: VLM, AAML

Reducing the Scope of Language Models
David Yunis, Siyu Huo, Chulaka Gunasekara, Danish Contractor
28 Oct 2024 · Tags: KELM

Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu
24 Oct 2024 · Tags: AAML

ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, J. Gao
24 Oct 2024

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
23 Oct 2024 · Tags: LLMAG, ELM, AAML, LM&Ro

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi
23 Oct 2024 · Tags: AAML

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David M. Krueger
22 Oct 2024 · Tags: LLMSV

VoiceBench: Benchmarking LLM-Based Voice Assistants
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, R. Tan, Haoyang Li
22 Oct 2024 · Tags: ELM, AuLLM

Remote Timing Attacks on Efficient Language Model Inference
Nicholas Carlini, Milad Nasr
22 Oct 2024

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Itay Nakash, George Kour, Guy Uziel, Ateret Anaby-Tavor
22 Oct 2024 · Tags: AAML, LLMAG

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration
Qintong Li, Jiahui Gao, Sheng Wang, Renjie Pi, Xueliang Zhao, Chuan Wu, Xin Jiang, Zhiyu Li, Lingpeng Kong
22 Oct 2024 · Tags: SyDa

LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang
22 Oct 2024

Bayesian scaling laws for in-context learning
Aryaman Arora, Dan Jurafsky, Christopher Potts, Noah D. Goodman
21 Oct 2024

Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis
Jonathan Brokman, Omer Hofman, Oren Rachmil, Inderjeet Singh, Vikas Pahuja, Rathina Sabapathy Aishvariya Priya, Amit Giloni, Roman Vainshtein, Hisashi Kojima
21 Oct 2024

NetSafe: Exploring the Topological Safety of Multi-agent Networks
Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang
21 Oct 2024

Boosting Jailbreak Transferability for Large Language Models
Hanqing Liu, Lifeng Zhou, Huanqian Yan
21 Oct 2024

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao-Li Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu
20 Oct 2024 · Tags: AAML

GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization
Zihui Wu, Haichang Gao, Ping Wang, Shudong Zhang, Zhaoxiang Liu, Kai Wang
19 Oct 2024

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, ..., Tian Gao, Werner Geyer, Nuno Moniz, Nitesh Chawla, Xiangliang Zhang
18 Oct 2024

Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation
Junhong Wu, Yang Zhao, Yangyifan Xu, Bing Liu, Chengqing Zong
17 Oct 2024 · Tags: CLL

SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao
17 Oct 2024 · Tags: AAML, SILM

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Sihang Li, Yongbin Li
17 Oct 2024

Estimating the Probabilities of Rare Outputs in Language Models
Gabriel Wu, Jacob Hilton
17 Oct 2024 · Tags: AAML, UQCV

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song
16 Oct 2024 · Tags: AI4CE

Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Ruimeng Ye, Yang Xiao, Bo Hui
16 Oct 2024 · Tags: ALM, ELM, OffRL

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
Jaehong Yoon, Shoubin Yu, Vaidehi Patil, Huaxiu Yao, Joey Tianyi Zhou
16 Oct 2024

Multi-round jailbreak attack on large language models
Yihua Zhou, Xiaochuan Shi
15 Oct 2024 · Tags: AAML

Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models
Hao Yang, Zhuang Li, Ehsan Shareghi, Gholamreza Haffari
15 Oct 2024 · Tags: AAML

A Theoretical Survey on Foundation Models
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao
15 Oct 2024