Universal and Transferable Adversarial Attacks on Aligned Language Models
arXiv: 2307.15043 (v2, latest) · 27 July 2023

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Links: arXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 1,101 papers shown)
Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J. Zico Kolter · 23 Apr 2024

A Survey on the Real Power of ChatGPT
Ming Liu, Ran Liu, Ye Zhu, Hua Wang, Youyang Qu, Rongsheng Li, Yongpan Sheng, Wray Buntine · 22 Apr 2024

Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, ..., Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, William S. Isaac · ELM · 22 Apr 2024

Protecting Your LLMs with Information Bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian · KELM, AAML · 22 Apr 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, F. Tramèr · 22 Apr 2024

Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov · 21 Apr 2024

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian · AAML · 21 Apr 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Y. Xiao, R. Leike, Lilian Weng, Johannes Heidecke, Alex Beutel · SILM · 19 Apr 2024

Advancing the Robustness of Large Language Models through Self-Denoised Smoothing
Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, Shiyu Chang · AAML · 18 Apr 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Zhihao Xu, Ruixuan Huang, Changyu Chen, Shuai Wang, Xiting Wang · LLMSV · 18 Apr 2024

Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, Mark Ibrahim · 16 Apr 2024

Forcing Diffuse Distributions out of Language Models
Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, Daphne Ippolito · ALM, DiffM · 16 Apr 2024

Private Attribute Inference from Images with Vision-Language Models
Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev · VLM, PILM · 16 Apr 2024

Learn Your Reference Model for Real Good Alignment
Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov · OffRL · 15 Apr 2024

Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies
Brian Bartoldson, James Diffenderfer, Konstantinos Parasyris, B. Kailkhura · AAML · 14 Apr 2024

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts
Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong · 12 Apr 2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Haoyu Tian, Wei Zhang, Minfeng Zhu, Wei Chen · 12 Apr 2024

LLM Agents can Autonomously Exploit One-day Vulnerabilities
Richard Fang, R. Bindu, Akul Gupta, Daniel Kang · SILM, LLMAG · 11 Apr 2024

Latent Guard: a Safety Framework for Text-to-image Generation
Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati · 11 Apr 2024

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao, Huan Sun · AAML · 11 Apr 2024

Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu, Jerry W. Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, ..., Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai · SyDa, EgoV · 11 Apr 2024

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay, Vahid Behzadan · AAML · 09 Apr 2024

Rethinking How to Evaluate Language Model Jailbreak
Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik · ALM · 09 Apr 2024

On adversarial training and the 1 Nearest Neighbor classifier
Amir Hagai, Yair Weiss · AAML · 09 Apr 2024

CodecLM: Aligning Language Models with Tailored Synthetic Data
Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister · SyDa, ALM · 08 Apr 2024

Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model Merging
Tianshuo Cong, Delong Ran, Zesen Liu, Xinlei He, Jinyuan Liu, Yichen Gong, Qi Li, Anyu Wang, Xiaoyun Wang · MoMe · 08 Apr 2024

Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra
Darioush Kevian, U. Syed, Xing-ming Guo, Aaron J. Havens, Geir Dullerud, Peter M. Seiler, Lianhui Qin, Bin Hu · ELM · 04 Apr 2024

Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity
Jacob Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Kumar Avinava Dubey, Vikas Sindhwani · LM&Ro · 04 Apr 2024

Vocabulary Attack to Hijack Large Language Model Applications
Patrick Levi, Christoph P. Neumann · AAML · 03 Apr 2024

Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation
Declan Grabb, Max Lamparth, N. Vasan · 02 Apr 2024

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich, Ahmed Salem, Ronen Eldan · 02 Apr 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion · AAML · 02 Apr 2024

Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan · KELM · 28 Mar 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, ..., Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, Eric Wong · ALM, ELM, AAML · 28 Mar 2024

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, J. Williams, George Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter · DiffM · 28 Mar 2024

Can multiple-choice questions really be useful in detecting the abilities of LLMs?
Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia · ELM · 26 Mar 2024

Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong · AAML · 26 Mar 2024

The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition
Georgios Chochlakis, Alexandros Potamianos, Kristina Lerman, Shrikanth Narayanan · 25 Mar 2024

Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling, Manish Nagireddy, P. Sattigeri, Elizabeth M. Daly, David Piorkowski, John T. Richards · ALM · 22 Mar 2024

Testing the Limits of Jailbreaking Defenses with the Purple Problem
Taeyoun Kim, Suhas Kotha, Aditi Raghunathan · AAML · 20 Mar 2024

Defending Against Indirect Prompt Injection Attacks With Spotlighting
Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman · AAML, SILM · 20 Mar 2024

As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Anjun Hu, Jindong Gu, Francesco Pinto, Konstantinos Kamnitsas, Philip Torr · AAML, SILM · 19 Mar 2024

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Basel Alomair, Yue Liu · AAML, KELM · 19 Mar 2024

Embodied LLM Agents Learn to Cooperate in Organized Teams
Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L. Griffiths, Mengdi Wang · LM&Ro, LLMAG · 19 Mar 2024

SelfIE: Self-Interpretation of Large Language Model Embeddings
Haozhe Chen, Carl Vondrick, Chengzhi Mao · 16 Mar 2024

Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
Dongmin Park, Zhaofang Qian, Guangxing Han, Ser-Nam Lim · MLLM · 15 Mar 2024

Towards White Box Deep Learning
Maciej Satkiewicz · AAML · 14 Mar 2024

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen · 14 Mar 2024

An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models
Haochen Luo, Jindong Gu, Fengyuan Liu, Philip Torr · VLM, VPVLM, AAML · 14 Mar 2024

A Moral Imperative: The Need for Continual Superalignment of Large Language Models
Gokul Puthumanaillam, Manav Vora, Pranay Thangeda, Melkior Ornik · 13 Mar 2024