arXiv:2307.15043 (v2, latest)
Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Links: arXiv (abs) · PDF · HTML · GitHub (3,937★)
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (showing 50 of 1,101)
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models. Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu. 20 Jun 2025. [AAML]
- From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers. Jingtong Su, Julia Kempe, Karen Ullrich. 20 Jun 2025.
- MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning. Muyang Zheng, Yuanzhi Yao, C. D. Lin, Rui Wang, Meng Han. 20 Jun 2025. [AAML, VLM]
- Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models. Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li. 19 Jun 2025. [AAML]
- Probing the Robustness of Large Language Models Safety to Latent Perturbations. Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang. 19 Jun 2025. [AAML, LLMSV]
- From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem. Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu. 18 Jun 2025. [AAML]
- LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning. Gabrel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong. 18 Jun 2025. [AAML]
- Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts. Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar. 18 Jun 2025. [AAML]
- Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack. Daewon Kang, YeongHwan Shin, Doyeon Kim, Kyu-Hwan Jung, Meong Hi Son. 17 Jun 2025. [AAML, SILM]
- Excessive Reasoning Attack on Reasoning LLMs. Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang. 17 Jun 2025. [AAML, LRM]
- AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions. Aishan Liu, Zonghao Ying, L. Wang, Junjie Mu, Jinyang Guo, ..., Yuqing Ma, Siyuan Liang, Mingchuan Zhang, Xianglong Liu, Dacheng Tao. 17 Jun 2025.
- FORTRESS: Frontier Risk Evaluation for National Security and Public Safety. Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael. 17 Jun 2025. [AAML, ELM]
- RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? Rohan Gupta, Erik Jenner. 17 Jun 2025.
- OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko. 17 Jun 2025.
- ExtendAttack: Attacking Servers of LRMs via Extending Reasoning. Zhenhao Zhu, Yue Liu, Yingwei Ma, Hongcheng Gao, Nuo Chen, Yanpei Guo, Wenjie Qu, Huiying Xu, Xinzhong Zhu, Jiaheng Zhang. 16 Jun 2025. [AAML, LRM]
- Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs. Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He. 16 Jun 2025. [KELM, AAML]
- Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models. Arjun Krishna, Aaditya Rastogi, Erick Galinkin. 16 Jun 2025. [AAML, ELM, LRM]
- Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments. Xuan Wang, Siyuan Liang, Zhe Liu, Yi Yu, Yuliang Lu, Xiaochun Cao, Ee-Chien Chang, X. Gao. 16 Jun 2025. [AAML]
- Jailbreak Strength and Model Similarity Predict Transferability. Rico Angell, Jannik Brinkmann, He He. 15 Jun 2025.
- The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models. Peiyuan Tang, Haojie Xin, Xiaodong Zhang, Jun Sun, Qin Xia, Zijiang Yang. 15 Jun 2025. [VLM]
- ContextBench: Modifying Contexts for Targeted Latent Activation. Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom. 15 Jun 2025.
- Universal Jailbreak Suffixes Are Strong Attention Hijackers. Matan Ben-Tov, Mor Geva, Mahmood Sharif. 15 Jun 2025.
- Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025. Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, ..., Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao. 14 Jun 2025.
- Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization. Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys. 14 Jun 2025. [MU]
- QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety. Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng. 14 Jun 2025. [AAML]
- InfoFlood: Jailbreaking Large Language Models with Information Overload. Advait Yadav, Haibo Jin, Man Luo, Jun Zhuang, Haohan Wang. 13 Jun 2025. [AAML]
- Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback. Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi. 13 Jun 2025. [LRM]
- From Emergence to Control: Probing and Modulating Self-Reflection in Language Models. Xudong Zhu, Jiachen Jiang, Mohammad Mahdi Khalili, Zhihui Zhu. 13 Jun 2025. [ReLM, LM&Ro, LRM]
- Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods. Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz. 11 Jun 2025. [AAML, MU]
- LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge. Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, J. Zhang, Jun Wang, K. Lam, Shouling Ji. 11 Jun 2025. [AAML, ELM]
- When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs. Xiao Li, Joel Kreuzwieser, Alan Peters. 11 Jun 2025.
- Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs. Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara. 11 Jun 2025.
- Textual Bayes: Quantifying Uncertainty in LLM-Based Systems. Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, ..., Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, Jesse C. Cresswell. 11 Jun 2025.
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following. Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li. 11 Jun 2025. [OffRL]
- Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation. Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey. 10 Jun 2025. [AI4CE]
- AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin. Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, ..., Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan. 10 Jun 2025.
- SoK: Machine Unlearning for Large Language Models. Jie Ren, Yue Xing, Yingqian Cui, Charu C. Aggarwal, Hui Liu. 10 Jun 2025. [MU]
- SafeCoT: Improving VLM Safety with Minimal Reasoning. Jiachen Ma, Zhanhui Zhou, Chao Yang, Chaochao Lu. 10 Jun 2025. [LRM]
- Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures. Yukai Zhou, Sibei Yang, Wenjie Wang. 09 Jun 2025. [AAML]
- InverseScope: Scalable Activation Inversion for Interpreting Large Language Models. Yifan Luo, Zhennan Zhou, Bin Dong. 09 Jun 2025.
- TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts. T. Krauß, Hamid Dashtbani, Alexandra Dmitrienko. 09 Jun 2025.
- JavelinGuard: Low-Cost Transformer Architectures for LLM Security. Yash Datta, Sharath Rajasekar. 09 Jun 2025.
- Mind the Web: The Security of Web Use Agents. Avishag Shapira, Parth A. Gandhi, Edan Habler, Oleg Brodt, A. Shabtai. 08 Jun 2025. [LLMAG]
- AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint. Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua. 08 Jun 2025. [LLMSV]
- Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models. Ren-Jian Wang, Ke Xue, Zeyu Qin, Ziniu Li, Sheng Tang, Hao-Tian Li, Shengcai Liu, Chao Qian. 08 Jun 2025. [AAML]
- Reward Model Interpretability via Optimal and Pessimal Tokens. Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska. 08 Jun 2025. [AAML]
- Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations. Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani. 08 Jun 2025. [AAML]
- Tokenized Bandit for LLM Decoding and Alignment. Suho Shin, Chenghao Yang, Haifeng Xu, Mohammad T. Hajiaghayi. 08 Jun 2025.
- Transferring Features Across Language Models With Model Stitching. Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick. 07 Jun 2025.
- What Makes a Good Natural Language Prompt? Do Xuan Long, Duy Dinh, Ngoc-Hai Nguyen, Kenji Kawaguchi, Nancy F. Chen, Shafiq Joty, Min-Yen Kan. 07 Jun 2025.