Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.15043
Cited By
Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Universal and Transferable Adversarial Attacks on Aligned Language Models"
50 / 951 papers shown
Title
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther
Adish Singla
Goran Radanović
AAML
62
0
0
14 Jan 2025
Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models
Dhruv Dhamani
Mary Lou Maher
LLMAG
71
0
0
14 Jan 2025
Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel
Amanda Minnich
Shiven Chawla
Gary Lopez
Martin Pouliot
...
Pete Bryan
Ram Shankar Siva Kumar
Yonatan Zunger
Chang Kawaguchi
Mark Russinovich
AAML
VLM
44
5
0
13 Jan 2025
Safeguarding System Prompts for LLMs
Zhifeng Jiang
Zhihua Jin
Guoliang He
AAML
SILM
113
1
0
10 Jan 2025
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
Fengxiang Wang
Ranjie Duan
Peng Xiao
Xiaojun Jia
Shiji Zhao
...
Hang Su
Jialing Tao
Hui Xue
Jun Zhu
Hui Xue
LLMAG
64
7
0
08 Jan 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
SILM
81
7
0
08 Jan 2025
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Yang Ouyang
Hengrui Gu
Shuhang Lin
Wenyue Hua
Jie Peng
B. Kailkhura
Tianlong Chen
Kaixiong Zhou
Kaixiong Zhou
AAML
31
2
0
05 Jan 2025
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu
Sihang Li
Yingjie Zhou
Xing Fan
Kun Wang
Shirui Pan
Qingsong Wen
AAML
67
0
0
03 Jan 2025
Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines
Xiyang Hu
AAML
36
1
0
01 Jan 2025
GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
Matan Ben-Tov
Mahmood Sharif
RALM
40
1
0
31 Dec 2024
Enhancing AI Safety Through the Fusion of Low Rank Adapters
Satya Swaroop Gudipudi
Sreeram Vipparla
Harpreet Singh
Shashwat Goel
Ponnurangam Kumaraguru
MoMe
AAML
46
2
0
30 Dec 2024
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel
Kai Y. Xiao
Johannes Heidecke
Lilian Weng
AAML
43
4
0
24 Dec 2024
DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang
Hao Li
Junda Zhu
Xinyuan Wang
Changzai Pan
Minlie Huang
Lei Sha
186
0
0
23 Dec 2024
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum
Christoph Schnabl
Gabor Hollbeck
Silas Alberti
Philip Blinde
Marvin von Hagen
92
2
0
22 Dec 2024
The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents
Feiran Jia
Tong Wu
Xin Qin
Anna Squicciarini
LLMAG
AAML
98
4
0
21 Dec 2024
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong
Wenbo Hu
Wei Xu
Tianxing He
72
0
0
19 Dec 2024
Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation
Minkyoung Kim
Yunha Kim
Hyeram Seo
Heejung Choi
Jiye Han
...
Hyoje Jung
Byeolhee Kim
Young-Hak Kim
Sanghyun Park
Tae Joon Jun
AAML
86
0
0
18 Dec 2024
Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang
Fnu Suya
Rishi Jha
Collin Zhang
Vitaly Shmatikov
AAML
90
1
0
18 Dec 2024
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes
Marco Christiani
David Shriver
Marissa Connor
KELM
85
1
0
17 Dec 2024
Jailbreaking? One Step Is Enough!
Weixiong Zheng
Peijian Zeng
Yuchen Li
Hongyan Wu
Nankai Lin
Jianfei Chen
Aimin Yang
Yue Zhou
AAML
86
0
0
17 Dec 2024
LLMs Can Simulate Standardized Patients via Agent Coevolution
Zhuoyun Du
Lujie Zheng
Renjun Hu
Yuyang Xu
Xiaochen Li
Ying Sun
Wei Chen
Jian Wu
Haolei Cai
Haohao Ying
LM&MA
83
3
0
16 Dec 2024
The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang
Yingkang Wang
Shiyao Cui
Pei Ke
J. Tang
123
1
0
15 Dec 2024
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Zhiyu Xue
Guangliang Liu
Bocheng Chen
K. Johnson
Ramtin Pedarsani
AAML
73
0
0
13 Dec 2024
FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
Bocheng Chen
Hanqing Guo
Qiben Yan
AAML
84
0
0
10 Dec 2024
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation
Xuying Li
Zhuo Li
Yuji Kosuga
Yasuhiro Yoshida
Victor Bian
AAML
94
2
0
05 Dec 2024
Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun
Rahul Madhavan
Sravanti Addepalli
A. Suggala
Karthikeyan Shanmugam
Prateek Jain
LRM
SyDa
71
0
0
03 Dec 2024
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Erick Galinkin
Martin Sablotny
83
0
0
02 Dec 2024
Yi-Lightning Technical Report
01. AI
:
Alan Wake
Albert Wang
Bei Chen
...
Yuxuan Sha
Zhaodong Yan
Zhiyuan Liu
Zirui Zhang
Zonghong Dai
OSLM
102
3
0
02 Dec 2024
Quantized Delta Weight Is Safety Keeper
Yule Liu
Zhen Sun
Xinlei He
Xinyi Huang
96
2
0
29 Nov 2024
On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code
Md. Imran Hossen
X. Hei
AAML
ELM
74
0
0
29 Nov 2024
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
Shenghui Li
Edith C.H. Ngai
Fanghua Ye
Thiemo Voigt
SILM
90
6
0
28 Nov 2024
Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication
Davide Garassino
Vivana Masia
Nicola Brocca
Alice Delorme Benites
69
0
0
27 Nov 2024
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao
Bryan Hooi
Qingbin Liu
Kai-Wei Chang
Zi Huang
Yujun Cai
AAML
102
1
0
27 Nov 2024
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
Zhi-Yi Chin
Kuan-Chen Mu
Mario Fritz
Pin-Yu Chen
DiffM
90
0
0
25 Nov 2024
RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks
Changyue Jiang
Xudong Pan
Geng Hong
Chenfu Bao
Min Yang
SILM
77
10
0
21 Nov 2024
Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia
Yihao Huang
Yang Liu
Peng Yan Tan
Weng Kuan Yau
...
Yan Wang
Rick Siow Mong Goh
Liangli Zhen
Yingjie Zhang
Zhe Zhao
ELM
AILaw
81
0
0
21 Nov 2024
Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation
Ke Zhao
Huayang Huang
Miao Li
Yu Wu
AAML
71
1
0
21 Nov 2024
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min
Long H. Pham
Yige Li
Jun Sun
AAML
69
4
0
18 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Peng Kuang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
62
3
0
17 Nov 2024
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi
Ujjwal Karn
Hongyuan Zhan
Eric Michael Smith
Javier Rando
Yiming Zhang
Kate Plawiak
Zacharie Delpierre Coudert
Kartikeya Upasani
Mahesh Pasupuleti
MLLM
3DH
62
22
0
15 Nov 2024
Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
Xuannan Liu
Xing Cui
Peipei Li
Zekun Li
Huaibo Huang
Shuhan Xia
Miaoxuan Zhang
Yueying Zou
Ran He
AAML
67
8
0
14 Nov 2024
DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu
Boran Wang
34
0
0
14 Nov 2024
New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang
Tianqing Zhu
Chi Liu
Wanlei Zhou
Shui Yu
Philip S. Yu
AAML
ELM
PILM
69
1
0
12 Nov 2024
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
Yannis Belkhiter
Giulio Zizzo
S. Maffeis
46
1
0
11 Nov 2024
A Survey of AI-Related Cyber Security Risks and Countermeasures in Mobility-as-a-Service
Kai-Fung Chu
Haiyue Yuan
Jinsheng Yuan
Weisi Guo
Nazmiye Balta-Ozkan
Shujun Li
44
2
0
08 Nov 2024
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu
Erfan Shayegani
Trishna Chakraborty
Rohit Lal
Arindam Dutta
Chengyu Song
Yue Dong
Nael B. Abu-Ghazaleh
Amit K. Roy-Chowdhury
36
0
0
06 Nov 2024
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao
Daniel Ben-Levi
Wei Hao
Junfeng Yang
Chengzhi Mao
AAML
230
1
0
06 Nov 2024
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega
Junsheng Huang
Gaokai Zhang
Hangoo Kang
Minjia Zhang
Gagandeep Singh
39
0
0
05 Nov 2024
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang
Mengxi Gao
Yibo Yan
Xin Zou
Yanggan Gu
Aiwei Liu
Xuming Hu
46
5
0
05 Nov 2024
Attacking Vision-Language Computer Agents via Pop-ups
Yanzhe Zhang
Tao Yu
Diyi Yang
AAML
VLM
40
21
0
04 Nov 2024
Previous
1
2
3
4
5
6
...
18
19
20
Next