Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
arXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Showing 50 of 1,101 citing papers.
REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, Xiting Wang
15 Apr 2025

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks
Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
AAML · 15 Apr 2025

The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić, Luze Sun, Jie Zhang, F. Tramèr
14 Apr 2025

StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models
Yang Feng, Xudong Pan
AAML · 14 Apr 2025

Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Bo Shen
AAML · 14 Apr 2025

Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Yanbo Wang, Jiyang Guan, Jian Liang, Ran He
14 Apr 2025
The Structural Safety Generalization Problem
Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
AAML · 13 Apr 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, ..., Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
AAML · LLMSV · 13 Apr 2025

CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent
Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang
AAML · 13 Apr 2025

Detecting Instruction Fine-tuning Attack on Language Models with Influence Function
Jiawei Li
TDI · AAML · 12 Apr 2025

Feature-Aware Malicious Output Detection and Mitigation
Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang
AAML · 12 Apr 2025
PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization
Yang Jiao, Xiao Wang, Kai Yang
AAML · SILM · 10 Apr 2025

Geneshift: Impact of different scenario shift on Jailbreaking LLM
Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng
10 Apr 2025

AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks
Charlotte Siska, Anush Sankaran
AAML · 10 Apr 2025

Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
09 Apr 2025

NLP Security and Ethics, in the Wild
Heather Lent, Erick Galinkin, Yiyi Chen, Jens Myrup Pedersen, Leon Derczynski, Johannes Bjerva
SILM · 09 Apr 2025

Bridging the Gap Between Preference Alignment and Machine Unlearning
Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Lulu Zhang, Tianyu Du, Chaochao Chen
MU · 09 Apr 2025

Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators
Xitao Li, Haoran Wang, Jiang Wu, Ting Liu
AAML · 08 Apr 2025
StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
08 Apr 2025

Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Jiawei Lian, Jianhong Pan, L. Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
AAML · 07 Apr 2025

A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González, Andrés Herrera-Poyatos, Cristina Zuheros, David Herrera-Poyatos, Virilo Tejedor, F. Herrera
AAML · 07 Apr 2025

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
AI4CE · 05 Apr 2025

Rethinking Reflection in Pre-Training
Essential AI, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, ..., Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski
ReLM · LRM · 05 Apr 2025

The H-Elena Trojan Virus to Infect Model Weights: A Wake-Up Call on the Security Risks of Malicious Fine-Tuning
Virilo Tejedor, Cristina Zuheros, Carlos Peláez-González, David Herrera-Poyatos, Andrés Herrera-Poyatos, F. Herrera
04 Apr 2025

Exploiting Fine-Grained Skip Behaviors for Micro-Video Recommendation
Sanghyuck Lee, Sangkeun Park, Jaesung Lee
04 Apr 2025
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, A. Grama, Junyuan Hong
SyDa · 03 Apr 2025

Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
Tian Jin, Xiao Yu, Ninareh Mehrabi, Rahul Gupta, Zhou Yu, Ruoxi Jia
AAML · LLMAG · 02 Apr 2025

Emerging Cyber Attack Risks of Medical AI Agents
Jianing Qiu, Lin Li, Jiankai Sun, Hao Wei, Zhe Xu, K. Lam, Wu Yuan
AAML · 02 Apr 2025

PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu, Lulu Tang, Ting Pan, Yuguo Yin, Bin Wang, Ao Yang
MLLM · AAML · 02 Apr 2025

Representation Bending for Large Language Model Safety
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
AAML · ALM · KELM · 02 Apr 2025

Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Erfan Shayegani, G M Shahariar, Sara Abdali, Lei Yu, Nael B. Abu-Ghazaleh, Yue Dong
AAML · 01 Apr 2025
CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation
Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Yiran Chen, Huan Liu
01 Apr 2025

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms
Shuoming Zhang, Jiacheng Zhao, Ruiyuan Xu, Xiaobing Feng, Huimin Cui
AAML · 31 Mar 2025

DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents
Shiyi Yang, Zhibo Hu, Xinshu Li, Chen Wang, Tong Yu, Xiwei Xu, Liming Zhu, Lina Yao
AAML · 31 Mar 2025

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Yun, Charles Flemming, Tianlong Chen
AAML · LLMAG · 31 Mar 2025
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Runpeng Dai, Run Yang, Fan Zhou, Hongtu Zhu
28 Mar 2025

ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun, Ge Yan, Tsui-Wei Weng
KELM · LRM · 27 Mar 2025

Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models
Shih-Wen Ke, Guan-Yu Lai, Guo-Lin Fang, Hsi-Yuan Kao
SILM · 26 Mar 2025

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, Eunho Yang
AAML · 26 Mar 2025

Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Zhiyao Ren, Yibing Zhan, B. Yu, Dacheng Tao
DiffM · 25 Mar 2025
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang, Yuxuan Yuan, Chen Chen, Zeyu Zhang, Yue Huang, Kun Zhang
24 Mar 2025

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Yalan Qin, Xiuying Chen, Rui Pan, Han Zhu, Chen Zhang, ..., Chi-Min Chan, Sirui Han, Yike Guo, Yiran Yang, Yaodong Yang
OffRL · 22 Mar 2025

A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Jian Guan, Jian Wu, Jia-Nan Li, Chuanqi Cheng, Wei Wu
LM&MA · 21 Mar 2025

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera, S. Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
MoMe · 21 Mar 2025

Towards LLM Guardrails via Sparse Representation Steering
Zeqing He, Peng Kuang, Huiyu Xu, Kui Ren
LLMSV · 21 Mar 2025

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, ..., Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan
ELM · 21 Mar 2025
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Aly M. Kassem, Bernhard Schölkopf, Zhijing Jin
20 Mar 2025

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou, Kevin E. Wu, Francesco Pinto, Zhongfu Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li
LLMAG · AAML · 20 Mar 2025

Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
ELM · 19 Mar 2025

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents
Juhee Kim, Woohyuk Choi, Byoungyoung Lee
LLMAG · 17 Mar 2025