Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
ArXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 1,101 papers shown
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li
Xiaochen Yang
W. Zuo
Yiwen Guo
AAML
143
1
0
15 Oct 2024
Cognitive Overload Attack: Prompt Injection for Long Context
Bibek Upadhayay
Vahid Behzadan
Amin Karbasi
AAML
93
2
0
15 Oct 2024
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan
Udari Madhushani Sehwag
Michael-Andrei Panaitescu-Liess
Furong Huang
SILM, AAML
114
0
0
15 Oct 2024
Locking Down the Finetuned LLMs Safety
Minjun Zhu
Linyi Yang
Yifan Wei
Ningyu Zhang
Yue Zhang
108
14
0
14 Oct 2024
On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
188
4
0
14 Oct 2024
BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
Xinyuan Wang
Victor Shea-Jay Huang
Renmiao Chen
Hao Wang
Changzai Pan
Lei Sha
Minlie Huang
AAML
77
2
0
13 Oct 2024
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
Enyu Zhou
Guodong Zheng
Binghai Wang
Zhiheng Xi
Shihan Dou
...
Yurong Mou
Rui Zheng
Tao Gui
Qi Zhang
Xuanjing Huang
ALM
152
21
0
13 Oct 2024
Are You Human? An Adversarial Benchmark to Expose LLMs
Gilad Gressel
Rahul Pankajakshan
Yisroel Mirsky
DeLMO
77
2
0
12 Oct 2024
Can a large language model be a gaslighter?
Wei Li
Luyao Zhu
Yang Song
Ruixi Lin
Rui Mao
Yang You
52
0
0
11 Oct 2024
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang
Haoqin Tu
J. Mei
Bingchen Zhao
Yanjie Wang
Cihang Xie
57
9
0
11 Oct 2024
On the Adversarial Transferability of Generalized "Skip Connections"
Yisen Wang
Yichuan Mo
Dongxian Wu
Mingjie Li
Xingjun Ma
Zhouchen Lin
AAML
72
2
0
11 Oct 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu
Yue Feng
Zhao Xu
Lixin Su
Xinyu Ma
D. Yin
Hao Liu
ELM
105
15
0
11 Oct 2024
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang
Xiaogeng Liu
Chaowei Xiao
AAML
62
4
0
11 Oct 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar
Elaine Lau
Saranya Vijayakumar
Tu Trinh
Scale Red Team
...
Sean Hendryx
Shuyan Zhou
Matt Fredrikson
Summer Yue
Zifan Wang
LLMAG
93
26
0
11 Oct 2024
Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb
Fabien Roger
AAML, MU
113
29
0
11 Oct 2024
MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models
Shojiro Yamabe
Futa Waseda
Tsubasa Takahashi
Koki Wataoka
MoMe
138
1
0
11 Oct 2024
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Jingyu Zhang
Ahmed Elgohary
Ahmed Magooda
Daniel Khashabi
Benjamin Van Durme
470
8
0
11 Oct 2024
Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction
Jarrid Rector-Brooks
Mohsin Hasan
Zhangzhi Peng
Zachary Quinn
Chenghao Liu
...
Michael Bronstein
Yoshua Bengio
Pranam Chatterjee
Alexander Tong
Avishek Joey Bose
DiffM
108
12
0
10 Oct 2024
Towards Assurance of LLM Adversarial Robustness using Ontology-Driven Argumentation
Tomas Bueno Momcilovic
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
60
2
0
10 Oct 2024
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Seongyun Lee
Geewook Kim
Jiyeon Kim
Hyunji Lee
Hoyeon Chang
Sue Hyun Park
Minjoon Seo
84
1
0
10 Oct 2024
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act
Philipp Guldimann
Alexander Spiridonov
Robin Staab
Nikola Jovanović
Mark Vero
...
Mislav Balunović
Nikola Konstantinov
Pavol Bielik
Petar Tsankov
Martin Vechev
ELM
103
8
0
10 Oct 2024
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Qingni Wang
Tiantian Geng
Zhiyuan Wang
Teng Wang
Bo Fu
Feng Zheng
192
5
0
10 Oct 2024
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu
Lingyong Yan
Zihan Wang
D. Yin
Pengjie Ren
Maarten de Rijke
Zhaochun Ren
155
10
0
10 Oct 2024
SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
Han Shen
Pin-Yu Chen
Payel Das
Tianyi Chen
ALM
122
23
0
09 Oct 2024
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
Donghyun Lee
Mo Tiwari
LLMAG
79
25
0
09 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
173
1
0
09 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min Lin
89
13
0
09 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
493
4
0
09 Oct 2024
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding
Bolian Li
Ruqi Zhang
MLLM
138
15
0
09 Oct 2024
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu
Shujian Zhang
Kaiqiang Song
Silei Xu
Sanqiang Zhao
Ravi Agrawal
Sathish Indurthi
Chong Xiang
Prateek Mittal
Wenxuan Zhou
112
14
0
09 Oct 2024
Superficial Safety Alignment Hypothesis
Jianwei Li
Jung-Eun Kim
65
3
0
07 Oct 2024
Collaboration! Towards Robust Neural Methods for Routing Problems
Jianan Zhou
Yaoxin Wu
Zhiguang Cao
Wen Song
Jie Zhang
Zhiqi Shen
AAML
79
3
0
07 Oct 2024
Latent Feature Mining for Predictive Model Enhancement with Large Language Models
Bingxuan Li
Pengyi Shi
Amy Ward
130
0
0
06 Oct 2024
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
Yiting Dong
Guobin Shen
Dongcheng Zhao
Xiang He
Yi Zeng
75
2
0
05 Oct 2024
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang
Divyam Anshumaan
Ashish Hooda
Yudong Chen
Somesh Jha
AAML
96
0
0
05 Oct 2024
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu
Lingrui Mei
Ruibin Yuan
Lujun Li
Wei Xue
Yike Guo
77
2
0
04 Oct 2024
Towards Assuring EU AI Act Compliance and Adversarial Robustness of LLMs
Tomas Bueno Momcilovic
Beat Buesser
Giulio Zizzo
Mark Purcell
Dian Balta
AAML
66
3
0
04 Oct 2024
Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs
Tomas Bueno Momcilovic
Dian Balta
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
87
1
0
04 Oct 2024
Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs
Tomas Bueno Momcilovic
Dian Balta
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
68
0
0
04 Oct 2024
Gradient-based Jailbreak Images for Multimodal Fusion Models
Javier Rando
Hannah Korevaar
Erik Brinkman
Ivan Evtimov
Florian Tramèr
AAML
78
3
0
04 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
147
11
0
04 Oct 2024
Permissive Information-Flow Analysis for Large Language Models
Shoaib Ahmed Siddiqui
Radhika Gaonkar
Boris Köpf
David M. Krueger
Andrew Paverd
Ahmed Salem
Shruti Tople
Lukas Wutschitz
Menglin Xia
Santiago Zanella Béguelin
135
2
0
04 Oct 2024
Output Scouting: Auditing Large Language Models for Catastrophic Responses
Andrew Bell
Joao Fonseca
KELM
145
2
0
04 Oct 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Ruibin Yuan
Xueqi Cheng
113
5
0
03 Oct 2024
Hate Personified: Investigating the role of LLMs in content moderation
Sarah Masud
Sahajpreet Singh
Viktor Hangya
Alexander Fraser
Tanmoy Chakraborty
63
9
0
03 Oct 2024
Erasing Conceptual Knowledge from Language Models
Rohit Gandikota
Sheridan Feucht
Samuel Marks
David Bau
KELM, ELM, MU
129
11
0
03 Oct 2024
Undesirable Memorization in Large Language Models: A Survey
Ali Satvaty
Suzan Verberne
Fatih Turkmen
ELM, PILM
196
7
0
03 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
118
4
0
03 Oct 2024
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu
Peiran Li
Edward Suh
Yevgeniy Vorobeychik
Zhuoqing Mao
Somesh Jha
Patrick McDaniel
Huan Sun
Bo Li
Chaowei Xiao
131
32
0
03 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova
Erik Brinkman
Krithika Iyer
Vítor Albiero
Joanna Bitton
Hailey Nguyen
Jingkai Li
Cristian Canton Ferrer
Ivan Evtimov
Aaron Grattafiori
ALM
72
12
0
02 Oct 2024