Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
arXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Showing 50 of 1,101 citing papers.
  • AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
    Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy
    ELM · 17 Mar 2025
  • Augmented Adversarial Trigger Learning
    Zhe Wang, Yanjun Qi
    16 Mar 2025
  • Empirical Privacy Variance
    Yuzheng Hu, Fan Wu, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David A. Forsyth
    16 Mar 2025
  • Safe Vision-Language Models via Unsafe Weights Manipulation
    Moreno D'Incà, E. Peruzzo, Xingqian Xu, Humphrey Shi, N. Sebe, Massimiliano Mancini
    MU · 14 Mar 2025
  • Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
    Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
    AAML, MU · 14 Mar 2025
  • Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
    Andy Zhou, Ron Arel
    MU · 13 Mar 2025
  • Rethinking Prompt-based Debiasing in Large Language Models
    Xinyi Yang, Runzhe Zhan, Derek F. Wong, Shu Yang, Junchao Wu, Lidia S. Chao
    ALM · 12 Mar 2025
  • Backtracking for Safety
    Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, Siddhartha Reddy Jonnalagadda
    KELM · 11 Mar 2025
  • Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
    Parishad BehnamGhader, Nicholas Meade, Siva Reddy
    11 Mar 2025
  • Generating Robot Constitutions & Benchmarks for Semantic Safety
    P. Sermanet, Anirudha Majumdar, A. Irpan, Dmitry Kalashnikov, Vikas Sindhwani
    LM&Ro · 11 Mar 2025
  • Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
    Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Yongqian Li, Chengkun Wei, Wenzhi Chen
    AAML · 11 Mar 2025
  • CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation
    Runqi Sui
    AAML · 10 Mar 2025
  • Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
    Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Xinming Zhang
    AAML · 10 Mar 2025
  • Safety Guardrails for LLM-Enabled Robots
    Zachary Ravichandran, Alexander Robey, Vijay Kumar, George Pappas, Hamed Hassani
    10 Mar 2025
  • Trustworthy Machine Learning via Memorization and the Granular Long-Tail: A Survey on Interactions, Tradeoffs, and Beyond
    Qiongxiu Li, Xiaoyu Luo, Yiyi Chen, Johannes Bjerva
    10 Mar 2025
  • Life-Cycle Routing Vulnerabilities of LLM Router
    Qiqi Lin, Xiaoyang Ji, Shengfang Zhai, Qingni Shen, Zhi-Li Zhang, Yuejian Fang, Yansong Gao
    AAML · 09 Mar 2025
  • Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
    Thomas Winninger, Boussad Addad, Katarzyna Kapusta
    AAML · 08 Mar 2025
  • ToxicSQL: Migrating SQL Injection Threats into Text-to-SQL Models via Backdoor Attack
    Meiyu Lin, Haichuan Zhang, Jiale Lao, Renyuan Li, Yuanchun Zhou, Carl Yang, Yang Cao, Mingjie Tang
    SILM · 07 Mar 2025
  • Jailbreaking is (Mostly) Simpler Than You Think
    M. Russinovich, Ahmed Salem
    AAML · 07 Mar 2025
  • Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
    Erik Jones, Arjun Patrawala, Jacob Steinhardt
    06 Mar 2025
  • SafeArena: Evaluating the Safety of Autonomous Web Agents
    Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, Siva Reddy
    LLMAG, ELM · 06 Mar 2025
  • Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
    Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
    AAML · 05 Mar 2025
  • Improving LLM Safety Alignment with Dual-Objective Optimization
    Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
    AAML, MU · 05 Mar 2025
  • LLM-Safety Evaluations Lack Robustness
    Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann
    ALM, ELM · 04 Mar 2025
  • Adversarial Tokenization
    Renato Lui Geh, Zilei Shao, Guy Van den Broeck
    SILM, AAML · 04 Mar 2025
  • Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
    Alberto Purpura, Sahil Wadhwa, Jesse Zymet, Akshay Gupta, Andy Luo, Melissa Kazemi Rad, Swapnil Shinde, Mohammad Sorower
    AAML · 03 Mar 2025
  • Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary AI systems
    Ben Bucknall, Robert F. Trager, Michael A. Osborne
    03 Mar 2025
  • Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
    Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani
    AAML, LRM · 03 Mar 2025
  • FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts
    Ziyi Zhang, Zhen Sun, Zheng Zhang, Jihui Guo, Xinlei He
    AAML · 28 Feb 2025
  • UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning
    J.N. Zhang, Shuang Yang, B. Li
    AAML, LLMAG · 28 Feb 2025
  • À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
    K. O. T. Erziev
    28 Feb 2025
  • Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
    Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, ..., An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
    28 Feb 2025
  • Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
    Hanjiang Hu, Alexander Robey, Changliu Liu
    AAML, LLMSV · 28 Feb 2025
  • Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
    Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xinsong Zhang
    AAML · 27 Feb 2025
  • Societal Alignment Frameworks Can Improve LLM Alignment
    Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy
    27 Feb 2025
  • Shh, don't say that! Domain Certification in LLMs
    Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Guohao Li, Philip Torr, Adel Bibi
    26 Feb 2025
  • JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
    Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
    ELM · 26 Feb 2025
  • Automatic Prompt Optimization via Heuristic Search: A Survey
    Wendi Cui, Jiaxin Zhang, Zechao Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar
    26 Feb 2025
  • Model Lakes
    Koyena Pal, David Bau, Renée J. Miller
    24 Feb 2025
  • On the Robustness of Transformers against Context Hijacking for Linear Classification
    Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou
    24 Feb 2025
  • REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
    AAML · 24 Feb 2025
  • AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
    Zhexin Zhang, Leqi Lei, Junxiao Yang, Xijie Huang, Yida Lu, ..., Xianqi Lei, Changzai Pan, Lei Sha, Han Wang, Minlie Huang
    AAML · 24 Feb 2025
  • The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
    Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
    LLMSV · 24 Feb 2025
  • GuidedBench: Equipping Jailbreak Evaluation with Guidelines
    Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
    ALM, ELM · 24 Feb 2025
  • Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
    Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, P. Sattigeri, Kush R. Varshney
    AAML · 24 Feb 2025
  • Unified Prompt Attack Against Text-to-Image Generation Models
    Duo Peng, Qiuhong Ke, Mark He Huang, Ping Hu, Jing Liu
    23 Feb 2025
  • Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System
    Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehnaz Khaled, Ahmedul Kabir
    LLMAG · 23 Feb 2025
  • Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models
    Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao
    AAML, SyDa, ELM · 23 Feb 2025
  • ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
    Xianglong Liu, Siyuan Liang, M. Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, Dacheng Tao
    AAML, SILM, ELM · 22 Feb 2025
  • Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging
    Lin Lu, Zhigang Zuo, Ziji Sheng, Pan Zhou
    MoMe · 22 Feb 2025