Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
arXiv 2307.15043 (abs / PDF / HTML) · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 1,101 papers shown
Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
LM&MA · 112 · 205 · 0 · 26 Sep 2023

Can LLM-Generated Misinformation Be Detected?
Canyu Chen, Kai Shu
DeLMO · 195 · 182 · 0 · 25 Sep 2023

ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning
Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad
110 · 16 · 0 · 24 Sep 2023

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, ..., Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Haotong Zhang
126 · 221 · 0 · 21 Sep 2023

Knowledge Sanitization of Large Language Models
Yoichi Ishibashi, Hidetoshi Shimodaira
KELM · 129 · 25 · 0 · 21 Sep 2023

How Robust is Google's Bard to Adversarial Image Attacks?
Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiaohu Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
AAML · 118 · 116 · 0 · 21 Sep 2023

Model Leeching: An Extraction Attack Targeting LLMs
Lewis Birch, William Hackett, Stefan Trawicki, N. Suri, Peter Garraghan
83 · 13 · 0 · 19 Sep 2023

LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
Umar Iqbal, Tadayoshi Kohno, Franziska Roesner
ELM, SILM · 126 · 49 · 0 · 19 Sep 2023

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM · 230 · 353 · 0 · 19 Sep 2023

Understanding Catastrophic Forgetting in Language Models via Implicit Inference
Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
CLL · 128 · 71 · 0 · 18 Sep 2023

Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society
Jamell Dacon
SILM · 100 · 1 · 0 · 18 Sep 2023

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao, Yu Cao, Lu Lin, Jinghui Chen
AAML · 96 · 152 · 0 · 18 Sep 2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
ALM, LM&MA, LRM · 96 · 219 · 0 · 14 Sep 2023

RAIN: Your Language Models Can Align Themselves without Finetuning
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang
SILM · 104 · 118 · 0 · 13 Sep 2023

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu, Bingchen Zhao, Chen Wei, Cihang Xie
MLLM · 69 · 15 · 0 · 13 Sep 2023

Using Reed-Muller Codes for Classification with Rejection and Recovery
Daniel Fentham, David Parker, Mark Ryan
45 · 0 · 0 · 12 Sep 2023

Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh
24 · 0 · 0 · 08 Sep 2023

Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
Yuancheng Xu, Chenghao Deng, Yanchao Sun, Ruijie Zheng, Xiyao Wang, Jieyu Zhao, Furong Huang
81 · 4 · 0 · 07 Sep 2023

Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity
A. Amirova, T. Fteropoulli, Nafiso Ahmed, Martin R. Cowie, Joel Z Leibo
89 · 11 · 0 · 06 Sep 2023

Demystifying RCE Vulnerabilities in LLM-Integrated Apps
Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, Kai Chen
SILM · 163 · 24 · 0 · 06 Sep 2023

Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
AAML · 153 · 197 · 0 · 06 Sep 2023

Open Sesame! Universal Black Box Jailbreaking of Large Language Models
Raz Lapid, Ron Langberg, Moshe Sipper
AAML · 135 · 112 · 0 · 04 Sep 2023

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, ..., Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi
RALM, LRM, HILM · 156 · 582 · 0 · 03 Sep 2023

Robust and Efficient Interference Neural Networks for Defending Against Adversarial Attacks in ImageNet
Yunuo Xiong, Shujuan Liu, H. Xiong
AAML · 39 · 0 · 0 · 03 Sep 2023

Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping Yeh-Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
AAML · 176 · 410 · 0 · 01 Sep 2023

Why do universal adversarial attacks work on large language models?: Geometry might be the answer
Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
AAML · 91 · 11 · 0 · 01 Sep 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey, Euan Ong, Stuart J. Russell, Scott Emmons
VLM, MLLM · 93 · 88 · 0 · 01 Sep 2023

Large language models in medicine: the potentials and pitfalls
J. Omiye, Haiwen Gui, Shawheen J. Rezaei, James Zou, Roxana Daneshjou
LM&MA · 97 · 81 · 0 · 31 Aug 2023

A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation
Sahar Sadrizadeh, Ljiljana Dolamic, P. Frossard
AAML, SILM · 83 · 2 · 0 · 29 Aug 2023

Identifying and Mitigating the Security Risks of Generative AI
Clark W. Barrett, Bradley L Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, ..., Zulfikar Ramzan, Khawaja Shams, Basel Alomair, Ankur Taly, Diyi Yang
SILM · 125 · 101 · 0 · 28 Aug 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
79 · 162 · 0 · 28 Aug 2023

Detecting Language Model Attacks with Perplexity
Gabriel Alon, Michael Kamfonas
AAML · 136 · 229 · 0 · 27 Aug 2023

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
87 · 87 · 0 · 24 Aug 2023

Adversarial Illusions in Multi-Modal Embeddings
Tingwei Zhang, Rishi Jha, Eugene Bagdasaryan, Vitaly Shmatikov
AAML · 138 · 11 · 0 · 22 Aug 2023

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam R. Hruschka
LRM · 78 · 146 · 0 · 22 Aug 2023

Enhancing Adversarial Attacks: The Similar Target Method
Shuo Zhang, Ziruo Wang, Zikai Zhou, Huanran Chen
AAML · 98 · 1 · 0 · 21 Aug 2023

On the Adversarial Robustness of Multi-Modal Foundation Models
Christian Schlarmann, Matthias Hein
AAML · 180 · 107 · 0 · 21 Aug 2023

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
WIGM · 85 · 23 · 0 · 19 Aug 2023

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Rishabh Bhardwaj, Soujanya Poria
ELM · 108 · 159 · 0 · 18 Aug 2023

Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers, A. Luccioni
160 · 21 · 0 · 14 Aug 2023

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
SILM · 121 · 285 · 0 · 12 Aug 2023

CLEVA: Chinese Language Models EVAluation Platform
Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, ..., Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang
ALM, ELM · 100 · 11 · 0 · 09 Aug 2023

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zhenpeng Chen, Michael Backes, Yun Shen, Yang Zhang
SILM · 165 · 302 · 0 · 07 Aug 2023

Why We Don't Have AGI Yet
Peter Voss, M. Jovanovic
VLM · 30 · 2 · 0 · 07 Aug 2023

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
ALM, ELM, AILaw · 120 · 154 · 0 · 02 Aug 2023

Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc
D. Lenat, G. Marcus
67 · 30 · 0 · 31 Jul 2023

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Erfan Shayegani, Yue Dong, Nael B. Abu-Ghazaleh
119 · 152 · 0 · 26 Jul 2023

Effective Prompt Extraction from Language Models
Yiming Zhang, Nicholas Carlini, Daphne Ippolito
MIACV, SILM · 105 · 43 · 0 · 13 Jul 2023

Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
AAML · 120 · 172 · 0 · 22 Jun 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
AAML · 60 · 99 · 0 · 15 Jun 2023