Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
ArXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Showing 50 of 1,101 citing papers.
Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan, Richard Fang, R. Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
09 Nov 2023
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"
C. D. Freeman, Laura J. Culp, Aaron T Parisi, Maxwell Bileschi, Gamaleldin F. Elsayed, ..., Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Narain Sohl-Dickstein
08 Nov 2023
Do LLMs exhibit human-like response biases? A case study in survey design
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
07 Nov 2023
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
06 Nov 2023
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
06 Nov 2023
Can LLMs Follow Simple Rules?
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner
06 Nov 2023
Market Concentration Implications of Foundation Models
Jai Vipra, Anton Korinek
02 Nov 2023
Implicit Chain of Thought Reasoning via Knowledge Distillation
Yuntian Deng, Kiran Prasad, Roland Fernandez, P. Smolensky, Vishrav Chaudhary, Stuart M. Shieber
02 Nov 2023
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, ..., Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart J. Russell
02 Nov 2023
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
Jinhwa Kim, Ali Derakhshan, Ian G. Harris
31 Oct 2023
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
Pranav M. Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
31 Oct 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
31 Oct 2023
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel
30 Oct 2023
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations
Aleksandar Petrov, Philip Torr, Adel Bibi
30 Oct 2023
BERT Lost Patience Won't Be Robust to Adversarial Slowdown
Zachary Coalson, Gabriel Ritter, Rakesh Bobba, Sanghyun Hong
29 Oct 2023
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu
26 Oct 2023
Self-Guard: Empower the LLM to Safeguard Itself
Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong
24 Oct 2023
Unnatural language processing: How do language models handle machine-generated prompts?
Corentin Kervadec, Francesca Franzon, Marco Baroni
24 Oct 2023
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun
23 Oct 2023
An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Xilie Xu, Keyi Kong, Ning Liu, Li-zhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli
20 Oct 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong
19 Oct 2023
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr
17 Oct 2023
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea, R. Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, Jonathan Cohen
16 Oct 2023
Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
16 Oct 2023
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Shuyu Jiang, Xingshu Chen, Rui Tang
16 Oct 2023
Digital Deception: Generative Artificial Intelligence in Social Engineering and Phishing
Marc Schmitt, Ivan Flechais
15 Oct 2023
Is Certifying $\ell_p$ Robustness Still Worthwhile?
Ravi Mangal, Klas Leino, Zifan Wang, Kai Hu, Weicheng Yu, Corina S. Pasareanu, Anupam Datta, Matt Fredrikson
13 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong
12 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
10 Oct 2023
Multilingual Jailbreak Challenges in Large Language Models
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
10 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
10 Oct 2023
The Emergence of Reproducibility and Generalizability in Diffusion Models
Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu
08 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
05 Oct 2023
Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
Shawqi Al-Maliki, Adnan Qayyum, Hassan Ali, M. Abdallah, Junaid Qadir, D. Hoang, Dusit Niyato, Ala I. Al-Fuqaha
05 Oct 2023
Misusing Tools in Large Language Models With Visual Adversarial Examples
Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes
04 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Y. Wang, Xun Zhao, Dahua Lin
04 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
03 Oct 2023
Jailbreaker in Jail: Moving Target Defense for Large Language Models
Bocheng Chen, Advait Paliwal, Qiben Yan
03 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
03 Oct 2023
Can Language Models be Instructed to Protect Personal Information?
Yang Chen, Ethan Mendes, Sauvik Das, Wei Xu, Alan Ritter
03 Oct 2023
Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
03 Oct 2023
Large Language Models Cannot Self-Correct Reasoning Yet
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou
03 Oct 2023
LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model
Muhammad Ahmed Shah, Roshan S. Sharma, Hira Dhamyal, R. Olivier, Ankit Shah, ..., Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh
02 Oct 2023
What's the Magic Word? A Control Theory of LLM Prompting
Aman Bhargava, Cameron Witkowski, Manav Shah, Matt W. Thomson
02 Oct 2023
On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Di Wu
02 Oct 2023
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil, Peter Hase, Joey Tianyi Zhou
29 Sep 2023
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, ..., Emma Bluemke, Michael Aird, Patrick Levermore, Julian Hazell, Abhishek Gupta
29 Sep 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa, Aleksandar Petrov, Simon Frieder, Christoph Weinhuber, Ryan Burnell, Raza Nazar, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
28 Sep 2023
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Lorenzo Pacchiardi, A. J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Y. Gal, Owain Evans, J. Brauner
26 Sep 2023