
Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
arXiv:2307.15043 · PDF · HTML

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 948 papers shown
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
Jinhwa Kim
Ali Derakhshan
Ian G. Harris
AAML
104
16
0
31 Oct 2023
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
Pranav M. Gade
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
AI4MH
15
23
0
31 Oct 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
31
83
0
31 Oct 2023
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn
David Dobre
Stephan Günnemann
Gauthier Gidel
AAML
ELM
29
39
0
30 Oct 2023
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations
Aleksandar Petrov
Philip Torr
Adel Bibi
VPVLM
32
22
0
30 Oct 2023
BERT Lost Patience Won't Be Robust to Adversarial Slowdown
Zachary Coalson
Gabriel Ritter
Rakesh Bobba
Sanghyun Hong
AAML
24
1
0
29 Oct 2023
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
You-Ming Chang
Chen Yeh
Wei-Chen Chiu
Ning Yu
VPVLM
VLM
78
23
0
26 Oct 2023
Self-Guard: Empower the LLM to Safeguard Itself
Zezhong Wang
Fangkai Yang
Lu Wang
Pu Zhao
Hongru Wang
Liang Chen
Qingwei Lin
Kam-Fai Wong
83
29
0
24 Oct 2023
Unnatural language processing: How do language models handle machine-generated prompts?
Corentin Kervadec
Francesca Franzon
Marco Baroni
23
5
0
24 Oct 2023
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Sicheng Zhu
Ruiyi Zhang
Bang An
Gang Wu
Joe Barrow
Zichao Wang
Furong Huang
A. Nenkova
Tong Sun
SILM
AAML
30
41
0
23 Oct 2023
An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Xilie Xu
Keyi Kong
Ning Liu
Li-zhen Cui
Di Wang
Jingfeng Zhang
Mohan Kankanhalli
AAML
SILM
36
68
0
20 Oct 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Yupei Liu
Yuqi Jia
Runpeng Geng
Jinyuan Jia
Neil Zhenqiang Gong
SILM
LLMAG
27
63
0
19 Oct 2023
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Melanie Sclar
Yejin Choi
Yulia Tsvetkov
Alane Suhr
53
306
0
17 Oct 2023
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea
R. Dinu
Makesh Narsimhan Sreedhar
Christopher Parisien
Jonathan Cohen
KELM
21
133
0
16 Oct 2023
Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li
Yulin Chen
Jinglong Luo
Yan Kang
Xiaojin Zhang
Qi Hu
Chunkit Chan
Yangqiu Song
PILM
50
42
0
16 Oct 2023
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Shuyu Jiang
Xingshu Chen
Rui Tang
24
22
0
16 Oct 2023
Digital Deception: Generative Artificial Intelligence in Social Engineering and Phishing
Marc Schmitt
Ivan Flechais
26
36
0
15 Oct 2023
Is Certifying $\ell_p$ Robustness Still Worthwhile?
Ravi Mangal
Klas Leino
Zifan Wang
Kai Hu
Weicheng Yu
Corina S. Pasareanu
Anupam Datta
Matt Fredrikson
AAML
OOD
33
1
0
13 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
AAML
61
582
0
12 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang
Samyak Gupta
Mengzhou Xia
Kai Li
Danqi Chen
AAML
35
273
0
10 Oct 2023
Multilingual Jailbreak Challenges in Large Language Models
Yue Deng
Wenxuan Zhang
Sinno Jialin Pan
Lidong Bing
AAML
36
114
0
10 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei
Yifei Wang
Ang Li
Yichuan Mo
Yisen Wang
54
237
0
10 Oct 2023
The Emergence of Reproducibility and Generalizability in Diffusion Models
Huijie Zhang
Jinfan Zhou
Yifu Lu
Minzhe Guo
Peng Wang
Liyue Shen
Qing Qu
DiffM
28
2
0
08 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
70
533
0
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
49
220
0
05 Oct 2023
Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
Shawqi Al-Maliki
Adnan Qayyum
Hassan Ali
M. Abdallah
Junaid Qadir
D. Hoang
Dusit Niyato
Ala I. Al-Fuqaha
AAML
34
3
0
05 Oct 2023
Misusing Tools in Large Language Models With Visual Adversarial Examples
Xiaohan Fu
Zihan Wang
Shuheng Li
Rajesh K. Gupta
Niloofar Mireshghallah
Taylor Berg-Kirkpatrick
Earlence Fernandes
AAML
31
24
0
04 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang
Xiao Wang
Qi Zhang
Linda R. Petzold
William Y. Wang
Xun Zhao
Dahua Lin
26
163
0
04 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong
Cristina Menghini
Stephen H. Bach
SILM
31
173
0
03 Oct 2023
Jailbreaker in Jail: Moving Target Defense for Large Language Models
Bocheng Chen
Advait Paliwal
Qiben Yan
AAML
39
14
0
03 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu
Nan Xu
Muhao Chen
Chaowei Xiao
SILM
38
262
0
03 Oct 2023
Can Language Models be Instructed to Protect Personal Information?
Yang Chen
Ethan Mendes
Sauvik Das
Wei-ping Xu
Alan Ritter
PILM
27
35
0
03 Oct 2023
Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
Qiming Xie
Zengzhi Wang
Yi Feng
Rui Xia
AAML
HILM
35
9
0
03 Oct 2023
Large Language Models Cannot Self-Correct Reasoning Yet
Jie Huang
Xinyun Chen
Swaroop Mishra
Huaixiu Steven Zheng
Adams Wei Yu
Xinying Song
Denny Zhou
ReLM
LRM
38
422
0
03 Oct 2023
LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model
Muhammad Ahmed Shah
Roshan S. Sharma
Hira Dhamyal
R. Olivier
Ankit Shah
...
Massa Baali
Soham Deshmukh
Michael Kuhlmann
Bhiksha Raj
Rita Singh
AAML
30
19
0
02 Oct 2023
What's the Magic Word? A Control Theory of LLM Prompting
Aman Bhargava
Cameron Witkowski
Manav Shah
Matt W. Thomson
LLMAG
61
30
0
02 Oct 2023
On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang
Zhimeng Guo
Huaisheng Zhu
Bochuan Cao
Lu Lin
Jinyuan Jia
Jinghui Chen
Di Wu
78
23
0
02 Oct 2023
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil
Peter Hase
Joey Tianyi Zhou
KELM
AAML
31
97
0
29 Sep 2023
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger
Noemi Dreksler
Richard Moulange
Emily Dardaman
Jonas Schuett
...
Emma Bluemke
Michael Aird
Patrick Levermore
Julian Hazell
Abhishek Gupta
25
40
0
29 Sep 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa
Aleksandar Petrov
Simon Frieder
Christoph Weinhuber
Ryan Burnell
Raza Nazar
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
ALM
ELM
35
3
0
28 Sep 2023
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Lorenzo Pacchiardi
A. J. Chan
Sören Mindermann
Ilan Moscovitz
Alexa Y. Pan
Y. Gal
Owain Evans
J. Brauner
LLMAG
HILM
22
49
0
26 Sep 2023
Large Language Model Alignment: A Survey
Tianhao Shen
Renren Jin
Yufei Huang
Chuang Liu
Weilong Dong
Zishan Guo
Xinwei Wu
Yan Liu
Deyi Xiong
LM&MA
24
177
0
26 Sep 2023
Can LLM-Generated Misinformation Be Detected?
Canyu Chen
Kai Shu
DeLMO
39
158
0
25 Sep 2023
ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning
Hosein Hasanbeig
Hiteshi Sharma
Leo Betthauser
Felipe Vieira Frujeri
Ida Momennejad
38
15
0
24 Sep 2023
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Tianle Li
Siyuan Zhuang
...
Zi Lin
Eric P. Xing
Joseph E. Gonzalez
Ion Stoica
Haotong Zhang
29
180
0
21 Sep 2023
Knowledge Sanitization of Large Language Models
Yoichi Ishibashi
Hidetoshi Shimodaira
KELM
39
19
0
21 Sep 2023
How Robust is Google's Bard to Adversarial Image Attacks?
Yinpeng Dong
Huanran Chen
Jiawei Chen
Zhengwei Fang
Xiaohu Yang
Yichi Zhang
Yu Tian
Hang Su
Jun Zhu
AAML
36
102
0
21 Sep 2023
Model Leeching: An Extraction Attack Targeting LLMs
Lewis Birch
William Hackett
Stefan Trawicki
N. Suri
Peter Garraghan
32
13
0
19 Sep 2023
LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
Umar Iqbal
Tadayoshi Kohno
Franziska Roesner
ELM
SILM
74
49
0
19 Sep 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
119
303
0
19 Sep 2023