Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
ArXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 1,101 papers shown
Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
177
8
0
02 Oct 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee
Haebin Seong
Dong Bok Lee
Minki Kang
Xiaoyin Chen
Dominik Wagner
Yoshua Bengio
Juho Lee
Sung Ju Hwang
231
6
0
02 Oct 2024
PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System
Gary D. Lopez Munoz
Amanda Minnich
Roman Lutz
Richard Lundeen
Raja Sekhar Rao Dheekonda
...
Tori Westerhoff
Chang Kawaguchi
Christian Seifert
Ram Shankar Siva Kumar
Yonatan Zunger
SILM
111
11
0
01 Oct 2024
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du
Reshmi Ghosh
Robert Sim
Ahmed Salem
Vitor Carvalho
Emily Lawton
Yixuan Li
Jack W. Stokes
VLM, AAML
99
7
0
01 Oct 2024
Do Influence Functions Work on Large Language Models?
Zhe Li
Wei Zhao
Yige Li
Jun Sun
TDI
92
3
0
30 Sep 2024
Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges
Qin Liu
Wenjie Mo
Terry Tong
Lyne Tchapmi
Fei Wang
Chaowei Xiao
Muhao Chen
AAML
92
4
0
30 Sep 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
150
19
0
30 Sep 2024
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks
Rongchang Li
Minjie Chen
Chang Hu
Han Chen
Wenpeng Xing
Meng Han
SILM, ELM
56
2
0
29 Sep 2024
Identifying Knowledge Editing Types in Large Language Models
Xiaopeng Li
Shasha Li
Shangwen Wang
Shezheng Song
Bin Ji
Huijun Liu
Jun Ma
Jie Yu
KELM
75
2
0
29 Sep 2024
Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu
Zhixin Lai
Jiawen Wang
Gengyuan Zhang
Shuo Chen
Philip Torr
Vera Demberg
Volker Tresp
Jindong Gu
73
5
0
27 Sep 2024
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
AAML
132
46
0
26 Sep 2024
MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
Giandomenico Cornacchia
Giulio Zizzo
Kieran Fraser
Muhammad Zaid Hameed
Ambrish Rawat
Mark Purcell
75
3
0
26 Sep 2024
AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure
Xi Chen
Zhiyang Zhang
Fangkai Yang
Xiaoting Qin
Chao Du
...
Hangxin Liu
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
Qi Zhang
39
1
0
26 Sep 2024
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Yifan Jiang
Kriti Aggarwal
Tanmay Laud
Kashif Munir
Jay Pujara
Subhabrata Mukherjee
AAML
116
0
0
26 Sep 2024
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki
Boyi Wei
Yangsibo Huang
Peter Henderson
F. Tramèr
Javier Rando
MU, AAML
206
53
0
26 Sep 2024
LLM Echo Chamber: personalized and automated disinformation
Tony Ma
46
1
0
24 Sep 2024
Steward: Natural Language Web Automation
Brian Tang
Kang G. Shin
LLMAG
61
1
0
23 Sep 2024
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat
Stefan Schoepf
Giulio Zizzo
Giandomenico Cornacchia
Muhammad Zaid Hameed
...
Elizabeth M. Daly
Mark Purcell
P. Sattigeri
Pin-Yu Chen
Kush R. Varshney
AAML
104
8
0
23 Sep 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
124
8
0
22 Sep 2024
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Zhihao Lin
Wei Ma
Mingyi Zhou
Yanjie Zhao
Haoyu Wang
Yang Liu
Jun Wang
Li Li
AAML
91
8
0
21 Sep 2024
Prompt Obfuscation for Large Language Models
David Pape
Thorsten Eisenhofer
Lea Schönherr
AAML
172
4
0
17 Sep 2024
Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
Stav Cohen
Ron Bitton
Ben Nassi
85
7
0
12 Sep 2024
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain
Ahmed Imteaj
AAML, VLM
81
6
0
11 Sep 2024
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
Lijia Lv
Weigang Zhang
Xuehai Tang
Jie Wen
Feng Liu
Jizhong Han
Songlin Hu
AAML
72
2
0
11 Sep 2024
DiPT: Enhancing LLM reasoning through diversified perspective-taking
H. Just
Mahavir Dabas
Lifu Huang
Ming Jin
Ruoxi Jia
LRM
74
1
0
10 Sep 2024
Towards Safe Multilingual Frontier AI
Artūrs Kanepajs
Vladimir Ivanov
Richard Moulange
81
2
0
06 Sep 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM, VLM
111
5
0
06 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM, AAML
121
2
0
05 Sep 2024
ContextCite: Attributing Model Generation to Context
Benjamin Cohen-Wang
Harshay Shah
Kristian Georgiev
Aleksander Madry
LRM
93
30
0
01 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An
Sicheng Zhu
Ruiyi Zhang
Michael-Andrei Panaitescu-Liess
Yuancheng Xu
Furong Huang
AAML
138
18
0
01 Sep 2024
Acceptable Use Policies for Foundation Models
Kevin Klyman
69
17
0
29 Aug 2024
FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu
Supriti Vijay
AAML
67
1
0
28 Aug 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu
Jiangyi Deng
Shengyuan Pang
Yanjiao Chen
Jiayang Xu
Xinfeng Li
Wei Dong
135
8
0
28 Aug 2024
AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems
Chi-Min Chan
Jianxuan Yu
Weize Chen
Chunyang Jiang
Xinyu Liu
Weijie Shi
Zhiyuan Liu
Wei Xue
Yike Guo
LLMAG
85
3
0
27 Aug 2024
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Hongfu Liu
Yuxi Xie
Ye Wang
Michael Shieh
140
3
0
27 Aug 2024
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang
Philip Torr
Mohamed Elhoseiny
Adel Bibi
200
15
0
27 Aug 2024
LLM-PBE: Assessing Data Privacy in Large Language Models
Qinbin Li
Junyuan Hong
Chulin Xie
Jeffrey Tan
Rachel Xin
...
Dan Hendrycks
Zhangyang Wang
Bo Li
Bingsheng He
Dawn Song
ELM, PILM
116
18
0
23 Aug 2024
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
Yige Li
Hanxun Huang
Yunhan Zhao
Xingjun Ma
Jun Sun
AAML, SILM
113
1
0
23 Aug 2024
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Mamadou Keita
W. Hamidouche
Hessen Bougueffa Eutamene
Abdelmalik Taleb-Ahmed
Abdenour Hadid
VLM
130
1
0
22 Aug 2024
Approaching Deep Learning through the Spectral Dynamics of Weights
David Yunis
Kumar Kshitij Patel
Samuel Wheeler
Pedro H. P. Savarese
Gal Vardi
Karen Livescu
Michael Maire
Matthew R. Walter
112
3
0
21 Aug 2024
Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu
Junzhe Yu
Huijia Sun
Ling Shi
Gelei Deng
Yuqi Chen
Yang Liu
98
6
0
21 Aug 2024
EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao
Zhihao Dou
Kaizhu Huang
AAML
67
3
0
21 Aug 2024
Learning Randomized Algorithms with Transformers
J. Oswald
Seijin Kobayashi
Yassir Akram
Angelika Steger
AAML
84
1
0
20 Aug 2024
Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
Hongbang Yuan
Zhuoran Jin
Pengfei Cao
Yubo Chen
Kang Liu
Jun Zhao
AAML, ELM, MU
90
9
0
20 Aug 2024
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
Haoyu Wang
Bingzhe Wu
Yatao Bian
Yongzhe Chang
Xueqian Wang
Peilin Zhao
142
2
0
20 Aug 2024
Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory
Yongxin Deng
Xihe Qiu
Jue Chen
Jing Pan
Chen Jue
Zhijun Fang
Yinghui Xu
Wei Chu
Yuan Qi
67
3
0
20 Aug 2024
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles
Zhilong Wang
Haizhou Wang
Nanqing Luo
Lan Zhang
Xiaoyan Sun
Yebo Cao
Peng Liu
66
0
0
20 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Donghai Hong
Jiaying Chen
Wenhai Wang
74
3
0
18 Aug 2024
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Yulin Chen
Haoran Li
Zihao Zheng
Yangqiu Song
Bryan Hooi
178
7
0
17 Aug 2024
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
Fenghua Weng
Yue Xu
Chengyan Fu
Wenjie Wang
AAML
90
0
0
16 Aug 2024