Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Links: arXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Showing 50 of 1,101 citing papers:
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu
85 · 4 · 0 · 03 Jul 2024

Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng
AAML
71 · 1 · 0 · 01 Jul 2024

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
Xiaotian Zou, Ke Li, Yongkang Chen
MLLM
61 · 2 · 0 · 01 Jul 2024

Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models
Shaeke Salman, M. Shams, Xiuwen Liu
60 · 2 · 0 · 01 Jul 2024

Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Qi Qian, Xiaoqing Zheng, Xuanjing Huang
AAML, LRM
111 · 4 · 0 · 01 Jul 2024

Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed, Haz Sameen Shahgir
71 · 1 · 0 · 29 Jun 2024

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
SILM, AAML
103 · 35 · 0 · 28 Jun 2024

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
Yuqi Zhou, Lin Lu, Hanchi Sun, Pan Zhou, Lichao Sun
86 · 10 · 0 · 28 Jun 2024

Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim, Saleh Almohaimeed, Mengxin Zheng, Yan Solihin, Qian Lou
65 · 4 · 0 · 26 Jun 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
128 · 101 · 0 · 26 Jun 2024

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam
CLL, KELM
92 · 8 · 0 · 26 Jun 2024

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang, Wanxu Zhao, Rui Zheng, Huijie Lv, Shihan Dou, ..., Junjie Ye, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang
LLMSV, AAML
121 · 9 · 0 · 26 Jun 2024

"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in
  Retrieval-Augmented Generative Models
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Song Wang
Jundong Li
Tianlong Chen
Huan Liu
SILM
101
25
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
PILM
99 · 32 · 0 · 26 Jun 2024

AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies
Yi Zeng, Kevin Klyman, Andy Zhou, Yu Yang, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li
98 · 27 · 0 · 25 Jun 2024

From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake, Eunsol Choi, Greg Durrett
111 · 14 · 0 · 25 Jun 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
AAML, LLMSV
70 · 25 · 0 · 24 Jun 2024

Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Zhengyue Zhao, Xiaoyun Zhang, Kaidi Xu, Xing Hu, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen
71 · 8 · 0 · 24 Jun 2024

Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li, Yifan Wang, A. Grama, Ruqi Zhang
AI4TS
145 · 15 · 0 · 24 Jun 2024

Serial Position Effects of Large Language Models
Xiaobo Guo, Soroush Vosoughi
81 · 8 · 0 · 23 Jun 2024

Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman
LLMSV, AAML
112 · 24 · 0 · 21 Jun 2024

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren
98 · 2 · 0 · 21 Jun 2024

Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong
174 · 3 · 0 · 21 Jun 2024

The Fire Thief Is Also the Keeper: Balancing Usability and Privacy in Prompts
Zhili Shen, Zihang Xi, Ying He, Wei Tong, Jingyu Hua, Sheng Zhong
SILM
86 · 8 · 0 · 20 Jun 2024

Prompt Injection Attacks in Defended Systems
Daniil Khomsky, Narek Maloyan, Bulat Nutfullin
AAML, SILM
78 · 4 · 0 · 20 Jun 2024

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Yalan Qin, Chongye Guo, Borong Zhang, Boyuan Chen, Josef Dai, ..., Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, Yaodong Yang
95 · 51 · 0 · 20 Jun 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, ..., Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ALM, ELM
189 · 79 · 0 · 20 Jun 2024

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr
LLMAG, AAML
130 · 45 · 1 · 19 Jun 2024

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens
Xikang Yang, Xuehai Tang, Fuqing Zhu, Jizhong Han, Songlin Hu
VLM, AAML
79 · 1 · 0 · 19 Jun 2024

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao
AAML, DeLMO, AILaw
126 · 22 · 0 · 18 Jun 2024

[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao, Monojit Choudhury, Somak Aditya
77 · 0 · 0 · 18 Jun 2024

Adversarial Attacks on Large Language Models in Medicine
Yifan Yang, Qiao Jin, Furong Huang, Zhiyong Lu
AAML
112 · 6 · 0 · 18 Jun 2024

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, D. Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
SILM, AAML
122 · 10 · 0 · 18 Jun 2024

IDs for AI Systems
Alan Chan, Noam Kolt, Peter Wills, Usman Anwar, Christian Schroeder de Witt, Nitarshan Rajkumar, Lewis Hammond, David M. Krueger, Lennart Heim, Markus Anderljung
102 · 7 · 0 · 17 Jun 2024

Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
LLMSV
98 · 8 · 0 · 17 Jun 2024

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
Kenneth Li, Yiming Wang, Fernanda Viégas, Martin Wattenberg
77 · 7 · 0 · 17 Jun 2024

Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
S. Kadhe, Farhan Ahmed, Dennis Wei, Nathalie Baracaldo, Inkit Padhi
MoMe, MU
90 · 8 · 0 · 17 Jun 2024

STAR: SocioTechnical Approach to Red Teaming Language Models
Laura Weidinger, John F. J. Mellor, Bernat Guillen Pegueroles, Nahema Marchal, Ravin Kumar, ..., Mark Diaz, Stevie Bergman, Mikel Rodriguez, Verena Rieser, William S. Isaac
VLM
78 · 7 · 0 · 17 Jun 2024

Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
169 · 218 · 0 · 17 Jun 2024

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, E. Milios, Sageev Oore, Hassan Sajjad
CoGe
94 · 12 · 0 · 17 Jun 2024

Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models
Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li
ALM
94 · 0 · 0 · 17 Jun 2024

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, ..., Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao
VLM
228 · 33 · 0 · 17 Jun 2024

garak: A Framework for Security Probing Large Language Models
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie
AAML, ELM
95 · 20 · 0 · 16 Jun 2024

Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications
Stephen Burabari Tete
105 · 7 · 0 · 16 Jun 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang
84 · 17 · 0 · 16 Jun 2024

Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, Siheng Chen
AAML
154 · 10 · 0 · 15 Jun 2024

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, Kaikai Zhao, Kai Wang, Shiguo Lian
ELM
99 · 8 · 0 · 14 Jun 2024

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu, Fan Liu, Hao Liu
AAML
126 · 16 · 0 · 13 Jun 2024

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball, Frauke Kreuter, Nina Rimsky
87 · 18 · 0 · 13 Jun 2024

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
ELM
163 · 12 · 0 · 13 Jun 2024