2307.02483
Cited By
Jailbroken: How Does LLM Safety Training Fail?
5 July 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
Papers citing "Jailbroken: How Does LLM Safety Training Fail?"
50 / 640 papers shown
Title
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
Tong Liu
Yingjie Zhang
Zhe Zhao
Yinpeng Dong
Guozhu Meng
Kai Chen
AAML
51
44
0
28 Feb 2024
Adversarial Math Word Problem Generation
Roy Xie
Chengxuan Huang
Junlin Wang
Bhuwan Dhingra
AAML
33
1
0
27 Feb 2024
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
Zhenting Qi
Hanlin Zhang
Eric Xing
Sham Kakade
Hima Lakkaraju
SILM
44
18
0
27 Feb 2024
Securing Reliability: A Brief Overview on Enhancing In-Context Learning for Foundation Models
Yunpeng Huang
Yaonan Gu
Jingwei Xu
Zhihong Zhu
Zhaorun Chen
Xiaoxing Ma
40
3
0
27 Feb 2024
Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
Zhenhong Zhou
Jiuyang Xiang
Haopeng Chen
Quan Liu
Zherui Li
Sen Su
37
19
0
27 Feb 2024
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu
Kai Zhang
Yuan Li
Zhiling Yan
Chujie Gao
...
Yue Huang
Hanchi Sun
Jianfeng Gao
Lifang He
Lichao Sun
VLM
VGen
EGVM
75
260
0
27 Feb 2024
WIPI: A New Web Threat for LLM-Driven Web Agents
Fangzhou Wu
Shutong Wu
Yulong Cao
Chaowei Xiao
LLMAG
34
20
0
26 Feb 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELM
MU
42
57
0
26 Feb 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
Paul Röttger
Valentin Hofmann
Valentina Pyatkin
Musashi Hinck
Hannah Rose Kirk
Hinrich Schütze
Dirk Hovy
ELM
26
53
0
26 Feb 2024
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Huijie Lv
Xiao Wang
Yuan Zhang
Caishuang Huang
Shihan Dou
Junjie Ye
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
44
29
0
26 Feb 2024
Defending LLMs against Jailbreaking Attacks via Backtranslation
Yihan Wang
Zhouxing Shi
Andrew Bai
Cho-Jui Hsieh
AAML
40
33
0
26 Feb 2024
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu
Shuaibao Wang
Yang Liu
Ning Liu
AAML
39
7
0
24 Feb 2024
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology
Zhenhua Wang
Wei Xie
Baosheng Wang
Enze Wang
Zhiwen Gui
Shuoyoucheng Ma
Kai Chen
36
14
0
24 Feb 2024
Fast Adversarial Attacks on Language Models In One GPU Minute
Vinu Sankar Sadasivan
Shoumik Saha
Gaurang Sriramanan
Priyatham Kattakinda
Atoosa Malemir Chegini
S. Feizi
MIALM
43
34
0
23 Feb 2024
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
Somnath Banerjee
Sayan Layek
Rima Hazra
Animesh Mukherjee
29
11
0
23 Feb 2024
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
Heegyu Kim
Sehyun Yuk
Hyunsouk Cho
AAML
41
16
0
23 Feb 2024
A Conversational Brain-Artificial Intelligence Interface
Anja Meunier
Michal Robert Zák
Lucas Munz
Sofiya Garkot
Manuel Eder
Jiachen Xu
Moritz Grosse-Wentrup
40
0
0
22 Feb 2024
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping
Alex Stein
Manli Shu
Khalid Saifullah
Yuxin Wen
Tom Goldstein
AAML
48
43
0
21 Feb 2024
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina
Adian Liusie
Mark J. F. Gales
AAML
ELM
32
53
0
21 Feb 2024
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
Federico Bianchi
James Zou
32
4
0
21 Feb 2024
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
Zihao Xu
Yi Liu
Gelei Deng
Yuekang Li
S. Picek
PILM
AAML
41
35
0
21 Feb 2024
The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Yu Kong
Tianlong Chen
Huan Liu
41
15
0
20 Feb 2024
Is the System Message Really Important to Jailbreaks in Large Language Models?
Xiaotian Zou
Yongkang Chen
Ke Li
27
13
0
20 Feb 2024
Generative AI Security: Challenges and Countermeasures
Banghua Zhu
Norman Mu
Jiantao Jiao
David Wagner
AAML
SILM
61
8
0
20 Feb 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou
Yufei Han
Haomin Zhuang
Kehan Guo
Zhenwen Liang
Hongyan Bao
Xiangliang Zhang
LLMAG
AAML
42
11
0
20 Feb 2024
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
Aiwei Liu
Haoping Bai
Zhiyun Lu
Xiang Kong
Simon Wang
Jiulong Shan
Mengsi Cao
Lijie Wen
ALM
34
12
0
19 Feb 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Zhen Xiang
Bhaskar Ramasubramanian
Bo Li
Radha Poovendran
49
86
0
19 Feb 2024
How Susceptible are Large Language Models to Ideological Manipulation?
Kai Chen
Zihao He
Jun Yan
Taiwei Shi
Kristina Lerman
40
10
0
18 Feb 2024
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye
Sixian Li
Guanyu Li
Caishuang Huang
Songyang Gao
Yilong Wu
Qi Zhang
Tao Gui
Xuanjing Huang
LLMAG
38
16
0
16 Feb 2024
Recovering the Pre-Fine-Tuning Weights of Generative Models
Eliahu Horwitz
Jonathan Kahana
Yedid Hoshen
50
10
0
15 Feb 2024
A StrongREJECT for Empty Jailbreaks
Alexandra Souly
Qingyuan Lu
Dillon Bowen
Tu Trinh
Elvis Hsieh
...
Pieter Abbeel
Justin Svegliato
Scott Emmons
Olivia Watkins
Sam Toyer
33
67
0
15 Feb 2024
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Timothy R. McIntosh
Teo Susnjak
Tong Liu
Paul Watters
Malka N. Halgamuge
ALM
ELM
64
51
0
15 Feb 2024
PAL: Proxy-Guided Black-Box Attack on Large Language Models
Chawin Sitawarin
Norman Mu
David Wagner
Alexandre Araujo
ELM
29
29
0
15 Feb 2024
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
Yixin Cheng
Markos Georgopoulos
V. Cevher
Grigorios G. Chrysos
AAML
27
15
0
14 Feb 2024
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
Zhiyuan Chang
Mingyang Li
Yi Liu
Junjie Wang
Qing Wang
Yang Liu
94
38
0
14 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Jinyuan Jia
Bill Yuchen Lin
Radha Poovendran
AAML
131
86
0
14 Feb 2024
Rethinking Machine Unlearning for Large Language Models
Sijia Liu
Yuanshun Yao
Jinghan Jia
Stephen Casper
Nathalie Baracaldo
...
Hang Li
Kush R. Varshney
Mohit Bansal
Sanmi Koyejo
Yang Liu
AILaw
MU
75
84
0
13 Feb 2024
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo
Fangxu Yu
Huan Zhang
Lianhui Qin
Bin Hu
AAML
117
70
0
13 Feb 2024
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Ye Wang
Jing Jiang
Min-Bin Lin
LLMAG
LM&Ro
37
49
0
13 Feb 2024
Lying Blindly: Bypassing ChatGPT's Safeguards to Generate Hard-to-Detect Disinformation Claims at Scale
Freddy Heppell
M. Bakir
Kalina Bontcheva
DeLMO
33
1
0
13 Feb 2024
Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
Gelei Deng
Yi Liu
Kailong Wang
Yuekang Li
Tianwei Zhang
Yang Liu
26
43
0
13 Feb 2024
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
Wei Zou
Runpeng Geng
Binghui Wang
Jinyuan Jia
SILM
39
18
1
12 Feb 2024
Whispers in the Machine: Confidentiality in LLM-integrated Systems
Jonathan Evertz
Merlin Chlosta
Lea Schönherr
Thorsten Eisenhofer
74
17
0
10 Feb 2024
StruQ: Defending Against Prompt Injection with Structured Queries
Sizhe Chen
Julien Piet
Chawin Sitawarin
David Wagner
SILM
AAML
30
67
0
09 Feb 2024
Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Yichuan Mo
Yuji Wang
Zeming Wei
Yisen Wang
AAML
SILM
49
25
0
09 Feb 2024
Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu
Yugeng Liu
Ziqing Yang
Xinyue Shen
Michael Backes
Yang Zhang
AAML
37
67
0
08 Feb 2024
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia
Guangyu Shen
Shuyang Cheng
Kai-xian Zhang
Guanhong Tao
Shengwei An
Lu Yan
Zhuo Zhang
Shiqing Ma
Xiangyu Zhang
20
10
0
08 Feb 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
60
79
0
07 Feb 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li
Bowen Dong
Ruohui Wang
Xuhao Hu
Wangmeng Zuo
Dahua Lin
Yu Qiao
Jing Shao
ELM
30
87
0
07 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
26
320
0
06 Feb 2024