Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
5 July 2023 · arXiv:2307.02483

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

Showing 50 of 636 citing papers.

"I Always Felt that Something Was Wrong.": Understanding Compliance Risks and Mitigation Strategies when Professionals Use Large Language Models
Siying Hu, Piaohong Wang, Yaxing Yao, Zhicong Lu · AILaw, PILM · 07 Nov 2024

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao · AAML · 06 Nov 2024

Defining and Evaluating Physical Safety for Large Language Models
Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho · ELM · 04 Nov 2024

Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev, Matthew Siu, Arthur Conmy · LLMSV · 04 Nov 2024

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye · 04 Nov 2024

Achieving Domain-Independent Certified Robustness via Knowledge Continuity
Alan Sun, Chiyu Ma, Kenneth Ge, Soroush Vosoughi · 03 Nov 2024

SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao, Kejiang Chen, Wenbo Zhang, Nenghai Yu · AAML · 03 Nov 2024

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper · 02 Nov 2024

Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang · AAML · 01 Nov 2024

Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson · AAML · 01 Nov 2024

RESTOR: Knowledge Recovery through Machine Unlearning
Keivan Rezaei, Khyathi Raghavi Chandu, S. Feizi, Yejin Choi, Faeze Brahman, Abhilasha Ravichander · KELM, CLL, MU · 31 Oct 2024

Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs
Muhammed Saeed, Elgizouli Mohamed, Mukhtar Mohamed, Shaina Raza, Muhammad Abdul-Mageed, Shady Shehata · 31 Oct 2024

Responsible Retrieval Augmented Generation for Climate Decision Making from Documents
Matyas Juhasz, Kalyan Dutia, Henry Franks, Conor Delahunty, Patrick Fawbert Mills, Harrison Pim · 31 Oct 2024

ProTransformer: Robustify Transformers via Plug-and-Play Paradigm
Zhichao Hou, Weizhi Gao, Yuchen Shen, Feiyi Wang, Xiaorui Liu · VLM · 30 Oct 2024

Toxicity of the Commons: Curating Open-Source Pre-Training Data
Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais · 29 Oct 2024

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Vishal Kumar, Zeyi Liao, Jaylen Jones, Huan Sun · AAML · 29 Oct 2024

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou, Shikun Zhang, Wei Ye · ELM · 29 Oct 2024

Enhancing Adversarial Attacks through Chain of Thought
Jingbo Su · LRM · 29 Oct 2024

RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
Tanqiu Jiang, Zian Wang, Jiacheng Liang, Changjiang Li, Yuhui Wang, Ting Wang · AAML · 25 Oct 2024

FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu · 25 Oct 2024

Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu · AAML · 24 Oct 2024

Cross-model Control: Improving Multiple Large Language Models in One-time Training
Jiayi Wu, Hao Sun, Hengyi Cai, Lixin Su, S. Wang, Dawei Yin, Xiang Li, Ming Gao · MU · 23 Oct 2024

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David M. Krueger · LLMSV · 22 Oct 2024

Enhancing Answer Attribution for Faithful Text Generation with Large Language Models
Juraj Vladika, Luca Mülln, Florian Matthes · 22 Oct 2024

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Itay Nakash, George Kour, Guy Uziel, Ateret Anaby-Tavor · AAML, LLMAG · 22 Oct 2024

LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang · 22 Oct 2024

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao-Li Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu · AAML · 20 Oct 2024

Imprompter: Tricking LLM Agents into Improper Tool Use
Xiaohan Fu, Shuheng Li, Zihan Wang, Y. Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, Earlence Fernandes · SILM, LLMAG · 19 Oct 2024

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Yujun Zhou, Jingdong Yang, Kehan Guo, Pin-Yu Chen, Tian Gao, ..., Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang · 18 Oct 2024

SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao · AAML, SILM · 17 Oct 2024

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Fan Zhang, Yongbin Li · 17 Oct 2024

Estimating the Probabilities of Rare Outputs in Language Models
Gabriel Wu, Jacob Hilton · AAML, UQCV · 17 Oct 2024

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song · AI4CE · 16 Oct 2024

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
Jaehong Yoon, Shoubin Yu, Vaidehi Patil, Huaxiu Yao, Joey Tianyi Zhou · 16 Oct 2024

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo · AAML · 15 Oct 2024

Cognitive Overload Attack: Prompt Injection for Long Context
Bibek Upadhayay, Vahid Behzadan, Amin Karbasi · AAML · 15 Oct 2024

Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang · 14 Oct 2024

Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong · 14 Oct 2024

Fast Convergence of Φ-Divergence Along the Unadjusted Langevin Algorithm and Proximal Sampler
Siddharth Mitra, Andre Wibisono · 14 Oct 2024

Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng · AAML · 13 Oct 2024

Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
Ankita Sinha, Wendi Cui, Kamalika Das, Jiaxin Zhang · AAML · 12 Oct 2024

Are You Human? An Adversarial Benchmark to Expose LLMs
Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky · DeLMO · 12 Oct 2024

Can a large language model be a gaslighter?
Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You · 11 Oct 2024

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, J. Mei, Bingchen Zhao, Yunhong Wang, Cihang Xie · 11 Oct 2024

JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu · ELM · 11 Oct 2024

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang, Xiaogeng Liu, Chaowei Xiao · AAML · 11 Oct 2024

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, ..., Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang · LLMAG · 11 Oct 2024

Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb, Fabien Roger · AAML, MU · 11 Oct 2024

TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations
Nathalie Maria Kirch, Konstantin Hebenstreit, Matthias Samwald · 10 Oct 2024

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Hyun Park, Minjoon Seo · 10 Oct 2024