ResearchTrend.AI

Jailbroken: How Does LLM Safety Training Fail?

5 July 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

50 / 652 papers shown
BadRobot: Jailbreaking Embodied LLMs in the Physical World
Hangtao Zhang
Chenyu Zhu
Xianlong Wang
Ziqi Zhou
Yichen Wang
...
Shengshan Hu
Aishan Liu
Peijin Guo
Leo Yu Zhang
LM&Ro
55
8
0
16 Jul 2024
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng
Mingyu Jin
Qinkai Yu
Zhenting Wang
Wenyue Hua
...
Felix Juefei Xu
Kaize Ding
Fan Yang
Ruixiang Tang
Yongfeng Zhang
AAML
44
10
0
15 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
45
19
0
12 Jul 2024
ProxyGPT: Enabling Anonymous Queries in AI Chatbots with (Un)Trustworthy Browser Proxies
Dzung Pham
J. Sheffey
Chau Minh Pham
Amir Houmansadr
40
1
0
11 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
45
9
0
10 Jul 2024
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi
M. Sameki
Ankur Taly
HILM
ELM
AILaw
44
12
0
10 Jul 2024
LIONs: An Empirically Optimized Approach to Align Language Models
Xiao Yu
Qingyang Wu
Yu Li
Zhou Yu
ALM
40
3
0
09 Jul 2024
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Bo-wen Li
LRM
46
12
0
08 Jul 2024
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
Cheng-Han Chiang
Wei-Chih Chen
Chun-Yi Kuan
Chienchou Yang
Hung-yi Lee
ELM
AI4Ed
49
5
0
07 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
42
85
0
05 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
209
2
0
03 Jul 2024
SOS! Soft Prompt Attack Against Open-Source Large Language Models
Ziqing Yang
Michael Backes
Yang Zhang
Ahmed Salem
AAML
40
6
0
03 Jul 2024
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin
Shiyi Liu
Haotian Li
Xun Zhao
Huamin Qu
50
3
0
03 Jul 2024
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy
Pedro M. Esperança
Silviu Vlad Oprea
Mete Ozay
KELM
36
2
0
03 Jul 2024
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
Zhexin Zhang
Junxiao Yang
Pei Ke
Shiyao Cui
Chujie Zheng
Hongning Wang
Minlie Huang
MU
AAML
67
27
0
03 Jul 2024
Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou
Kun Li
Junan Li
Jiawen Kang
Minda Hu
Xixin Wu
Helen Meng
AAML
36
1
0
01 Jul 2024
Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang
Xiaohua Wang
Feiran Zhang
Zhibo Xu
Cenyuan Zhang
Xiaoqing Zheng
Xuanjing Huang
AAML
LRM
40
4
0
01 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
43
30
0
28 Jun 2024
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
Yuqi Zhou
Lin Lu
Hanchi Sun
Pan Zhou
Lichao Sun
39
10
0
28 Jun 2024
Rethinking harmless refusals when fine-tuning foundation models
Florin Pop
Judd Rosenblatt
Diogo Schwerz de Lucena
Michael Vaiana
18
0
0
27 Jun 2024
Revealing Fine-Grained Values and Opinions in Large Language Models
Dustin Wright
Arnav Arora
Nadav Borenstein
Srishti Yadav
Serge J. Belongie
Isabelle Augenstein
41
1
0
27 Jun 2024
FernUni LLM Experimental Infrastructure (FLEXI) -- Enabling Experimentation and Innovation in Higher Education Through Access to Open Large Language Models
Torsten Zesch
Michael Hanses
Niels Seidel
Piush Aggarwal
Dirk Veiel
Claudia de Witt
31
0
0
27 Jun 2024
A Survey on Privacy Attacks Against Digital Twin Systems in AI-Robotics
Ivan A. Fernandez
Subash Neupane
Trisha Chakraborty
Shaswata Mitra
Sudip Mittal
Nisha Pillai
Jingdao Chen
Shahram Rahimi
52
1
0
27 Jun 2024
Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim
Saleh Almohaimeed
Mengxin Zheng
Yan Solihin
Qian Lou
42
2
0
26 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
40
28
0
26 Jun 2024
Poisoned LangChain: Jailbreak LLMs by LangChain
Ziqiu Wang
Jun Liu
Shengkai Zhang
Yang Yang
36
7
0
26 Jun 2024
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang
Wanxu Zhao
Rui Zheng
Huijie Lv
Shihan Dou
...
Junjie Ye
Yuming Yang
Tao Gui
Qi Zhang
Xuanjing Huang
LLMSV
AAML
52
7
0
26 Jun 2024
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Song Wang
Jundong Li
Tianlong Chen
Huan Liu
SILM
54
19
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
38
26
0
26 Jun 2024
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Zhengyue Zhao
Xiaoyun Zhang
Kaidi Xu
Xing Hu
Rui Zhang
Zidong Du
Qi Guo
Yunji Chen
30
6
0
24 Jun 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang
Zhuohan Long
Zhihao Fan
Zhongyu Wei
42
7
0
21 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
65
18
0
21 Jun 2024
Pareto-Optimal Learning from Preferences with Hidden Context
Ryan Boldi
Li Ding
Lee Spector
S. Niekum
70
6
0
21 Jun 2024
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue
Avishree Khare
Rajeev Alur
Surbhi Goel
Eric Wong
61
2
0
21 Jun 2024
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Hasan Hammoud
Umberto Michieli
Fabio Pizzati
Philip Torr
Adel Bibi
Guohao Li
Mete Ozay
MoMe
31
15
0
20 Jun 2024
Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
Đorđe Klisura
Anthony Rios
AAML
24
1
0
20 Jun 2024
Adversaries Can Misuse Combinations of Safe Models
Erik Jones
Anca Dragan
Jacob Steinhardt
50
7
0
20 Jun 2024
Prompt Injection Attacks in Defended Systems
Daniil Khomsky
Narek Maloyan
Bulat Nutfullin
AAML
SILM
38
3
0
20 Jun 2024
BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern
Zhulin Hu
Yuqing Yang
Ethan Chern
Yuan Guo
Jiahe Jin
Binjie Wang
Pengfei Liu
HILM
ALM
86
3
0
19 Jun 2024
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu
Ting Sun
Tianyang Xu
Feijie Wu
Cunxiang Wang
Xiaoqian Wang
Jing Gao
AAML
DeLMO
AILaw
56
16
0
18 Jun 2024
[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao
Monojit Choudhury
Somak Aditya
26
0
0
18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
44
8
0
17 Jun 2024
Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
S. Kadhe
Farhan Ahmed
Dennis Wei
Nathalie Baracaldo
Inkit Padhi
MoMe
MU
28
7
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
50
138
0
17 Jun 2024
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
Shangqing Tu
Zhuoran Pan
Wenxuan Wang
Zhexin Zhang
Yuliang Sun
Jifan Yu
Hongning Wang
Lei Hou
Juanzi Li
ALM
47
1
0
17 Jun 2024
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong
Lei Yu
Shauli Ravfogel
Haiqin Yang
Mor Geva
KELM
MU
68
18
0
17 Jun 2024
MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts
Guanjie Chen
Xinyu Zhao
Tianlong Chen
Yu Cheng
MoE
83
5
0
17 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
47
10
0
17 Jun 2024
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
Yuqing Wang
Yun Zhao
LRM
AAML
ELM
27
1
0
16 Jun 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
Pengfei He
Han Xu
Yue Xing
Makoto Yamada
Hui Liu
Jiliang Tang
34
11
0
16 Jun 2024