Jailbroken: How Does LLM Safety Training Fail?
5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

50 / 640 papers shown
LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
Yifeng Wang, Zhouhong Gu, Siwei Zhang, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
03 Sep 2024
Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine
AAML
29 Aug 2024
Acceptable Use Policies for Foundation Models
Kevin Klyman
29 Aug 2024
FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu, Supriti Vijay
AAML
28 Aug 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu
28 Aug 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
AAML, MU
27 Aug 2024
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh
27 Aug 2024
LLM-PBE: Assessing Data Privacy in Large Language Models
Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, ..., Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song
ELM, PILM
23 Aug 2024
Towards "Differential AI Psychology" and in-context Value-driven
  Statement Alignment with Moral Foundations Theory
Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory
Simon Münker
SyDa
29
0
0
21 Aug 2024
EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang
AAML
21 Aug 2024
MEGen: Generative Backdoor in Large Language Models via Model Editing
Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao
AAML, KELM, SILM
20 Aug 2024
Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
Jinxin Liu, Zao Yang
20 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Donghai Hong, Jiaying Chen, Wenhai Wang
18 Aug 2024
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang
AAML
16 Aug 2024
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
AAML
11 Aug 2024
Can a Bayesian Oracle Prevent Harm from an Agent?
Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar
09 Aug 2024
Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts
Tingchen Fu, Yupeng Hou, Julian McAuley, Rui Yan
09 Aug 2024
Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
Fabio Pernisi, Dirk Hovy, Paul Röttger
08 Aug 2024
Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li
AAML
08 Aug 2024
EnJa: Ensemble Jailbreak on Large Language Models
Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang
AAML
07 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee
02 Aug 2024
Misinforming LLMs: vulnerabilities, challenges and opportunities
Jaroslaw Kornowicz, Daniel Geissler, Kirsten Thommes
02 Aug 2024
Autonomous LLM-Enhanced Adversarial Attack for Text-to-Motion
Honglei Miao, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang
AAML
01 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU
01 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, ..., Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
ELM
31 Jul 2024
Machine Unlearning in Generative AI: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng Jiang
MU
30 Jul 2024
LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models
Achintya Gopal, Nicholas Wai Long Lau, Eva Adelina Susanto, Chi Lok Yu, Aditya Paul
ELM
27 Jul 2024
GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy
Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci
25 Jul 2024
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu, Haichang Gao, Jianping He, Ping Wang
25 Jul 2024
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun
SILM
23 Jul 2024
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren
AAML, LLMAG
23 Jul 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Lei Bai, Wei Xu, Han Qiu
23 Jul 2024
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han
23 Jul 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
22 Jul 2024
LLMmap: Fingerprinting For Large Language Models
Dario Pasquini, Evgenios M. Kornaropoulos, G. Ateniese
22 Jul 2024
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
Guang-Da Liu, Haitao Mao, Jiliang Tang, K. Johnson
LRM
21 Jul 2024
Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Md Zarif Hossain, Ahmed Imteaj
VLM, AAML
20 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adrià Garriga-Alonso, Robert Kirk
LLMSV
17 Jul 2024
The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Jie Zhang, Dongrui Liu, Chao Qian, Ziyue Gan, Yong-jin Liu, Yu Qiao, Jing Shao
LLMAG, PILM
17 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
16 Jul 2024
BadRobot: Jailbreaking Embodied LLMs in the Physical World
Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Yichen Wang, ..., Shengshan Hu, Leo Yu Zhang, Aishan Liu, Peijin Guo
LM&Ro
16 Jul 2024
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, ..., Felix Juefei Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang
AAML
15 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
12 Jul 2024
ProxyGPT: Enabling Anonymous Queries in AI Chatbots with (Un)Trustworthy Browser Proxies
Dzung Pham, J. Sheffey, Chau Minh Pham, Amir Houmansadr
11 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi, M. Sameki, Ankur Taly
HILM, ELM, AILaw
10 Jul 2024
LIONs: An Empirically Optimized Approach to Align Language Models
Xiao Yu, Qingyang Wu, Yu Li, Zhou Yu
ALM
09 Jul 2024
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo-wen Li
LRM
08 Jul 2024
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, Hung-yi Lee
ELM, AI4Ed
07 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
AAML
05 Jul 2024