Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.02855
Cited By
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
3 July 2024
Zhexin Zhang
Junxiao Yang
Pei Ke
Shiyao Cui
Shiyao Cui
Chujie Zheng
Hongning Wang
Minlie Huang
MU
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks"
21 / 21 papers shown
Title
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Zhehao Zhang
Weijie Xu
Fanyou Wu
Chandan K. Reddy
29
0
0
12 May 2025
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An
Shiyue Zhang
Mark Dredze
61
0
0
25 Apr 2025
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
Yu Li
Han Jiang
Zhihua Wei
AAML
41
0
0
18 Apr 2025
Teaching Large Language Models to Reason through Learning and Forgetting
Tianwei Ni
Allen Nie
Sapana Chaudhary
Yao Liu
Huzefa Rangwala
Rasool Fakoor
ReLM
CLL
LRM
163
0
0
15 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
43
0
0
13 Apr 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
51
2
0
05 Mar 2025
Improving LLM Safety Alignment with Dual-Objective Optimization
Xuandong Zhao
Will Cai
Tianneng Shi
David Huang
Licong Lin
Song Mei
Dawn Song
AAML
MU
69
1
0
05 Mar 2025
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma
Dongrui Liu
Qian Chen
Linfeng Zhang
Jing Shao
MoMe
168
0
0
24 Feb 2025
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhexin Zhang
Leqi Lei
Junxiao Yang
Xijie Huang
Yida Lu
...
Xianqi Lei
Changzai Pan
Lei Sha
Han Wang
Minlie Huang
AAML
48
0
0
24 Feb 2025
A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models
Jiahui Geng
Qing Li
Herbert Woisetschlaeger
Zongxiong Chen
Yansen Wang
Preslav Nakov
Hans-Arno Jacobsen
Fakhri Karray
MU
41
1
0
22 Feb 2025
Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Haokun Chen
Sebastian Szyller
Weilin Xu
N. Himayat
MU
AAML
43
0
0
20 Feb 2025
OpenGU: A Comprehensive Benchmark for Graph Unlearning
Bowen Fan
Yuming Ai
Xunkai Li
Zhilin Guo
Rong-Hua Li
Guoren Wang
MU
43
1
0
06 Jan 2025
Fast Convergence of
Φ
Φ
Φ
-Divergence Along the Unadjusted Langevin Algorithm and Proximal Sampler
Siddharth Mitra
Andre Wibisono
60
0
0
14 Oct 2024
A Closer Look at Machine Unlearning for Large Language Models
Xiaojian Yuan
Tianyu Pang
Chao Du
Kejiang Chen
Weiming Zhang
Min-Bin Lin
MU
43
5
0
10 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
46
6
0
04 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee
Shaona Ghosh
Traian Rebedea
Christopher Parisien
LLMSV
34
4
0
02 Oct 2024
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang
Yang Yue
Rui Lu
Jingxin Shi
Andrew Zhao
Shenzhi Wang
Shiji Song
Gao Huang
LM&Ro
KELM
53
6
0
11 Jul 2024
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang
Licong Lin
Yu Bai
Song Mei
MU
66
131
0
08 Apr 2024
Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Yichuan Mo
Yuji Wang
Zeming Wei
Yisen Wang
AAML
SILM
49
25
0
09 Feb 2024
Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li
Yulin Chen
Jinglong Luo
Yan Kang
Xiaojin Zhang
Qi Hu
Chunkit Chan
Yangqiu Song
PILM
48
42
0
16 Oct 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
117
303
0
19 Sep 2023
1