Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
arXiv:2405.18641 (v4, latest) · 28 May 2024
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Links: arXiv (abs) · PDF · HTML · GitHub (21★)

Papers citing "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (28 of 28 papers shown)

Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
20 May 2025

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
20 Jan 2025 · ALM

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
03 Jan 2025

On Evaluating the Durability of Safeguards for Open-Weight LLMs
Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson
10 Dec 2024 · AAML

PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
Shenghui Li, Edith C.H. Ngai, Fanghua Ye, Thiemo Voigt
28 Nov 2024 · SILM

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Peng Kuang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
17 Nov 2024

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi
23 Oct 2024 · AAML

Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
H. Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
20 Oct 2024 · CLL

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
13 Oct 2024 · AAML

Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
01 Aug 2024 · AAML, MU

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
28 Jun 2024 · SILM, AAML

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-ying Huang
27 May 2024

A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu
02 Apr 2024 · LLMAG, LM&Ro, AI4CE, LM&MA

Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan, Richard Fang, R. Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
09 Nov 2023 · MU, AAML

Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives
Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
02 Oct 2023

Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
22 Jun 2023 · AAML

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
29 May 2023 · ALM

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Boyao Wang, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang
13 Apr 2023 · ALM

RRHF: Rank Responses to Align Language Models with Human Feedback without tears
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang
11 Apr 2023 · ALM

Chain of Hindsight Aligns Language Models with Feedback
Hao Liu, Carmelo Sferrazza, Pieter Abbeel
06 Feb 2023 · ALM

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022 · OSLM, ALM

Federated Learning Based on Dynamic Regularization
D. A. E. Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, P. Whatmough, Venkatesh Saligrama
08 Nov 2021 · FedML

Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
27 Oct 2021 · ReLM, OffRL, LRM

Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry
Ziyi Chen, Yi Zhou, Tengyu Xu, Yingbin Liang
09 Feb 2021

Meta-Learning with Implicit Gradients
Aravind Rajeswaran, Chelsea Finn, Sham Kakade, Sergey Levine
10 Sep 2019

On the Convergence of FedAvg on Non-IID Data
Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang
04 Jul 2019 · FedML

Overcoming catastrophic forgetting in neural networks
J. Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, J. Veness, Guillaume Desjardins, ..., A. Grabska-Barwinska, Demis Hassabis, Claudia Clopath, D. Kumaran, R. Hadsell
02 Dec 2016 · CLL

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
Hamed Karimi, J. Nutini, Mark Schmidt
16 Aug 2016