Shape it Up! Restoring LLM Safety during Finetuning
arXiv 2505.17196 · 22 May 2025
ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau
Papers citing "Shape it Up! Restoring LLM Safety during Finetuning" (8 of 8 papers shown):
| Title | Authors | Topic | Citations | Date |
|---|---|---|---|---|
| Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models | Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen | | 2 | 24 Mar 2025 |
| Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs | Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, ..., An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu | | 5 | 28 Feb 2025 |
| Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans | AAML | 16 | 24 Feb 2025 |
| SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models | Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang | | 2 | 18 Feb 2025 |
| Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, ..., Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa | AI4MH | 407 | 07 Dec 2023 |
| Fine-tuning can cripple your foundation model; preserving features may be the solution | Jishnu Mukhoti, Y. Gal, Philip Torr, P. Dokania | CLL | 37 | 25 Aug 2023 |
| Universal and Transferable Adversarial Attacks on Aligned Language Models | Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson | | 1,376 | 27 Jul 2023 |
| Proximal Policy Optimization Algorithms | John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov | OffRL | 18,685 | 20 Jul 2017 |