The effect of fine-tuning on language model toxicity

21 October 2024

Papers citing "The effect of fine-tuning on language model toxicity"

Title
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma | 106 0 0 | 20 May 2025
Fine-tuning Language Models for Factuality | Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn | KELM, HILM, SyDa | 73 179 0 | 14 Nov 2023
Removing RLHF Protections in GPT-4 via Fine-Tuning | Qiusi Zhan, Richard Fang, R. Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang | MU, AAML | 58 101 0 | 09 Nov 2023
The Expressive Power of Low-Rank Adaptation | Yuchen Zeng, Kangwook Lee | 96 62 0 | 26 Oct 2023
Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information? | A. Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, Vaikkunth Mugunthan | SILM | 85 21 0 | 31 Jul 2023
On the Effectiveness of Parameter-Efficient Fine-Tuning | Z. Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, Nigel Collier | 68 158 0 | 28 Nov 2022
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 880 12,973 0 | 04 Mar 2022