Adversarial Fine-Tuning of Language Models: An Iterative Optimisation
Approach for the Generation and Detection of Problematic Content

Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content

26 August 2023

Charles OÑeill

Jack Miller

Papers citing "Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content"

10 / 10 papers shown

Title
Shh, don't say that! Domain Certification in LLMs Cornelius Emde Alasdair Paren Preetham Arvind Maxime Kayser Tom Rainforth Thomas Lukasiewicz Guohao Li Philip H. S. Torr Adel Bibi 53 1 0 26 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models Leyla Naz Candogan Yongtao Wu Elias Abad Rocamora Grigorios G. Chrysos V. Cevher AAML 51 0 0 24 Feb 2025
Poisoning Language Models During Instruction Tuning Alexander Wan Eric Wallace Sheng Shen Dan Klein SILM 92 124 0 01 May 2023
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Deep Ganguli Liane Lovitt John Kernion Amanda Askell Yuntao Bai ... Nicholas Joseph Sam McCandlish C. Olah Jared Kaplan Jack Clark 225 446 0 23 Aug 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 313 11,953 0 04 Mar 2022
A Survey of Toxic Comment Classification Methods Kehan Wang Jiaxi Yang Hongjun Wu 18 8 0 13 Dec 2021
Analyzing Dynamic Adversarial Training Data in the Limit Eric Wallace Adina Williams Robin Jia Douwe Kiela 198 30 0 16 Oct 2021
Challenges in Detoxifying Language Models Johannes Welbl Amelia Glaese J. Uesato Sumanth Dathathri John F. J. Mellor Lisa Anne Hendricks Kirsty Anderson Pushmeet Kohli Ben Coppin Po-Sen Huang LM&MA 250 193 0 15 Sep 2021
Machine Learning Suites for Online Toxicity Detection David A. Noever 93 33 0 03 Oct 2018
NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task Rui Wang Benjamin Marie Masao Utiyama Eiichiro Sumita 16 4 0 19 Sep 2018