Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2308.13768
Cited By
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
26 August 2023
Charles OÑeill
Jack Miller
I. Ciucă
Y. Ting 丁
Thang Bui
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content"
10 / 10 papers shown
Title
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Guohao Li
Philip H. S. Torr
Adel Bibi
53
1
0
26 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
V. Cevher
AAML
51
0
0
24 Feb 2025
Poisoning Language Models During Instruction Tuning
Alexander Wan
Eric Wallace
Sheng Shen
Dan Klein
SILM
92
124
0
01 May 2023
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
225
446
0
23 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
313
11,953
0
04 Mar 2022
A Survey of Toxic Comment Classification Methods
Kehan Wang
Jiaxi Yang
Hongjun Wu
18
8
0
13 Dec 2021
Analyzing Dynamic Adversarial Training Data in the Limit
Eric Wallace
Adina Williams
Robin Jia
Douwe Kiela
198
30
0
16 Oct 2021
Challenges in Detoxifying Language Models
Johannes Welbl
Amelia Glaese
J. Uesato
Sumanth Dathathri
John F. J. Mellor
Lisa Anne Hendricks
Kirsty Anderson
Pushmeet Kohli
Ben Coppin
Po-Sen Huang
LM&MA
250
193
0
15 Sep 2021
Machine Learning Suites for Online Toxicity Detection
David A. Noever
93
33
0
03 Oct 2018
NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
Rui Wang
Benjamin Marie
Masao Utiyama
Eiichiro Sumita
16
4
0
19 Sep 2018
1