Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2109.07445
Cited By
Challenges in Detoxifying Language Models
15 September 2021
Johannes Welbl
Amelia Glaese
J. Uesato
Sumanth Dathathri
John F. J. Mellor
Lisa Anne Hendricks
Kirsty Anderson
Pushmeet Kohli
Ben Coppin
Po-Sen Huang
LM&MA
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Challenges in Detoxifying Language Models"
40 / 40 papers shown
Title
Teaching Models to Understand (but not Generate) High-risk Data
Ryan Yixiang Wang
Matthew Finlayson
Luca Soldaini
Swabha Swayamdipta
Robin Jia
109
0
0
05 May 2025
Safety in Large Reasoning Models: A Survey
Cheng Wang
Y. Liu
B. Li
Duzhen Zhang
Z. Li
Junfeng Fang
Bryan Hooi
LRM
142
1
0
24 Apr 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun-Xiong Xia
Tianyi Wu
Zhiwei Xue
Y. Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
131
13
0
30 Jan 2025
LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
Stefan Pasch
38
0
0
04 Jan 2025
Composable Interventions for Language Models
Arinbjorn Kolbeinsson
Kyle O'Brien
Tianjin Huang
Shanghua Gao
Shiwei Liu
...
Anurag J. Vaidya
Faisal Mahmood
Marinka Zitnik
Tianlong Chen
Thomas Hartvigsen
KELM
MU
87
5
0
09 Jul 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
32
8
0
08 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang
Haoyu Bu
Hui Wen
Yu Chen
Lun Li
Hongsong Zhu
28
36
0
06 May 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie
Jiayang Song
Zhehua Zhou
Yuheng Huang
Da Song
Lei Ma
OffRL
42
6
0
12 Apr 2024
A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Kareem Ahmed
Kai-Wei Chang
Guy Van den Broeck
26
10
0
06 Dec 2023
Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models
Luiza Amador Pozzobon
B. Ermiş
Patrick Lewis
Sara Hooker
28
20
0
11 Oct 2023
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation
Xinshuo Hu
Dongfang Li
Baotian Hu
Zihao Zheng
Zhenyu Liu
M. Zhang
KELM
MU
25
26
0
16 Aug 2023
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Pinjia He
Shuming Shi
Zhaopeng Tu
SILM
65
231
0
12 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger
Hannah Rose Kirk
Bertie Vidgen
Giuseppe Attanasio
Federico Bianchi
Dirk Hovy
ALM
ELM
AILaw
23
122
0
02 Aug 2023
RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
Kevin Kaichuang Yang
Dan Klein
Asli Celikyilmaz
Nanyun Peng
Yuandong Tian
ALM
34
31
0
24 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
75
832
0
05 Jul 2023
CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
Rahul Madhavan
Rishabh Garg
Kahini Wadhawan
S. Mehta
21
5
0
01 Jun 2023
An Invariant Learning Characterization of Controlled Text Generation
Carolina Zheng
Claudia Shi
Keyon Vafa
Amir Feder
David M. Blei
OOD
24
8
0
31 May 2023
"I'm fully who I am": Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation
Anaelia Ovalle
Palash Goyal
Jwala Dhamala
Zachary Jaggers
Kai-Wei Chang
Aram Galstyan
R. Zemel
Rahul Gupta
23
61
0
17 May 2023
CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
A. Sun
Varun Nair
Elliot Schumacher
Anitha Kannan
27
3
0
27 Apr 2023
BloombergGPT: A Large Language Model for Finance
Shijie Wu
Ozan Irsoy
Steven Lu
Vadim Dabravolski
Mark Dredze
Sebastian Gehrmann
P. Kambadur
David S. Rosenberg
Gideon Mann
AIFin
71
785
0
30 Mar 2023
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
42
194
0
16 Feb 2023
Towards Agile Text Classifiers for Everyone
Maximilian Mozes
Jessica Hoffmann
Katrin Tomanek
Muhamed Kouate
Nithum Thain
Ann Yuan
Tolga Bolukbasi
Lucas Dixon
34
13
0
13 Feb 2023
Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction
Renee Shelby
Shalaleh Rismani
Kathryn Henne
AJung Moon
Negar Rostamzadeh
...
N'Mah Yilla-Akbari
Jess Gallegos
A. Smart
Emilio Garcia
Gurleen Virk
34
188
0
11 Oct 2022
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese
Nat McAleese
Maja Trkebacz
John Aslanides
Vlad Firoiu
...
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
G. Irving
ALM
AAML
227
500
0
28 Sep 2022
Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots
Waiman Si
Michael Backes
Jeremy Blackburn
Emiliano De Cristofaro
Gianluca Stringhini
Savvas Zannettou
Yang Zhang
31
58
0
07 Sep 2022
In conversation with Artificial Intelligence: aligning language models with human values
Atoosa Kasirzadeh
Iason Gabriel
15
98
0
01 Sep 2022
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Saleh Soltan
Shankar Ananthakrishnan
Jack G. M. FitzGerald
Rahul Gupta
Wael Hamza
...
Mukund Sridhar
Fabian Triefenbach
Apurv Verma
Gökhan Tür
Premkumar Natarajan
46
82
0
02 Aug 2022
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Maribeth Rauh
John F. J. Mellor
J. Uesato
Po-Sen Huang
Johannes Welbl
...
Amelia Glaese
G. Irving
Iason Gabriel
William S. Isaac
Lisa Anne Hendricks
25
49
0
16 Jun 2022
DIRECTOR: Generator-Classifiers For Supervised Language Modeling
Kushal Arora
Kurt Shuster
Sainbayar Sukhbaatar
Jason Weston
VLM
30
40
0
15 Jun 2022
On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting
Tomasz Korbak
Hady ElSahar
Germán Kruszewski
Marc Dymetman
CLL
15
49
0
01 Jun 2022
Detoxifying Language Models with a Toxic Corpus
Yoon A Park
Frank Rudzicz
16
6
0
30 Apr 2022
Controllable Natural Language Generation with Contrastive Prefixes
Jing Qian
Li Dong
Yelong Shen
Furu Wei
Weizhu Chen
8
95
0
27 Feb 2022
Reward Modeling for Mitigating Toxicity in Transformer-based Language Models
Farshid Faal
K. Schmitt
Jia Yuan Yu
13
25
0
19 Feb 2022
Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
Boxin Wang
Wei Ping
Chaowei Xiao
P. Xu
M. Patwary
M. Shoeybi
Bo-wen Li
Anima Anandkumar
Bryan Catanzaro
14
64
0
08 Feb 2022
Cedille: A large autoregressive French language model
Martin Müller
Florian Laurent
34
19
0
07 Feb 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith
M. Patwary
Brandon Norick
P. LeGresley
Samyam Rajbhandari
...
M. Shoeybi
Yuxiong He
Michael Houston
Saurabh Tiwary
Bryan Catanzaro
MoE
57
730
0
28 Jan 2022
Handling Bias in Toxic Speech Detection: A Survey
Tanmay Garg
Sarah Masud
Tharun Suresh
Tanmoy Chakraborty
9
89
0
26 Jan 2022
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
259
374
0
28 Feb 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,814
0
14 Dec 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
208
616
0
03 Sep 2019
1