2102.00086
Challenges in Automated Debiasing for Toxic Language Detection
29 January 2021
Xuhui Zhou
Maarten Sap
Swabha Swayamdipta
Noah A. Smith
Yejin Choi
Papers citing
"Challenges in Automated Debiasing for Toxic Language Detection"
38 / 38 papers shown
Title
Safety in Large Reasoning Models: A Survey
Cheng Wang
Yong-Jin Liu
Yangqiu Song
Duzhen Zhang
Zechao Li
Junfeng Fang
Bryan Hooi
LRM
242
2
0
24 Apr 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun Xia
Tianyi Wu
Zhiwei Xue
Yuxiao Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
136
16
0
30 Jan 2025
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao
Abdullatif Köksal
Yihong Liu
Leonie Weissweiler
Anna Korhonen
Hinrich Schütze
SyDa
44
1
0
30 Aug 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
38
35
0
31 May 2024
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
Xuhui Zhou
Zhe Su
Tiwalayo Eisape
Hyunwoo J. Kim
Maarten Sap
39
38
0
08 Mar 2024
Algorithmic Arbitrariness in Content Moderation
Juan Felipe Gomez
Caio Vieira Machado
Lucas Monteiro Paes
Flavio du Pin Calmon
36
9
0
26 Feb 2024
Generative AI for Hate Speech Detection: Evaluation and Findings
Sagi Pendzel
Tomer Wullach
Amir Adler
Einat Minkov
33
11
0
16 Nov 2023
Examining Temporal Bias in Abusive Language Detection
Mali Jin
Yida Mu
Diana Maynard
Kalina Bontcheva
36
5
0
25 Sep 2023
A Survey on Fairness in Large Language Models
Yingji Li
Mengnan Du
Rui Song
Xin Wang
Ying Wang
ALM
54
60
0
20 Aug 2023
HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Jiangrui Zheng
Xueqing Liu
Guanqun Yang
Mirazul Haque
Xing Qian
Ravishka Rathnasuriya
Wei Yang
G. Budhrani
49
3
0
23 Jul 2023
Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
Tarek Naous
Michael Joseph Ryan
Alan Ritter
Wei Xu
37
86
0
23 May 2023
Backdoor Learning for NLP: Recent Advances, Challenges, and Future Research Directions
Marwan Omar
SILM
AAML
35
20
0
14 Feb 2023
Towards Agile Text Classifiers for Everyone
Maximilian Mozes
Jessica Hoffmann
Katrin Tomanek
Muhamed Kouate
Nithum Thain
Ann Yuan
Tolga Bolukbasi
Lucas Dixon
52
13
0
13 Feb 2023
Rating Sentiment Analysis Systems for Bias through a Causal Lens
Kausik Lakkaraju
Biplav Srivastava
Marco Valtorta
34
7
0
04 Feb 2023
Multi-VALUE: A Framework for Cross-Dialectal English NLP
Caleb Ziems
William B. Held
Jingfeng Yang
Jwala Dhamala
Rahul Gupta
Diyi Yang
51
41
0
15 Dec 2022
NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel
Hamid Palangi
Yejin Choi
AAML
47
1
0
08 Nov 2022
The State of Profanity Obfuscation in Natural Language Processing
Debora Nozza
Dirk Hovy
47
7
0
14 Oct 2022
Controlling Bias Exposure for Fair Interpretable Predictions
Zexue He
Yu Wang
Julian McAuley
Bodhisattwa Prasad Majumder
27
19
0
14 Oct 2022
Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization
Zonghan Yang
Xiaoyuan Yi
Peng Li
Yang Liu
Xing Xie
38
33
0
10 Oct 2022
Toward Understanding Bias Correlations for Mitigation in NLP
Lu Cheng
Suyu Ge
Huan Liu
39
8
0
24 May 2022
Towards Debiasing Translation Artifacts
Koel Dutta Chowdhury
Rricha Jalota
C. España-Bonet
Josef van Genabith
31
6
0
16 May 2022
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes
Antonis Maronikolakis
Philip Baader
Hinrich Schütze
34
9
0
13 May 2022
A Call for Clarity in Beam Search: How It Works and When It Stops
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Dragomir R. Radev
Yejin Choi
Noah A. Smith
28
6
0
11 Apr 2022
On Explaining Multimodal Hateful Meme Detection Models
Ming Shan Hee
Roy Ka-wei Lee
Wen-Haw Chong
VLM
29
39
0
04 Apr 2022
Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal
Umang Gupta
Jwala Dhamala
Varun Kumar
Apurv Verma
Yada Pruksachatkun
Satyapriya Krishna
Rahul Gupta
Kai-Wei Chang
Greg Ver Steeg
Aram Galstyan
21
49
0
23 Mar 2022
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Thomas Hartvigsen
Saadia Gabriel
Hamid Palangi
Maarten Sap
Dipankar Ray
Ece Kamar
38
354
0
17 Mar 2022
Handling Bias in Toxic Speech Detection: A Survey
Tanmay Garg
Sarah Masud
Tharun Suresh
Tanmoy Chakraborty
17
91
0
26 Jan 2022
Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings
Andrew Wang
Mohit Sudhakar
Yangfeng Ji
20
2
0
15 Dec 2021
"Stop Asian Hate!" : Refining Detection of Anti-Asian Hate Speech During the COVID-19 Pandemic
H. Nghiem
Fred Morstatter
25
8
0
04 Dec 2021
Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
Maarten Sap
Swabha Swayamdipta
Laura Vianna
Xuhui Zhou
Yejin Choi
Noah A. Smith
46
269
0
15 Nov 2021
Mitigating Racial Biases in Toxic Language Detection with an Equity-Based Ensemble Framework
Matan Halevy
Camille Harris
A. Bruckman
Diyi Yang
A. Howard
42
35
0
27 Sep 2021
Detecting Inspiring Content on Social Media
Oana Ignat
Y-Lan Boureau
Jane A. Yu
A. Halevy
24
6
0
06 Sep 2021
General-Purpose Question-Answering with Macaw
Oyvind Tafjord
Peter Clark
SyDa
ELM
MLLM
30
59
0
06 Sep 2021
Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts
Ashutosh Baheti
Maarten Sap
Alan Ritter
Mark O. Riedl
21
84
0
26 Aug 2021
Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
Emily Dinan
Gavin Abercrombie
A. S. Bergman
Shannon L. Spruit
Dirk Hovy
Y-Lan Boureau
Verena Rieser
43
105
0
07 Jul 2021
LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond
Daniel Loureiro
A. Jorge
Jose Camacho-Collados
35
26
0
26 May 2021
Detoxifying Language Models Risks Marginalizing Minority Voices
Albert Xu
Eshaan Pathak
Eric Wallace
Suchin Gururangan
Maarten Sap
Dan Klein
24
123
0
13 Apr 2021
Empirical Analysis of Multi-Task Learning for Reducing Model Bias in Toxic Comment Detection
Ameya Vaidya
Feng Mai
Yue Ning
115
21
0
21 Sep 2019