arXiv 2012.15761: Cited By
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
31 December 2020
Bertie Vidgen, Tristan Thrush, Zeerak Talat, Douwe Kiela
Papers citing "Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection" (50 of 76 papers shown)

Towards a comprehensive taxonomy of online abusive language informed by machine learning
Samaneh Hosseini Moghaddam, Kelly Lyons, Cheryl Regehr, Vivek Goel, Kaitlyn Regehr (24 Apr 2025)

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English
Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian A. Reuter (11 Apr 2025)

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, ..., Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell (10 Apr 2025)

LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation
Junyeong Park, Seogyeong Jeong, Shri Kiran Srinivasan, Yohan Lee, Alice H. Oh (10 Mar 2025)

CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan (08 Mar 2025)

Improving Hate Speech Classification with Cross-Taxonomy Dataset Integration
Jan Fillies, Adrian Paschke (07 Mar 2025)

Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer (03 Mar 2025)

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang (18 Feb 2025)

Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang (03 Feb 2025)

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng (24 Dec 2024)

SubData: Bridging Heterogeneous Datasets to Enable Theory-Driven Evaluation of Political and Demographic Perspectives in LLMs
Leon Fröhling, Pietro Bernardelle, Gianluca Demartini (21 Dec 2024)

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
S. Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal (02 Dec 2024)

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau, Diyi Liu, Niyati Malhotra, Scott A. Hale, Samuel Fraiberger, Victor Orozco-Olvera, Paul Röttger (23 Nov 2024)

Perceiving and Countering Hate: The Role of Identity in Online Responses
Kaike Ping, James Hawdon, Eugenia H Rho (03 Nov 2024)

LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang (22 Oct 2024)

DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
Urja Khurana, Eric T. Nalisnick, Antske Fokkens (21 Oct 2024)

Personas with Attitudes: Controlling LLMs for Diverse Data Annotation
Leon Fröhling, Gianluca Demartini, Dennis Assenmacher (15 Oct 2024)

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang (02 Oct 2024)

Decoding Hate: Exploring Language Models' Reactions to Hate Speech
Paloma Piot, Javier Parapar (01 Oct 2024)

Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
Yungi Kim, Hyunsoo Ha, Sukyung Lee, Jihoo Kim, Seonghoon Yang, Chanjun Park (15 Sep 2024)

THInC: A Theory-Driven Framework for Computational Humor Detection
Victor De Marez, Thomas Winters, Ayla Rigouts Terryn (02 Sep 2024)

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan (01 Sep 2024)

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?
Urja Khurana, Eric T. Nalisnick, Antske Fokkens, Swabha Swayamdipta (26 Aug 2024)

GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models
Kunsheng Tang, Wenbo Zhou, Jie Zhang, Aishan Liu, Gelei Deng, Shuai Li, Peigui Qi, Weiming Zhang, Tianwei Zhang, Nenghai Yu (22 Aug 2024)

A Study on Bias Detection and Classification in Natural Language Processing
Ana Sofia Evans, Helena Moniz, Luísa Coheur (14 Aug 2024)

Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata (24 Jul 2024)

Automated Adversarial Discovery for Safety Classifiers
Yash Kumar Lal, Preethi Lahoti, Aradhana Sinha, Yao Qin, Ananth Balashankar (24 Jun 2024)

Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang, Xiaoyuan Yi, Zhihua Wei, Shu Wang, Xing Xie (20 Jun 2024)

Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection
Jaehoon Kim, Seungwan Jin, Sohyun Park, Someen Park, Kyungsik Han (12 Jun 2024)

Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster
Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri (06 Jun 2024)

Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, ..., Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain (28 May 2024)

Leveraging Large Language Models for Semantic Query Processing in a Scholarly Knowledge Graph
Runsong Jia, Bowen Zhang, Sergio J. Rodríguez Méndez, Pouya Ghiasnezhad Omran (24 May 2024)

Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT
Shucheng Zhu, Weikang Wang, Ying Liu (11 May 2024)

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, Paul Röttger (27 Apr 2024)

HateTinyLLM: Hate Speech Detection Using Tiny Large Language Models
Tanmay Sen, Ansuman Das, Mrinmay Sen (26 Apr 2024)

Target Span Detection for Implicit Harmful Content
Nazanin Jafari, James Allan, Sheikh Muhammad Sarwar (28 Mar 2024)

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Luiza Amador Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis (06 Mar 2024)

Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language
Hezhao Zhang, Lasana Harris, N. Moosavi (21 Feb 2024)

Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges
Aiqi Jiang, A. Zubiaga (17 Jan 2024)

An Investigation of Large Language Models for Real-World Hate Speech Detection
Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, Hongxin Hu (07 Jan 2024)

HCDIR: End-to-end Hate Context Detection, and Intensity Reduction model for online comments
Neeraj Kumar Singh, Koyel Ghosh, Joy Mahapatra, Utpal Garain, Apurbalal Senapati (20 Dec 2023)

Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
Jiang Zhang, Qiong Wu, Yiming Xu, Cheng Cao, Zheng Du, Konstantinos Psounis (13 Dec 2023)

Enhancing Robustness of Foundation Model Representations under Provenance-related Distribution Shifts
Xiruo Ding, Zhecheng Sheng, Brian Hur, Feng Chen, Serguei V. S. Pakhomov, Trevor Cohen (09 Dec 2023)

How Far Can We Extract Diverse Perspectives from Large Language Models?
Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, Dongyeop Kang (16 Nov 2023)

FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du, Albert Gatt, Dong Nguyen (10 Oct 2023)

Examining Temporal Bias in Abusive Language Detection
Mali Jin, Yida Mu, Diana Maynard, Kalina Bontcheva (25 Sep 2023)

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy (02 Aug 2023)

HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Jiangrui Zheng, Xueqing Liu, Guanqun Yang, Mirazul Haque, Xing Qian, Ravishka Rathnasuriya, Wei Yang, G. Budhrani (23 Jul 2023)

CL-UZH at SemEval-2023 Task 10: Sexism Detection through Incremental Fine-Tuning and Multi-Task Learning with Label Descriptions
Janis Goldzycher (06 Jun 2023)

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang (13 Apr 2023)