2501.16750
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
28 January 2025
Xinyue Shen
Yixin Wu
Y. Qu
Michael Backes
Savvas Zannettou
Yang Zhang
Papers citing
"HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns"
17 / 17 papers shown
Title
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
Yu Yan
Sheng Sun
Zhifei Zheng
Ziji Hao
Teli Liu
Min Liu
AAML
62
0
0
27 May 2025
LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models
Ryan Chen
Youngmin Ko
Zeyu Zhang
Catherine Cho
Sunny Chung
Mauro Giuffré
Dennis L. Shung
Bradly C. Stadie
79
0
0
17 May 2025
Echoes of Power: Investigating Geopolitical Bias in US and China Large Language Models
Andre G. C. Pacheco
Athus Cavalini
Giovanni Comarela
56
1
0
20 Mar 2025
Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang
Yixin Wu
Rui Wen
Michael Backes
Yang Zhang
68
1
0
03 Feb 2025
Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
Nishant Vishwamitra
Keyan Guo
Farhan Tajwar Romit
Isabelle Ondracek
Long Cheng
Ziming Zhao
Hongxin Hu
29
13
0
22 Dec 2023
Baichuan 2: Open Large-scale Language Models
Ai Ming Yang
Bin Xiao
Bingning Wang
Borong Zhang
Ce Bian
...
Youxin Jiang
Yuchen Gao
Yupeng Zhang
Guosheng Dong
Zhiying Wu
ELM
LRM
129
731
0
19 Sep 2023
No Easy Way Out: the Effectiveness of Deplatforming an Extremist Forum to Suppress Hate and Harassment
Anh V. Vu
Alice Hutchings
Ross Anderson
16
9
0
14 Apr 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
403
13,788
0
15 Mar 2023
I Know What You Trained Last Summer: A Survey on Stealing Machine Learning Models and Defences
Daryna Oliynyk
Rudolf Mayer
Andreas Rauber
84
109
0
16 Jun 2022
Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech
Tomer Wullach
A. Adler
Einat Minkov
19
41
0
01 Sep 2021
TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter
Sumit Kumar
Raj Ratn Pranesh
32
18
0
27 Aug 2021
HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
47
263
0
31 Dec 2020
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
Di Jin
Zhijing Jin
Qiufeng Wang
Peter Szolovits
SILM
AAML
97
1,064
0
27 Jul 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
372
24,160
0
26 Jul 2019
TextBugger: Generating Adversarial Text Against Real-world Applications
Jinfeng Li
S. Ji
Tianyu Du
Bo Li
Ting Wang
SILM
AAML
136
731
0
13 Dec 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
854
93,936
0
11 Oct 2018
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Hossein Hosseini
Sreeram Kannan
Baosen Zhang
Radha Poovendran
AAML
26
328
0
27 Feb 2017