2202.11176
Cited By
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
22 February 2022
Alyssa Lees
Vinh Q. Tran
Yi Tay
Jeffrey Scott Sorensen
Jai Gupta
Donald Metzler
Lucy Vasserman
Papers citing
"A New Generation of Perspective API: Efficient Multilingual Character-level Transformers"
50 / 102 papers shown
Title
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan
Mengxuan Hu
Ronghang Zhu
Sheng Li
Anil Vullikanti
AAML
31
0
0
11 May 2025
Mapping the Italian Telegram Ecosystem: Communities, Toxicity, and Hate Speech
Lorenzo Alvisi
S. Tardelli
Maurizio Tesconi
188
0
0
28 Apr 2025
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu
Tianke Zhang
Chang Meng
Xinyu Wang
Jinpeng Wang
...
Hai-Tao Zheng
Fan Yang
Tingting Gao
Di Zhang
Kun Gai
OffRL
54
0
0
21 Apr 2025
Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe
Tharindu Cyril Weerasooriya
Marcos Zampieri
Christopher Homan
S. Liyanage
48
0
0
02 Apr 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
N. Sebe
Massimiliano Mancini
MU
60
0
0
14 Mar 2025
SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan
Chen Lyu
Hafiz Muhammad Umer
Sahrish Khan
Mahathi Parvatham
Lois Arthurs
Simon Cullen
Shelley Wilson
Arshad Jhumka
Gabriele Pergola
49
0
0
09 Mar 2025
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek
Arianna Bisazza
Raquel Fernández
105
0
0
17 Feb 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil
Vipul Gupta
Sarkar Snigdha Sarathi Das
R. Passonneau
202
0
0
07 Feb 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun Xia
Tianyi Wu
Zhiwei Xue
Yuxiao Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
131
14
0
30 Jan 2025
Dynamics of Toxicity in Political Podcasts
Naquee Rizwan
Nayandeep Deb
Sarthak Roy
Vishwajeet Singh Solanki
Kiran Garimella
Animesh Mukherjee
69
0
0
22 Jan 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
SILM
81
6
0
08 Jan 2025
Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
Manuel Weber
Moritz Huber
Maximilian Auch
Alexander Döschl
Max-Emanuel Keller
P. Mandl
32
0
0
03 Jan 2025
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
LLM-jp
Akiko Aizawa
Eiji Aramaki
Bowen Chen
Fei Cheng
...
Yuya Yamamoto
Yusuke Yamauchi
Hitomi Yanaka
Rio Yokota
Koichiro Yoshino
57
14
0
31 Dec 2024
Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Paloma Piot
Javier Parapar
83
173
0
18 Dec 2024
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau
Diyi Liu
Niyati Malhotra
Scott A. Hale
Samuel Fraiberger
Victor Orozco-Olvera
Paul Röttger
81
0
0
23 Nov 2024
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng
Mansi Rana
Andreas Stolcke
75
1
0
21 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yunfan Lu
Hongyu Lin
ALM
86
2
0
18 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
73
0
0
18 Nov 2024
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu
Erfan Shayegani
Trishna Chakraborty
Rohit Lal
Arindam Dutta
Chengyu Song
Yue Dong
Nael B. Abu-Ghazaleh
A. Roy-Chowdhury
36
0
0
06 Nov 2024
On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
60
2
0
14 Oct 2024
JurEE not Judges: Safeguarding LLM Interactions with Small, Specialised Encoder Ensembles
Dom Nasrabadi
31
1
0
11 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
87
1
0
09 Oct 2024
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
...
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Köhler
Nicolas Flores-Herr
72
6
0
30 Sep 2024
Alignment with Preference Optimization Is All You Need for LLM Safety
Réda Alami
Ali Khalifa Almansoori
Ahmed Alzubaidi
M. Seddik
Mugariya Farooq
Hakim Hacid
40
1
0
12 Sep 2024
Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu
Junzhe Yu
Huijia Sun
Ling Shi
Gelei Deng
Yuqi Chen
Yang Liu
37
4
0
21 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Donghai Hong
Jiaying Chen
Wenhai Wang
44
2
0
18 Aug 2024
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
AAML
31
0
0
11 Aug 2024
The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement
Jian Li
Bowen Xu
Sören Schwertfeger
27
2
0
01 Aug 2024
Towards Generalized Offensive Language Identification
A. Dmonte
Tejas Arya
Tharindu Ranasinghe
Marcos Zampieri
52
3
0
26 Jul 2024
SAFETY-J: Evaluating Safety with Critique
Yixiu Liu
Yuxiang Zheng
Shijie Xia
Jiajun Li
Yi Tu
Chaoling Song
Pengfei Liu
ELM
37
2
0
24 Jul 2024
Tracking Patterns in Toxicity and Antisocial Behavior Over User Lifetimes on Large Social Media Platforms
Katy Blumer
Jon Kleinberg
18
0
0
12 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
45
9
0
10 Jul 2024
Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim
Jaewon Jung
Sangyeop Kim
S. Park
Sungzoon Cho
64
0
0
09 Jul 2024
R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Bo-wen Li
LRM
43
12
0
08 Jul 2024
Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
26
4
0
01 Jul 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
43
69
0
26 Jun 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
36
1
0
25 Jun 2024
LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo
Shaun Khoo
38
4
0
24 Jun 2024
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li
Zheng-Xin Yong
Stephen H. Bach
CLL
34
14
0
23 Jun 2024
Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas
Patrick Gage Kelley
David Tao
Sarah Meiklejohn
Owen Vallis
Shunwen Tan
Blaz Bratanic
Felipe Tiengo Ferreira
Vijay Eranti
Elie Bursztein
46
2
0
18 Jun 2024
TorchOpera: A Compound AI System for LLM Safety
Shanshan Han
Yuhang Yao
Zijian Hu
Dimitris Stripelis
Zhaozhuo Xu
Chaoyang He
LLMAG
44
0
0
16 Jun 2024
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang
Linzhi Zheng
Yanjie Li
Junyuan Hong
Qinbin Li
...
Zidi Xiong
Chulin Xie
Carl Yang
Dawn Song
Bo Li
LLMAG
45
23
0
13 Jun 2024
The Life Cycle of Large Language Models: A Review of Biases in Education
Jinsook Lee
Yann Hicke
Renzhe Yu
Christopher A. Brooks
René F. Kizilcec
AI4Ed
42
1
0
03 Jun 2024
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn
Alexandre Variengien
Charbel-Raphaël Ségerie
Vincent Corruble
32
7
0
03 Jun 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
40
19
0
31 May 2024
Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias
Rebecca Dorn
Lee Kezar
Fred Morstatter
Kristina Lerman
32
7
0
23 May 2024
Grounding Toxicity in Real-World Events across Languages
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
21
0
0
22 May 2024
Jill Watson: A Virtual Teaching Assistant powered by ChatGPT
Karan Taneja
Pratyusha Maiti
Sandeep Kakar
P. Guruprasad
Sanjeev Rao
Ashok K. Goel
35
23
0
17 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
39
8
0
08 May 2024
The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
13
0
0
29 Apr 2024