2202.11176
Cited By
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
22 February 2022
Alyssa Lees
Vinh Q. Tran
Yi Tay
Jeffrey Scott Sorensen
Jai Gupta
Donald Metzler
Lucy Vasserman
Papers citing
"A New Generation of Perspective API: Efficient Multilingual Character-level Transformers"
50 / 102 papers shown
Title
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan
Mengxuan Hu
Ronghang Zhu
Sheng Li
Anil Vullikanti
AAML
31
0
0
11 May 2025
Mapping the Italian Telegram Ecosystem: Communities, Toxicity, and Hate Speech
Lorenzo Alvisi
S. Tardelli
Maurizio Tesconi
188
0
0
28 Apr 2025
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu
Tianke Zhang
Chang Meng
Xinyu Wang
Jinpeng Wang
...
Hai-Tao Zheng
Fan Yang
Tingting Gao
Di Zhang
Kun Gai
OffRL
54
0
0
21 Apr 2025
Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe
Tharindu Cyril Weerasooriya
Marcos Zampieri
Christopher Homan
S. Liyanage
48
0
0
02 Apr 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
N. Sebe
Massimiliano Mancini
MU
60
0
0
14 Mar 2025
SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan
Chen Lyu
Hafiz Muhammad Umer
Sahrish Khan
Mahathi Parvatham
Lois Arthurs
Simon Cullen
Shelley Wilson
Arshad Jhumka
Gabriele Pergola
49
0
0
09 Mar 2025
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek
Arianna Bisazza
Raquel Fernández
105
0
0
17 Feb 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil
Vipul Gupta
Sarkar Snigdha Sarathi Das
R. Passonneau
202
0
0
07 Feb 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun Xia
Tianyi Wu
Zhiwei Xue
Yuxiao Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
131
14
0
30 Jan 2025
Dynamics of Toxicity in Political Podcasts
Naquee Rizwan
Nayandeep Deb
Sarthak Roy
Vishwajeet Singh Solanki
Kiran Garimella
Animesh Mukherjee
69
0
0
22 Jan 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
SILM
81
6
0
08 Jan 2025
Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
Manuel Weber
Moritz Huber
Maximilian Auch
Alexander Döschl
Max-Emanuel Keller
P. Mandl
32
0
0
03 Jan 2025
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
LLM-jp
Akiko Aizawa
Eiji Aramaki
Bowen Chen
Fei Cheng
...
Yuya Yamamoto
Yusuke Yamauchi
Hitomi Yanaka
Rio Yokota
Koichiro Yoshino
57
14
0
31 Dec 2024
Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Paloma Piot
Javier Parapar
83
173
0
18 Dec 2024
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau
Diyi Liu
Niyati Malhotra
Scott A. Hale
Samuel Fraiberger
Victor Orozco-Olvera
Paul Röttger
81
0
0
23 Nov 2024
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng
Mansi Rana
Andreas Stolcke
75
1
0
21 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yunfan Lu
Hongyu Lin
ALM
86
2
0
18 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
73
0
0
18 Nov 2024
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu
Erfan Shayegani
Trishna Chakraborty
Rohit Lal
Arindam Dutta
Chengyu Song
Yue Dong
Nael B. Abu-Ghazaleh
A. Roy-Chowdhury
36
0
0
06 Nov 2024
On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
60
2
0
14 Oct 2024
JurEE not Judges: Safeguarding LLM Interactions with Small, Specialised Encoder Ensembles
Dom Nasrabadi
31
1
0
11 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
87
1
0
09 Oct 2024
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
...
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Köhler
Nicolas Flores-Herr
72
6
0
30 Sep 2024
Alignment with Preference Optimization Is All You Need for LLM Safety
Réda Alami
Ali Khalifa Almansoori
Ahmed Alzubaidi
M. Seddik
Mugariya Farooq
Hakim Hacid
40
1
0
12 Sep 2024
Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu
Junzhe Yu
Huijia Sun
Ling Shi
Gelei Deng
Yuqi Chen
Yang Liu
37
4
0
21 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Donghai Hong
Jiaying Chen
Wenhai Wang
44
2
0
18 Aug 2024
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
AAML
31
0
0
11 Aug 2024
The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement
Jian Li
Bowen Xu
Sören Schwertfeger
27
2
0
01 Aug 2024
Towards Generalized Offensive Language Identification
A. Dmonte
Tejas Arya
Tharindu Ranasinghe
Marcos Zampieri
52
3
0
26 Jul 2024
SAFETY-J: Evaluating Safety with Critique
Yixiu Liu
Yuxiang Zheng
Shijie Xia
Jiajun Li
Yi Tu
Chaoling Song
Pengfei Liu
ELM
37
2
0
24 Jul 2024
Tracking Patterns in Toxicity and Antisocial Behavior Over User Lifetimes on Large Social Media Platforms
Katy Blumer
Jon Kleinberg
18
0
0
12 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
45
9
0
10 Jul 2024
Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim
Jaewon Jung
Sangyeop Kim
S. Park
Sungzoon Cho
64
0
0
09 Jul 2024
R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Bo-wen Li
LRM
43
12
0
08 Jul 2024
Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
26
4
0
01 Jul 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
43
69
0
26 Jun 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
36
1
0
25 Jun 2024
LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo
Shaun Khoo
38
4
0
24 Jun 2024
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li
Zheng-Xin Yong
Stephen H. Bach
CLL
34
14
0
23 Jun 2024
Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas
Patrick Gage Kelley
David Tao
Sarah Meiklejohn
Owen Vallis
Shunwen Tan
Blaz Bratanic
Felipe Tiengo Ferreira
Vijay Eranti
Elie Bursztein
46
2
0
18 Jun 2024
TorchOpera: A Compound AI System for LLM Safety
Shanshan Han
Yuhang Yao
Zijian Hu
Dimitris Stripelis
Zhaozhuo Xu
Chaoyang He
LLMAG
44
0
0
16 Jun 2024
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang
Linzhi Zheng
Yanjie Li
Junyuan Hong
Qinbin Li
...
Zidi Xiong
Chulin Xie
Carl Yang
Dawn Song
Bo Li
LLMAG
45
23
0
13 Jun 2024
The Life Cycle of Large Language Models: A Review of Biases in Education
Jinsook Lee
Yann Hicke
Renzhe Yu
Christopher A. Brooks
René F. Kizilcec
AI4Ed
42
1
0
03 Jun 2024
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn
Alexandre Variengien
Charbel-Raphaël Ségerie
Vincent Corruble
32
7
0
03 Jun 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
40
19
0
31 May 2024
Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias
Rebecca Dorn
Lee Kezar
Fred Morstatter
Kristina Lerman
32
7
0
23 May 2024
Grounding Toxicity in Real-World Events across Languages
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
21
0
0
22 May 2024
Jill Watson: A Virtual Teaching Assistant powered by ChatGPT
Karan Taneja
Pratyusha Maiti
Sandeep Kakar
P. Guruprasad
Sanjeev Rao
Ashok K. Goel
35
23
0
17 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
39
8
0
08 May 2024
The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
13
0
0
29 Apr 2024