2406.16235
Cited By
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
23 June 2024
Xiaochen Li
Zheng-Xin Yong
Stephen H. Bach
CLL
Papers citing
"Preference Tuning For Toxicity Mitigation Generalizes Across Languages"
13 / 13 papers shown
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek
Arianna Bisazza
Raquel Fernández
103
0
0
17 Feb 2025
Learning to Summarize from LLM-generated Feedback
Hwanjun Song
Taewon Yun
Yuho Lee
Jihwan Oh
Gihun Lee
Jason (Jinglun) Cai
Hang Su
73
2
0
28 Jan 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Jannik Brinkmann
Chris Wendler
Christian Bartelt
Aaron Mueller
51
9
0
10 Jan 2025
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi
Zheng-Xin Yong
Yifei He
Bobbie Chern
Han Zhao
Aobo Yang
Jianfeng Chi
AAML
45
14
0
23 Oct 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
44
27
0
16 Jul 2024
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
Adrian de Wynter
Ishaan Watts
Nektar Ege Altıntoprak
Tua Wongsangaroonsri
Minghui Zhang
...
Anna Vickers
Stéphanie Visser
Herdyan Widarmanto
A. Zaikin
Si-Qing Chen
LM&MA
52
16
0
22 Apr 2024
MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages
Daryna Dementieva
N. Babakov
Alexander Panchenko
40
6
0
02 Apr 2024
A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre
Sayash Kapoor
Kevin Klyman
Ashwin Ramaswami
Rishi Bommasani
...
Daniel Kang
Sandy Pentland
Arvind Narayanan
Percy Liang
Peter Henderson
55
38
0
07 Mar 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
57
79
0
07 Feb 2024
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh
Winnie Xu
Niklas Muennighoff
Dan Jurafsky
Douwe Kiela
170
449
0
02 Feb 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
71
95
0
03 Jan 2024
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk
Ishita Mediratta
Christoforos Nalmpantis
Jelena Luketina
Eric Hambro
Edward Grefenstette
Roberta Raileanu
AI4CE
ALM
103
121
0
10 Oct 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
313
11,915
0
04 Mar 2022