Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with
Minimal Impact on Coherence and Evasiveness in Dialogue Agents

Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents

21 May 2024

Gary Geunbae Lee

ArXiv (abs)PDF HTML

Papers citing "Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents"

16 / 16 papers shown

Title
Direct Preference Optimization: Your Language Model is Secretly a Reward Model Rafael Rafailov Archit Sharma E. Mitchell Stefano Ermon Christopher D. Manning Chelsea Finn ALM 389 4,169 0 29 May 2023
Is ChatGPT a Good NLG Evaluator? A Preliminary Study Jiaan Wang Yunlong Liang Fandong Meng Zengkui Sun Haoxiang Shi Zhixu Li Jinan Xu Jianfeng Qu Jie Zhou LM&MA ELM ALM AI4MH 140 472 0 07 Mar 2023
Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models Shrimai Prabhumoye M. Patwary Mohammad Shoeybi Bryan Catanzaro LM&MA 57 21 0 14 Feb 2023
Constitutional AI: Harmlessness from AI Feedback Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell John Kernion ... Dario Amodei Nicholas Joseph Sam McCandlish Tom B. Brown Jared Kaplan SyDa MoMe 214 1,646 0 15 Dec 2022
Self-critiquing models for assisting human evaluators William Saunders Catherine Yeh Jeff Wu Steven Bills Ouyang Long Jonathan Ward Jan Leike ALM ELM 114 306 0 12 Jun 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 900 13,228 0 04 Mar 2022
Ethical and social risks of harm from Language Models Laura Weidinger John F. J. Mellor Maribeth Rauh Conor Griffin J. Uesato ... Lisa Anne Hendricks William S. Isaac Sean Legassick G. Irving Iason Gabriel PILM 128 1,044 0 08 Dec 2021
LoRA: Low-Rank Adaptation of Large Language Models J. E. Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang Weizhu Chen OffRL AI4TS AI4CE ALM AIMat 509 10,526 0 17 Jun 2021
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models Samuel Gehman Suchin Gururangan Maarten Sap Yejin Choi Noah A. Smith 175 1,221 0 24 Sep 2020
Learning to summarize from human feedback Nisan Stiennon Long Ouyang Jeff Wu Daniel M. Ziegler Ryan J. Lowe Chelsea Voss Alec Radford Dario Amodei Paul Christiano ALM 262 2,192 0 02 Sep 2020
Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning Rahul Pandey Hemant Purohit Carlos Castillo V. Shalin 41 36 0 07 Jul 2020
Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills Eric Michael Smith Mary Williamson Kurt Shuster Jason Weston Y-Lan Boureau 81 224 0 17 Apr 2020
CTRL: A Conditional Transformer Language Model for Controllable Generation N. Keskar Bryan McCann Lav Varshney Caiming Xiong R. Socher AI4CE 142 1,239 0 11 Sep 2019
Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack Emily Dinan Samuel Humeau Bharath Chintagunta Jason Weston 96 248 0 17 Aug 2019
Proximal Policy Optimization Algorithms John Schulman Filip Wolski Prafulla Dhariwal Alec Radford Oleg Klimov OffRL 583 19,315 0 20 Jul 2017
Deep reinforcement learning from human preferences Paul Christiano Jan Leike Tom B. Brown Miljan Martic Shane Legg Dario Amodei 220 3,377 0 12 Jun 2017

We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from. See our policy.