Gradient-Based Language Model Red Teaming

30 January 2024

Papers citing "Gradient-Based Language Model Red Teaming"

Title
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | Jonathan Nöther, Adish Singla, Goran Radanović | AAML | 97 | 0 | 0 | 14 Jan 2025
Universal and Transferable Adversarial Attacks on Aligned Language Models | Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson | | 160 | 1,376 | 0 | 27 Jul 2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song | AAML | 48 | 23 | 0 | 27 May 2023
Robust Conversational Agents against Imperceptible Toxicity Triggers | Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan | AAML | 46 | 33 | 0 | 05 May 2022
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar | | 42 | 362 | 0 | 17 Mar 2022
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers | Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Scott Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman | | 50 | 182 | 0 | 22 Feb 2022
LaMDA: Language Models for Dialog Applications | R. Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, ..., Blaise Aguera-Arcas, Claire Cui, M. Croak, Ed H. Chi, Quoc Le | ALM | 96 | 1,577 | 0 | 20 Jan 2022
A Plug-and-Play Method for Controlled Text Generation | Damian Pascual, Béni Egressy, Clara Meister, Ryan Cotterell, Roger Wattenhofer | | 96 | 91 | 0 | 20 Sep 2021
FUDGE: Controlled Text Generation With Future Discriminators | Kevin Kaichuang Yang, Dan Klein | | 87 | 324 | 0 | 12 Apr 2021
Content preserving text generation with attribute controls | Lajanugen Logeswaran, Honglak Lee, Samy Bengio | | 59 | 117 | 0 | 03 Nov 2018
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables | Chris J. Maddison, A. Mnih, Yee Whye Teh | BDL | 111 | 2,523 | 0 | 02 Nov 2016