Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.19464
Cited By
Curiosity-driven Red-teaming for Large Language Models
29 February 2024
Zhang-Wei Hong
Idan Shenfeld
Tsun-Hsuan Wang
Yung-Sung Chuang
Aldo Pareja
James R. Glass
Akash Srivastava
Pulkit Agrawal
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Curiosity-driven Red-teaming for Large Language Models"
10 / 10 papers shown
Title
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
Stefan Schoepf
Muhammad Zaid Hameed
Ambrish Rawat
Kieran Fraser
Giulio Zizzo
Giandomenico Cornacchia
Mark Purcell
42
0
0
08 Mar 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
V. Cevher
AAML
51
0
0
24 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
112
10
0
28 Jan 2025
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
Zhi-Yi Chin
Kuan-Chen Mu
Mario Fritz
Pin-Yu Chen
DiffM
87
0
0
25 Nov 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee
Haebin Seong
Dong Bok Lee
Minki Kang
Xiaoyin Chen
Dominik Wagner
Yoshua Bengio
Juho Lee
Sung Ju Hwang
67
2
0
02 Oct 2024
Adaptive teachers for amortized samplers
Minsu Kim
Sanghyeok Choi
Taeyoung Yun
Emmanuel Bengio
Leo Feng
Jarrid Rector-Brooks
Sungsoo Ahn
Jinkyoo Park
Nikolay Malkin
Yoshua Bengio
159
2
0
02 Oct 2024
CELL your Model: Contrastive Explanations for Large Language Models
Ronny Luss
Erik Miehling
Amit Dhurandhar
47
0
0
17 Jun 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
63
12
0
28 May 2024
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
225
446
0
23 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
1