Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2009.11462
Cited By
v1
v2 (latest)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"
50 / 814 papers shown
Title
Of Models and Tin Men: A Behavioural Economics Study of Principal-Agent Problems in AI Alignment using Large-Language Models
S. Phelps
Rebecca E. Ranson
LLMAG
69
1
0
20 Jul 2023
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Seonghyeon Ye
Doyoung Kim
Sungdong Kim
Hyeonbin Hwang
Seungone Kim
Yongrae Jo
James Thorne
Juho Kim
Minjoon Seo
ALM
138
108
0
20 Jul 2023
Gender-tuning: Empowering Fine-tuning for Debiasing Pre-trained Language Models
Somayeh Ghanbarzadeh
Yan-ping Huang
Hamid Palangi
R. C. Moreno
Hamed Khanpour
70
12
0
20 Jul 2023
How is ChatGPT's behavior changing over time?
Lingjiao Chen
Matei A. Zaharia
James Zou
ELM
KELM
AI4MH
143
433
0
18 Jul 2023
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
Gelei Deng
Yi Liu
Yuekang Li
Kailong Wang
Ying Zhang
Zefeng Li
Haoyu Wang
Tianwei Zhang
Yang Liu
SILM
101
138
0
16 Jul 2023
Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots
Bocheng Chen
Guangjing Wang
Hanqing Guo
Yuanda Wang
Qiben Yan
95
17
0
14 Jul 2023
Effective Prompt Extraction from Language Models
Yiming Zhang
Nicholas Carlini
Daphne Ippolito
MIACV
SILM
109
43
0
13 Jul 2023
A Comprehensive Overview of Large Language Models
Humza Naveed
Asad Ullah Khan
Shi Qiu
Muhammad Saqib
Saeed Anwar
Muhammad Usman
Naveed Akhtar
Nick Barnes
Ajmal Mian
OffRL
261
630
0
12 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
100
506
0
10 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
154
125
0
06 Jul 2023
PREADD: Prefix-Adaptive Decoding for Controlled Text Generation
Jonathan Pei
Kevin Kaichuang Yang
Dan Klein
125
21
0
06 Jul 2023
A Survey on Evaluation of Large Language Models
Yu-Chu Chang
Xu Wang
Jindong Wang
Yuanyi Wu
Linyi Yang
...
Yue Zhang
Yi-Ju Chang
Philip S. Yu
Qian Yang
Xingxu Xie
ELM
LM&MA
ALM
223
1,773
0
06 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
238
1,005
0
05 Jul 2023
Understanding Counterspeech for Online Harm Mitigation
Yi-Ling Chung
Gavin Abercrombie
Florence E. Enock
Jonathan Bright
Verena Rieser
67
18
0
01 Jul 2023
Queer People are People First: Deconstructing Sexual Identity Stereotypes in Large Language Models
Harnoor Dhingra
Preetiha Jayashanker
Sayali S. Moghe
Emma Strubell
82
13
0
30 Jun 2023
Transformers in Healthcare: A Survey
Subhash Nerella
S. Bandyopadhyay
Jiaqing Zhang
Miguel Contreras
Scott Siegel
...
Jessica Sena
B. Shickel
A. Bihorac
Kia Khezeli
Parisa Rashidi
MedIm
AI4CE
98
35
0
30 Jun 2023
Stay on topic with Classifier-Free Guidance
Guillaume Sanchez
Honglu Fan
Alexander Spangher
Elad Levi
Pawan Sasanka Ammanamanchi
Stella Biderman
3DV
107
55
0
30 Jun 2023
ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles
Haoqin Tu
Bowen Yang
Xianfeng Zhao
63
6
0
29 Jun 2023
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Esin Durmus
Karina Nyugen
Thomas I. Liao
Nicholas Schiefer
Amanda Askell
...
Alex Tamkin
Janel Thamkul
Jared Kaplan
Jack Clark
Deep Ganguli
149
246
0
28 Jun 2023
CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
Yufei Huang
Deyi Xiong
ALM
134
19
0
28 Jun 2023
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
Neel Jain
Khalid Saifullah
Yuxin Wen
John Kirchenbauer
Manli Shu
Aniruddha Saha
Micah Goldblum
Jonas Geiping
Tom Goldstein
ALM
ELM
102
23
0
23 Jun 2023
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi
Kaixuan Huang
Ashwinee Panda
Peter Henderson
Mengdi Wang
Prateek Mittal
AAML
127
173
0
22 Jun 2023
Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong
Erik Jones
Jacob Steinhardt
108
36
0
21 Jun 2023
TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models
Yue Huang
Qihui Zhang
Philip S. Y
Lichao Sun
71
54
0
20 Jun 2023
Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling
Lin F. Yang
Hongyang Chen
Zhao Li
Xiao Ding
Xindong Wu
KELM
116
93
0
20 Jun 2023
Towards Theory-based Moral AI: Moral AI with Aggregating Models Based on Normative Ethical Theory
Masashi Takeshita
Rafal Rzepka
K. Araki
59
9
0
20 Jun 2023
KEST: Kernel Distance Based Efficient Self-Training for Improving Controllable Text Generation
Yuxi Feng
Xiaoyuan Yi
L. Lakshmanan
Xing Xie
75
1
0
17 Jun 2023
Conformal Language Modeling
Victor Quach
Adam Fisch
Tal Schuster
Adam Yala
J. Sohn
Tommi Jaakkola
Regina Barzilay
253
67
0
16 Jun 2023
CHORUS: Foundation Models for Unified Data Discovery and Exploration
Moe Kayali
A. Lykov
Ilias Fountalis
N. Vasiloglou
Dan Olteanu
Dan Suciu
99
25
0
16 Jun 2023
Evaluating the Social Impact of Generative AI Systems in Systems and Society
Irene Solaiman
Zeerak Talat
William Agnew
Lama Ahmad
Dylan K. Baker
...
Marie-Therese Png
Shubham Singh
A. Strait
Lukas Struppek
Arjun Subramonian
ELM
EGVM
152
117
0
09 Jun 2023
Prompt Injection attack against LLM-integrated Applications
Yi Liu
Gelei Deng
Yuekang Li
Kailong Wang
Zihao Wang
...
Tianwei Zhang
Yepang Liu
Haoyu Wang
Yanhong Zheng
Yang Liu
SILM
128
365
0
08 Jun 2023
Long-form analogies generated by chatGPT lack human-like psycholinguistic properties
S. M. Seals
V. Shalin
50
12
0
07 Jun 2023
Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning
Chujie Zheng
Pei Ke
Zheng Zhang
Minlie Huang
BDL
83
34
0
06 Jun 2023
AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms
Zana Buçinca
Chau Minh Pham
Maurice Jakesch
Marco Tulio Ribeiro
Alexandra Olteanu
Saleema Amershi
69
37
0
05 Jun 2023
Structured Voronoi Sampling
Afra Amini
Li Du
Ryan Cotterell
DiffM
101
2
0
05 Jun 2023
On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
Made Nindyatama Nityasya
Haryo Akbarianto Wibowo
Alham Fikri Aji
Genta Indra Winata
Radityo Eko Prasojo
Phil Blunsom
A. Kuncoro
65
8
0
05 Jun 2023
Exposing Bias in Online Communities through Large-Scale Language Models
Celine Wald
Lukas Pfahler
67
6
0
04 Jun 2023
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu
Yushi Hu
Weijia Shi
Nouha Dziri
Alane Suhr
Prithviraj Ammanabrolu
Noah A. Smith
Mari Ostendorf
Hannaneh Hajishirzi
ALM
168
336
0
02 Jun 2023
CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
Rahul Madhavan
Rishabh Garg
Kahini Wadhawan
S. Mehta
87
5
0
01 Jun 2023
Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Shengran Hu
Jeff Clune
LM&Ro
OffRL
LRM
AI4CE
89
29
0
01 Jun 2023
An Invariant Learning Characterization of Controlled Text Generation
Carolina Zheng
Claudia Shi
Keyon Vafa
Amir Feder
David M. Blei
OOD
103
8
0
31 May 2023
Controlled Text Generation with Hidden Representation Transformations
Vaibhav Kumar
H. Koorehdavoudi
Masud Moshtaghi
Amita Misra
Ankit Chadha
Emilio Ferrara
62
3
0
30 May 2023
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application
Hwaran Lee
Seokhee Hong
Joonsuk Park
Takyoung Kim
Gunhee Kim
Jung-Woo Ha
106
30
0
28 May 2023
SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration
Hwaran Lee
Seokhee Hong
Joonsuk Park
Takyoung Kim
M. Cha
...
Eun-Ju Lee
Yong Lim
Alice Oh
San-hee Park
Jung-Woo Ha
98
18
0
28 May 2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Deokjae Lee
JunYeong Lee
Jung-Woo Ha
Jin-Hwa Kim
Sang-Woo Lee
Hwaran Lee
Hyun Oh Song
AAML
91
25
0
27 May 2023
Generating Images with Multimodal Language Models
Jing Yu Koh
Daniel Fried
Ruslan Salakhutdinov
MLLM
164
259
0
26 May 2023
From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models
Julia Mendelsohn
Ronan Le Bras
Yejin Choi
Maarten Sap
61
29
0
26 May 2023
Training Socially Aligned Language Models on Simulated Social Interactions
Ruibo Liu
Ruixin Yang
Chenyan Jia
Ge Zhang
Denny Zhou
Andrew M. Dai
Diyi Yang
Soroush Vosoughi
ALM
80
56
0
26 May 2023
The False Promise of Imitating Proprietary LLMs
Arnav Gudibande
Eric Wallace
Charles Burton Snell
Xinyang Geng
Hao Liu
Pieter Abbeel
Sergey Levine
Dawn Song
ALM
144
208
0
25 May 2023
Scaling Data-Constrained Language Models
Niklas Muennighoff
Alexander M. Rush
Boaz Barak
Teven Le Scao
Aleksandra Piktus
Nouamane Tazi
S. Pyysalo
Thomas Wolf
Colin Raffel
ALM
196
226
0
25 May 2023
Previous
1
2
3
...
10
11
12
...
15
16
17
Next