RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
arXiv:2009.11462 (v2, latest) · 24 September 2020

Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models" (showing 50 of 814)
Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models
Jisu Shin, Hoyun Song, Huije Lee, Soyeong Jeong, Jong C. Park (06 Jun 2024)
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Neemesh Yadav, Sarah Masud, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty (06 Jun 2024)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou (06 Jun 2024) [AAML]
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu (04 Jun 2024)
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, K. Johnson, Jiliang Tang, Rongrong Wang (04 Jun 2024) [LRM]
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, Yang Xiang (04 Jun 2024)
JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models
Hitomi Yanaka, Namgi Han, Ryoma Kumon, Jie Lu, Masashi Takeshita, Ryo Sekizawa, Taisei Kato, Hiromi Arai (04 Jun 2024)
Safeguarding Large Language Models: A Survey
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang (03 Jun 2024) [OffRL, KELM, AILaw]
Understanding Token Probability Encoding in Output Embeddings
Hakaze Cho, Yoshihiro Sakai, Kenshiro Tanaka, Mariko Kato, Naoya Inoue (03 Jun 2024)
LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models
Tianci Liu, Haoyu Wang, Shiyang Wang, Yu Cheng, Jing Gao (01 Jun 2024) [ALM]
Towards Rationality in Language and Multimodal Agents: A Survey
Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Yuan Yuan, Camillo J. Taylor, Tanwi Mallick, Weijie J. Su (01 Jun 2024) [LLMAG]
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, ..., Chaowei Xiao, Yue Liu, Dawn Song, Peter Henderson, Prateek Mittal (29 May 2024) [AAML]
Expert-Guided Extinction of Toxic Tokens for Debiased Generation
Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li (29 May 2024) [MU]
Are PPO-ed Language Models Hackable?
Suraj Anand, David Getzen (28 May 2024)
Low-rank finetuning for LLMs: A fairness perspective
Saswat Das, Marco Romanelli, Cuong Tran, Zarreen Reza, B. Kailkhura, Ferdinando Fioretto (28 May 2024)
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models
Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim (28 May 2024) [LLMAG, HILM]
Aligning to Thousands of Preferences via System Message Generalization
Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo (28 May 2024) [ALM]
White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang (28 May 2024) [AAML, VLM]
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang (26 May 2024)
Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges
Jonas Becker, Jan Philip Wahle, Bela Gipp, Terry Ruas (24 May 2024)
Linearly Controlled Language Generation with Performative Guarantees
Emily Cheng, Marco Baroni, Carmen Amo Alonso (24 May 2024)
Robustifying Safety-Aligned Large Language Models through Clean Data Curation
Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi (24 May 2024) [AAML]
Large Language Model Sentinel: LLM Agent for Adversarial Purification
Guang Lin, Qibin Zhao (24 May 2024) [AAML]
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang, Chong Wang, Xiaojun Jia, Qing Guo, Felix Juefei Xu, Jian Zhang, G. Pu, Yang Liu (23 May 2024)
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim De, Yiting He, Yiqiao Zhong, Junjie Hu (22 May 2024)
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim, Gary Geunbae Lee (21 May 2024) [AAML]
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Jiaxu Liu, Xiangyu Yin, Sihao Wu, Jianhong Wang, Meng Fang, Xinping Yi, Xiaowei Huang (21 May 2024)
Unveiling and Manipulating Prompt Influence in Large Language Models
Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Junlang Qian, Kezhi Mao (20 May 2024)
Sociotechnical Implications of Generative Artificial Intelligence for Information Access
Bhaskar Mitra, Henriette Cramer, Olya Gurevich (19 May 2024)
MBIAS: Mitigating Bias in Large Language Models While Retaining Context
Shaina Raza, Ananya Raval, Veronica Chatrath (18 May 2024)
Exploring Subjectivity for more Human-Centric Assessment of Social Biases in Large Language Models
Paula Akemi Aoyagui, Sharon Ferguson, Anastasia Kuzminykh (17 May 2024)
Realistic Evaluation of Toxicity in Large Language Models
Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen (17 May 2024) [LM&MA]
Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models
Shaz Furniturewala, Surgan Jandial, Abhinav Java, Pragyan Banerjee, Simra Shahid, Sumita Bhatia, Kokil Jaidka (16 May 2024)
Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT
Shucheng Zhu, Weikang Wang, Ying Liu (11 May 2024)
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani, Ruchira Ray (08 May 2024)
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra (08 May 2024)
AffirmativeAI: Towards LGBTQ+ Friendly Audit Frameworks for Large Language Models
Yinru Long, Zilin Ma, Yiyang Mei, Zhaoyuan Su (07 May 2024) [AI4MH]
FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models
Yanhong Bai, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xingjiao Wu, Liang He (06 May 2024)
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang, H. Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia (05 May 2024)
Controllable Text Generation in the Instruction-Tuning Era
D. Ashok, Barnabas Poczos (02 May 2024)
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
Aaron Jiaxun Li, Satyapriya Krishna, Himabindu Lakkaraju (29 Apr 2024)
LangBiTe: A Platform for Testing Bias in Large Language Models
Sergio Morales, Robert Clarisó, Jordi Cabot (29 Apr 2024)
SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, B. Kailkhura, Sijia Liu (28 Apr 2024) [MU]
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, ..., Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, A. Zaikin, Si-Qing Chen (22 Apr 2024) [LM&MA]
Stepwise Alignment for Constrained Language Model Policy Optimization
Akifumi Wachi, Thien Q. Tran, Rei Sato, Takumi Tanabe, Yohei Akimoto (17 Apr 2024)
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Yu Li, Zhihua Wei, Han Jiang, Chuanyang Gong (16 Apr 2024) [LLMSV]
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
David Nadeau, Mike Kroutikov, Karen McNeil, Simon Baribeau (15 Apr 2024) [HILM]
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie, Jiayang Song, Zhehua Zhou, Yuheng Huang, Da Song, Lei Ma (12 Apr 2024) [OffRL]
FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations
Jane Dwivedi-Yu, Raaz Dwivedi, Timo Schick (09 Apr 2024)
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy (08 Apr 2024) [ELM, KELM]