RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith

Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"

50 / 772 papers shown
Survey for Landing Generative AI in Social and E-commerce Recsys -- the Industry Perspectives
Da Xu, Danqing Zhang, Guangyu Yang, Bo Yang, Shuyuan Xu, Lingling Zheng, Cindy Liang
10 Jun 2024
Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
10 Jun 2024
Creativity Has Left the Chat: The Price of Debiasing Language Models
Behnam Mohammadi
08 Jun 2024
A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques
Megh Thakkar, Quentin Fournier, Matthew D Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, Sarath Chandar
ALM · 07 Jun 2024
Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models
Jisu Shin, Hoyun Song, Huije Lee, Soyeong Jeong, Jong C. Park
06 Jun 2024
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Neemesh Yadav, Sarah Masud, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty
06 Jun 2024
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou
AAML · 06 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu
04 Jun 2024
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, K. Johnson, Jiliang Tang, Rongrong Wang
LRM · 04 Jun 2024
Analyzing Social Biases in Japanese Large Language Models
Hitomi Yanaka, Namgi Han, Ryoma Kumon, Jie Lu, Masashi Takeshita, Ryo Sekizawa, Taisei Kato, Hiromi Arai
04 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, Yang Xiang
04 Jun 2024
Safeguarding Large Language Models: A Survey
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
OffRL, KELM, AILaw · 03 Jun 2024
Understanding Token Probability Encoding in Output Embeddings
Hakaze Cho, Yoshihiro Sakai, Kenshiro Tanaka, Mariko Kato, Naoya Inoue
03 Jun 2024
LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models
Tianci Liu, Haoyu Wang, Shiyang Wang, Yu Cheng, Jing Gao
ALM · 01 Jun 2024
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak
Zhenxing Niu, Yuyao Sun, Haoxuan Ji, Zheng Lin, Haichang Gao, Xinbo Gao, Gang Hua, Rong Jin
30 May 2024
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, ..., Chaowei Xiao, Bo-wen Li, Dawn Song, Peter Henderson, Prateek Mittal
AAML · 29 May 2024
Expert-Guided Extinction of Toxic Tokens for Debiased Generation
Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li
MU · 29 May 2024
Are PPO-ed Language Models Hackable?
Suraj Anand, David Getzen
28 May 2024
Low-rank finetuning for LLMs: A fairness perspective
Saswat Das, Marco Romanelli, Cuong Tran, Zarreen Reza, B. Kailkhura, Ferdinando Fioretto
28 May 2024
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models
Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim
LLMAG, HILM · 28 May 2024
Aligning to Thousands of Preferences via System Message Generalization
Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo
ALM · 28 May 2024
White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang
AAML, VLM · 28 May 2024
Privacy-Aware Visual Language Models
Laurens Samson, Nimrod Barazani, S. Ghebreab, Yuki M. Asano
PILM, VLM · 27 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang
26 May 2024
Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges
Jonas Becker, Jan Philip Wahle, Bela Gipp, Terry Ruas
24 May 2024
Linearly Controlled Language Generation with Performative Guarantees
Emily Cheng, Marco Baroni, Carmen Amo Alonso
24 May 2024
Robustifying Safety-Aligned Large Language Models through Clean Data Curation
Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi
AAML · 24 May 2024
Large Language Model Sentinel: LLM Agent for Adversarial Purification
Guang Lin, Qibin Zhao
AAML · 24 May 2024
Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs
Yihao Huang, Chong Wang, Xiaojun Jia, Qing Guo, Felix Juefei-Xu, Jian Zhang, G. Pu, Yang Liu
23 May 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim De, Yiting He, Yiqiao Zhong, Junjie Hu
22 May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim, Gary Geunbae Lee
AAML · 21 May 2024
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Jiaxu Liu, Xiangyu Yin, Sihao Wu, Jianhong Wang, Meng Fang, Xinping Yi, Xiaowei Huang
21 May 2024
Unveiling and Manipulating Prompt Influence in Large Language Models
Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Junlang Qian, Kezhi Mao
20 May 2024
Sociotechnical Implications of Generative Artificial Intelligence for Information Access
Bhaskar Mitra, Henriette Cramer, Olya Gurevich
19 May 2024
MBIAS: Mitigating Bias in Large Language Models While Retaining Context
Shaina Raza, Ananya Raval, Veronica Chatrath
18 May 2024
Exploring Subjectivity for more Human-Centric Assessment of Social Biases in Large Language Models
Paula Akemi Aoyagui, Sharon Ferguson, Anastasia Kuzminykh
17 May 2024
Realistic Evaluation of Toxicity in Large Language Models
Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen
LM&MA · 17 May 2024
Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models
Shaz Furniturewala, Surgan Jandial, Abhinav Java, Pragyan Banerjee, Simra Shahid, Sumit Bhatia, Kokil Jaidka
16 May 2024
Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT
Shucheng Zhu, Weikang Wang, Ying Liu
11 May 2024
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani, Ruchira Ray
08 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM
  Generated Conversations
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
42
8
0
08 May 2024
AffirmativeAI: Towards LGBTQ+ Friendly Audit Frameworks for Large Language Models
Yinru Long, Zilin Ma, Yiyang Mei, Zhaoyuan Su
AI4MH · 07 May 2024
FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models
Yanhong Bai, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xingjiao Wu, Liang He
06 May 2024
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang, H. Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia
05 May 2024
Controllable Text Generation in the Instruction-Tuning Era
D. Ashok, Barnabas Poczos
02 May 2024
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
Aaron Jiaxun Li, Satyapriya Krishna, Himabindu Lakkaraju
29 Apr 2024
LangBiTe: A Platform for Testing Bias in Large Language Models
Sergio Morales, Robert Clarisó, Jordi Cabot
29 Apr 2024
SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, B. Kailkhura, Sijia Liu
MU · 28 Apr 2024
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, ..., Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, A. Zaikin, Si-Qing Chen
LM&MA · 22 Apr 2024
Stepwise Alignment for Constrained Language Model Policy Optimization
Akifumi Wachi, Thien Q. Tran, Rei Sato, Takumi Tanabe, Yohei Akimoto
17 Apr 2024