Versions: v1, v2 (latest)

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

24 September 2020
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith

Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"

50 / 814 papers shown
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh
Shauli Ravfogel
Jonathan Herzig
Roee Aharoni
Ryan Cotterell
Ponnurangam Kumaraguru
LLMSV
77
16
0
15 Feb 2024
AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach
Maryam Amirizaniani
Elias Martin
Tanya Roosta
Aman Chadha
Chirag Shah
75
3
0
14 Feb 2024
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong
Zhanhui Zhou
Chao Yang
Jing Shao
Yu Qiao
ELM
132
68
0
14 Feb 2024
Evaluating the Experience of LGBTQ+ People Using Large Language Model Based Chatbots for Mental Health Support
Zilin Ma
Yiyang Mei
Yinru Long
Zhaoyuan Su
Krzysztof Z. Gajos
AI4MH
74
26
0
14 Feb 2024
Rethinking Machine Unlearning for Large Language Models
Sijia Liu
Yuanshun Yao
Jinghan Jia
Stephen Casper
Nathalie Baracaldo
...
Hang Li
Kush R. Varshney
Mohit Bansal
Sanmi Koyejo
Yang Liu
AILaw, MU
188
120
0
13 Feb 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Ahmet Üstün
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM, ELM, SyDa, LRM
98
231
0
12 Feb 2024
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu
T. Sumers
Ishita Dasgupta
Thomas Griffiths
LLMAG
76
17
0
11 Feb 2024
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan
Erik Jones
Meena Jagadeesan
Jacob Steinhardt
KELM
98
33
0
09 Feb 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li
Bowen Dong
Ruohui Wang
Xuhao Hu
Wangmeng Zuo
Dahua Lin
Yu Qiao
Jing Shao
ELM
129
106
0
07 Feb 2024
Large Language Models are Geographically Biased
Rohin Manvi
Samar Khanna
Marshall Burke
David B. Lobell
Stefano Ermon
107
54
0
05 Feb 2024
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin
Ruoxi Chen
Peiyan Zhang
Andy Zhou
Yang Zhang
Haohan Wang
LLMAG
111
28
0
05 Feb 2024
Jailbreaking Attack against Multimodal Large Language Model
Zhenxing Niu
Haoxuan Ji
Xinbo Gao
Gang Hua
Rong Jin
97
76
0
04 Feb 2024
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes
Isabel O. Gallegos
Ryan Rossi
Joe Barrow
Md Mehrab Tanjim
Tong Yu
Hanieh Deilamsalehy
Ruiyi Zhang
Sungchul Kim
Franck Dernoncourt
71
23
0
03 Feb 2024
Building Guardrails for Large Language Models
Yizhen Dong
Ronghui Mu
Gao Jin
Yi Qi
Jinwei Hu
Xingyu Zhao
Jie Meng
Wenjie Ruan
Xiaowei Huang
OffRL
134
32
0
02 Feb 2024
Trustworthy Distributed AI Systems: Robustness, Privacy, and Governance
Wenqi Wei
Ling Liu
128
20
0
02 Feb 2024
Instruction Makes a Difference
Tosin Adewumi
Nudrat Habib
Lama Alkhaled
Elisa Barney
VLM, MLLM
69
1
0
01 Feb 2024
LLaMandement: Large Language Models for Summarization of French Legislative Proposals
Joseph Gesnouin
Yannis Tannier
Christophe Gomes Da Silva
Hatim Tapory
Camille Brier
...
Emmanuel Cortes
Pierre-Etienne Devineau
Ulrich Tan
Esther Mac Namara
Su Yang
AILaw
90
8
0
29 Jan 2024
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer
Anusha Sinha
Wesley Hanwen Deng
Zachary Chase Lipton
Hoda Heidari
AAML
123
75
0
29 Jan 2024
ARGS: Alignment as Reward-Guided Search
Maxim Khanov
Jirayu Burapacheep
Yixuan Li
130
62
0
23 Jan 2024
From Understanding to Utilization: A Survey on Explainability for Large Language Models
Haoyan Luo
Lucia Specia
136
25
0
23 Jan 2024
Understanding User Experience in Large Language Model Interactions
Jiayin Wang
Weizhi Ma
Peijie Sun
Min Zhang
Jian-yun Nie
83
36
0
16 Jan 2024
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
T. Klein
Moin Nabi
73
1
0
16 Jan 2024
Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation
Meng Cao
Lei Shu
Lei Yu
Yun Zhu
Nevan Wichers
Yinxiao Liu
Lei Meng
OffRL, ALM
60
7
0
14 Jan 2024
Parameter-Efficient Detoxification with Contrastive Decoding
Tong Niu
Caiming Xiong
Semih Yavuz
Yingbo Zhou
58
14
0
13 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
131
318
0
12 Jan 2024
Combating Adversarial Attacks with Multi-Agent Debate
Steffi Chern
Zhen Fan
Andy Liu
AAML
69
8
0
11 Jan 2024
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui
Yanling Wang
Chuanpu Fu
Yong Xiao
Sijia Li
...
Junwu Xiong
Xinyu Kong
ZuJie Wen
Ke Xu
Qi Li
165
64
0
11 Jan 2024
Understanding LLMs: A Comprehensive Overview from Training to Inference
Yi-Hsueh Liu
Haoyang He
Tianle Han
Xu-Yao Zhang
Mengyuan Liu
...
Xintao Hu
Tuo Zhang
Ning Qiang
Tianming Liu
Bao Ge
SyDa
166
80
0
04 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
150
121
0
03 Jan 2024
A Comprehensive Study of Knowledge Editing for Large Language Models
Ningyu Zhang
Yunzhi Yao
Bo Tian
Peng Wang
Shumin Deng
...
Lei Liang
Qing Cui
Xiao-Jun Zhu
Jun Zhou
Huajun Chen
KELM
173
89
0
02 Jan 2024
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
Yihan Chen
Benfeng Xu
Quan Wang
Yi Liu
Zhendong Mao
ALM, ELM
87
29
0
01 Jan 2024
Align on the Fly: Adapting Chatbot Behavior to Established Norms
Chunpu Xu
Steffi Chern
Ethan Chern
Ge Zhang
Zekun Wang
Ruibo Liu
Jing Li
Jie Fu
Pengfei Liu
79
20
0
26 Dec 2023
Time is Encoded in the Weights of Finetuned Language Models
Kai Nylund
Suchin Gururangan
Noah A. Smith
AI4TS
157
26
0
20 Dec 2023
Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao
Zhun Deng
David Madras
James Zou
Mengye Ren
MU, KELM, CLL
152
18
0
20 Dec 2023
Faithful Model Evaluation for Model-Based Metrics
Palash Goyal
Qian Hu
Rahul Gupta
25
1
0
19 Dec 2023
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang
Xiaoyuan Yi
Han Jiang
Shanlin Zhou
Zhihua Wei
Xing Xie
75
15
0
13 Dec 2023
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu
Yufei Li
Wen Xiao
Cong Liu
Yue Dong
AAML
105
5
0
12 Dec 2023
Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding
Lifu Tu
Semih Yavuz
Jin Qu
Jiacheng Xu
Rui Meng
Caiming Xiong
Yingbo Zhou
43
1
0
11 Dec 2023
A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation
Jarad Forristal
Niloofar Mireshghallah
Greg Durrett
Taylor Berg-Kirkpatrick
157
6
0
07 Dec 2023
A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Kareem Ahmed
Kai-Wei Chang
Guy Van den Broeck
141
12
0
06 Dec 2023
Weakly Supervised Detection of Hallucinations in LLM Activations
Miriam Rateike
C. Cintas
John Wamburu
Tanya Akumu
Skyler Speakman
81
14
0
05 Dec 2023
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Yifan Yao
Jinhao Duan
Kaidi Xu
Yuanfang Cai
Eric Sun
Yue Zhang
PILM, ELM
128
569
0
04 Dec 2023
Tackling Bias in Pre-trained Language Models: Current Trends and Under-represented Societies
Vithya Yogarajan
Gillian Dobbie
Te Taka Keegan
R. Neuwirth
ALM
98
13
0
03 Dec 2023
Personality of AI
Byunggu Yu
Junwhan Kim
37
1
0
03 Dec 2023
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian
Peng Liu
Lemei Zhang
Terje Nissen Farup
Even W. Lauvrak
Jon Espen Ingvaldsen
Simen Eide
J. Gulla
Zhirong Yang
ELM
97
6
0
03 Dec 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
Shiyao Cui
Zhenyu Zhang
Yilong Chen
Wenyuan Zhang
Tianyun Liu
Siqi Wang
Tingwen Liu
97
17
0
30 Nov 2023
Fair Text-to-Image Diffusion via Fair Mapping
Jia Li
Lijie Hu
Jingfeng Zhang
Tianhang Zheng
Hua Zhang
Di Wang
155
17
0
29 Nov 2023
Unveiling the Implicit Toxicity in Large Language Models
Jiaxin Wen
Pei Ke
Hao Sun
Zhexin Zhang
Chengfei Li
Jinfeng Bai
Minlie Huang
75
31
0
29 Nov 2023
SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata
Mark Díaz
Sunipa Dev
Emily Reif
Remi Denton
Vinodkumar Prabhakaran
103
4
0
28 Nov 2023
DUnE: Dataset for Unified Editing
Afra Feyza Akyürek
Eric Pan
Garry Kuwanto
Derry Wijaya
KELM
86
18
0
27 Nov 2023