RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith

Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"

50 / 772 papers shown
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
19 Feb 2024

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
Guijin Son, Sangwon Baek, Sangdae Nam, Ilgyun Jeong, Seungone Kim
ELM, LRM
18 Feb 2024

Controlled Text Generation for Large Language Model with Dynamic Attribute Graphs
Xun Liang, Hanyu Wang, Shichao Song, Mengting Hu, Xunzhi Wang, Zhiyu Li, Simin Niu
17 Feb 2024

Disclosure and Mitigation of Gender Bias in LLMs
Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
17 Feb 2024

Direct Preference Optimization with an Offset
Afra Amini, Tim Vieira, Ryan Cotterell
16 Feb 2024

Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
LLMSV
15 Feb 2024

AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach
Maryam Amirizaniani, Elias Martin, Tanya Roosta, Aman Chadha, Chirag Shah
14 Feb 2024

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
ELM
14 Feb 2024

Evaluating the Experience of LGBTQ+ People Using Large Language Model Based Chatbots for Mental Health Support
Zilin Ma, Yiyang Mei, Yinru Long, Zhaoyuan Su, Krzysztof Z. Gajos
AI4MH
14 Feb 2024

Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, ..., Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
AILaw, MU
13 Feb 2024

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, ..., Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker
ALM, ELM, SyDa, LRM
12 Feb 2024

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, T. Sumers, Ishita Dasgupta, Thomas Griffiths
LLMAG
11 Feb 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
KELM
09 Feb 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
ELM
07 Feb 2024

GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang
LLMAG
05 Feb 2024

Large Language Models are Geographically Biased
Rohin Manvi, Samar Khanna, Marshall Burke, David B. Lobell, Stefano Ermon
05 Feb 2024

Jailbreaking Attack against Multimodal Large Language Model
Zhenxing Niu, Haoxuan Ji, Xinbo Gao, Gang Hua, Rong Jin
04 Feb 2024

Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Tong Yu, Hanieh Deilamsalehy, Ruiyi Zhang, Sungchul Kim, Franck Dernoncourt
03 Feb 2024

Building Guardrails for Large Language Models
Yizhen Dong, Ronghui Mu, Gao Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang
OffRL
02 Feb 2024

Trustworthy Distributed AI Systems: Robustness, Privacy, and Governance
Wenqi Wei, Ling Liu
02 Feb 2024

Instruction Makes a Difference
Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney
VLM, MLLM
01 Feb 2024

LLaMandement: Large Language Models for Summarization of French Legislative Proposals
Joseph Gesnouin, Yannis Tannier, Christophe Gomes Da Silva, Hatim Tapory, Camille Brier, ..., Emmanuel Cortes, Pierre-Etienne Devineau, Ulrich Tan, Esther Mac Namara, Su Yang
AILaw
29 Jan 2024

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari
AAML
29 Jan 2024

ARGS: Alignment as Reward-Guided Search
Maxim Khanov, Jirayu Burapacheep, Yixuan Li
23 Jan 2024

From Understanding to Utilization: A Survey on Explainability for Large Language Models
Haoyan Luo, Lucia Specia
23 Jan 2024

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
T. Klein, Moin Nabi
16 Jan 2024

Understanding User Experience in Large Language Model Interactions
Jiayin Wang, Weizhi Ma, Peijie Sun, Min Zhang, Jian-yun Nie
16 Jan 2024

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation
Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng
OffRL, ALM
14 Jan 2024

Parameter-Efficient Detoxification with Contrastive Decoding
Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou
13 Jan 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
12 Jan 2024

Combating Adversarial Attacks with Multi-Agent Debate
Steffi Chern, Zhen Fan, Andy Liu
AAML
11 Jan 2024

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li
11 Jan 2024

Understanding LLMs: A Comprehensive Overview from Training to Inference
Yi-Hsueh Liu, Haoyang He, Tianle Han, Xu-Yao Zhang, Mengyuan Liu, ..., Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge
SyDa
04 Jan 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
03 Jan 2024

A Comprehensive Study of Knowledge Editing for Large Language Models
Ningyu Zhang, Yunzhi Yao, Bo Tian, Peng Wang, Shumin Deng, ..., Lei Liang, Qing Cui, Xiao-Jun Zhu, Jun Zhou, Huajun Chen
KELM
02 Jan 2024

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
ALM, ELM
01 Jan 2024

Align on the Fly: Adapting Chatbot Behavior to Established Norms
Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu
26 Dec 2023

Time is Encoded in the Weights of Finetuned Language Models
Kai Nylund, Suchin Gururangan, Noah A. Smith
AI4TS
20 Dec 2023

Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
MU, KELM, CLL
20 Dec 2023

Faithful Model Evaluation for Model-Based Metrics
Palash Goyal, Qian Hu, Rahul Gupta
19 Dec 2023

ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie
13 Dec 2023

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong
AAML
12 Dec 2023

Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding
Lifu Tu, Semih Yavuz, Jin Qu, Jiacheng Xu, Rui Meng, Caiming Xiong, Yingbo Zhou
11 Dec 2023

A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation
Jarad Forristal, Niloofar Mireshghallah, Greg Durrett, Taylor Berg-Kirkpatrick
07 Dec 2023

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
06 Dec 2023

Weakly Supervised Detection of Hallucinations in LLM Activations
Miriam Rateike, C. Cintas, John Wamburu, Tanya Akumu, Skyler Speakman
05 Dec 2023

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang
PILM, ELM
04 Dec 2023

Tackling Bias in Pre-trained Language Models: Current Trends and Under-represented Societies
Vithya Yogarajan, Gillian Dobbie, Te Taka Keegan, R. Neuwirth
ALM
03 Dec 2023

Personality of AI
Byunggu Yu, Junwhan Kim
03 Dec 2023

NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian
Peng Liu, Lemei Zhang, Terje Nissen Farup, Even W. Lauvrak, Jon Espen Ingvaldsen, Simen Eide, J. Gulla, Zhirong Yang
ELM
03 Dec 2023