arXiv:2009.11462
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith
Papers citing
"RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"
50 / 814 papers shown
TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation
Gwen Yidou Weng
Benjie Wang
Guy Van den Broeck
BDL
423
0
0
25 Apr 2025
Combating Toxic Language: A Review of LLM-Based Strategies for Software Engineering
Hao Zhuo
Yicheng Yang
Kewen Peng
55
0
0
21 Apr 2025
aiXamine: Simplified LLM Safety and Security
Fatih Deniz
Dorde Popovic
Yazan Boshmaf
Euisuh Jeong
M. Ahmad
Sanjay Chawla
Issa M. Khalil
ELM
341
0
0
21 Apr 2025
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Ziwen Xu
Shuxun Wang
Kewei Xu
Haoming Xu
Mengru Wang
Xinle Deng
Yunzhi Yao
Guozhou Zheng
Ningyu Zhang
Xin Xu
KELM
LLMSV
481
1
0
21 Apr 2025
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa
Yang Zhao
H. Kanayama
Bishwaranjan Bhattacharjee
65
0
0
19 Apr 2025
Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages
Alessio Buscemi
Cedric Lothritz
Sergio Morales
Marcos Gomez-Vazquez
Robert Clarisó
Jordi Cabot
German Castignani
57
0
0
19 Apr 2025
Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju
Weijie Shi
Chengzhong Liu
Yalan Qin
Jipeng Zhang
...
Jia Zhu
Jiajie Xu
Yaodong Yang
Sirui Han
Yike Guo
476
2
0
17 Apr 2025
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
153
0
0
07 Apr 2025
Increasing happiness through conversations with artificial intelligence
Joseph Heffner
Chongyu Qin
Martin Chadwick
Chris Knutsen
Christopher Summerfield
Zeb Kurth-Nelson
Robb B. Rutledge
AI4MH
90
0
0
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Ziqiang Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
152
0
0
02 Apr 2025
The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs
Zizhou Liu
Ziwei Gong
Lin Ai
Zheng Hui
Run Chen
Colin Wayne Leach
Michelle R. Greene
Julia Hirschberg
LLMAG
489
0
0
28 Mar 2025
Shared Global and Local Geometry of Language Model Embeddings
Andrew Lee
Melanie Weber
F. Viégas
Martin Wattenberg
FedML
123
7
0
27 Mar 2025
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Wenhao You
Bryan Hooi
Yiwei Wang
Yansen Wang
Zong Ke
Ming Yang
Zi Huang
Yujun Cai
AAML
100
0
0
24 Mar 2025
CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning
Andrew Rufail
Daniel Kim
Sean O'Brien
Kevin Zhu
LRM
71
0
0
24 Mar 2025
Opportunities and Challenges of Frontier Data Governance With Synthetic Data
Madhavendra Thakur
Jason Hausenloy
97
0
0
21 Mar 2025
Through the LLM Looking Glass: A Socratic Probing of Donkeys, Elephants, and Markets
Molly Kennedy
Ayyoob Imani
Timo Spinde
Hinrich Schütze
100
1
0
20 Mar 2025
LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
Ying Shen
Lifu Huang
104
2
0
20 Mar 2025
DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification
Cho Hyeonsu
Dooyoung Kim
Youngjoong Ko
MoMe
78
0
0
17 Mar 2025
Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis
Vagrant Gautam
Anne Lauscher
Dietrich Klakow
Iryna Gurevych
72
1
0
17 Mar 2025
The Amazon Nova Family of Models: Technical Report and Model Card
Amazon AGI
Aaron Langford
A. Shah
Abhanshu Gupta
Abhimanyu Bhatter
...
Benjamin Biggs
Benjamin Ott
Bhanu Vinzamuri
Bharath Venkatesh
Bhavana Ganesh
26
21
0
17 Mar 2025
No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models
Charaka Vinayak Kumar
Ashok Urlana
Gopichand Kanumolu
B. Garlapati
Pruthwik Mishra
ELM
103
1
0
15 Mar 2025
Mimicking How Humans Interpret Out-of-Context Sentences Through Controlled Toxicity Decoding
Maria Mihaela Trusca
Liesbeth Allein
77
0
0
11 Mar 2025
PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
Michael-Andrei Panaitescu-Liess
Pankayaraj Pathmanathan
Yigitcan Kaya
Zora Che
Bang An
Sicheng Zhu
Aakriti Agrawal
Furong Huang
AAML
118
2
0
10 Mar 2025
Red Team Diffuser: Exposing Toxic Continuation Vulnerabilities in Vision-Language Models via Reinforcement Learning
Ruofan Wang
Xiang Zheng
Xinyu Wang
Cong Wang
Jie Zhang
VLM
66
0
0
08 Mar 2025
Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models
Niccolò Turcato
Matteo Iovino
Aris Synodinos
Alberto Dalla Libera
R. Carli
Pietro Falco
LM&Ro
124
0
0
06 Mar 2025
Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language
Xi Chen
Shuo Wang
ReLM
LRM
120
0
0
03 Mar 2025
Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts
Akito Nakanishi
Yukie Sano
Geng Liu
Francesco Pierri
95
0
0
03 Mar 2025
A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
Lucky Susanto
M. Wijanarko
Prasetia Anugrah Pratama
Zilu Tang
Fariz Akyas
Traci Hong
Ika Idris
Alham Fikri Aji
Derry Wijaya
68
0
0
01 Mar 2025
The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang
Kehai Chen
X. Bai
Zhengyu Niu
Binghai Wang
Jie Liu
Min Zhang
LLMAG
101
0
0
28 Feb 2025
EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Che Hyun Lee
Heeseung Kim
Jiheum Yeom
Sungroh Yoon
DiffM
123
1
0
27 Feb 2025
Conformal Tail Risk Control for Large Language Model Alignment
Catherine Yu-Chi Chen
Jingyan Shen
Zhun Deng
Lihua Lei
116
1
0
27 Feb 2025
EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models
Nastaran Darabi
Devashri Naik
Sina Tayebati
Dinithi Jayasuriya
Ranganath Krishnan
A. R. Trivedi
AAML
165
0
0
24 Feb 2025
The Call for Socially Aware Language Technologies
Diyi Yang
Dirk Hovy
David Jurgens
Barbara Plank
VLM
153
12
0
24 Feb 2025
Encoding Inequity: Examining Demographic Bias in LLM-Driven Robot Caregiving
Raj Korpan
75
0
0
24 Feb 2025
CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
Chenlu Guo
Nuo Xu
Yi-Ju Chang
Yuan Wu
AI4MH
LM&MA
118
2
0
24 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
Volkan Cevher
AAML
118
0
0
24 Feb 2025
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
Haoyang Li
Tao Xie
Baishakhi Ray
97
6
0
23 Feb 2025
What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Marco Antonio Stranisci
Christian Hardmeier
165
1
0
17 Feb 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil
Vipul Gupta
Sarkar Snigdha Sarathi Das
R. Passonneau
434
0
0
07 Feb 2025
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
H. Malik
Fahad Shamshad
Muzammal Naseer
Karthik Nandakumar
Fahad Shahbaz Khan
Salman Khan
AAML
MLLM
VLM
141
1
0
03 Feb 2025
"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta
David Khachaturov
Robert D. Mullins
AAML
AuLLM
121
4
0
02 Feb 2025
Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models
Yuxuan Li
Hirokazu Shirado
Sauvik Das
68
5
0
29 Jan 2025
Risk-Aware Distributional Intervention Policies for Language Models
Bao Nguyen
Binh Nguyen
Duy Nguyen
V. Nguyen
119
2
0
28 Jan 2025
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol
Trilok Padhi
Agnik Saha
Ugur Kursuncu
Mehmet Emin Aktas
97
2
0
17 Jan 2025
Clinical Insights: A Comprehensive Review of Language Models in Medicine
Nikita Neveditsin
Pawan Lingras
V. Mago
LM&MA
127
5
0
08 Jan 2025
LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases
Dylan Bouchard
Mohit Singh Chauhan
David Skarbrevik
Viren Bajaj
Zeya Ahmad
83
0
0
06 Jan 2025
CALM: Curiosity-Driven Auditing for Large Language Models
Xiang Zheng
Longxiang Wang
Yi Liu
Jie Zhang
Chao Shen
Cong Wang
MLAU
101
2
0
06 Jan 2025
LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
Stefan Pasch
167
0
0
04 Jan 2025
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel
Kai Y. Xiao
Johannes Heidecke
Lilian Weng
AAML
78
7
0
24 Dec 2024
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li
Pin-Yu Chen
Tsung-Yi Ho
AAML
58
0
0
23 Dec 2024