RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (arXiv:2009.11462)
24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models" (showing 50 of 772)
Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Y. Ong, Qi Wang, Ke Tang · 17 May 2025

CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, Chen-Hsiang Yu · 16 May 2025 · AAML, MU, ELM

Retrieval Augmented Generation Evaluation for Health Documents
Mario Ceresa, Lorenzo Bertolini, Valentin Comte, Nicholas Spadaro, Barbara Raffael, ..., Sergio Consoli, Amalia Muñoz Piñeiro, Alex Patak, Maddalena Querci, Tobias Wiesenthal · 07 May 2025 · RALM, 3DV

Teaching Models to Understand (but not Generate) High-risk Data
Ryan Yixiang Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia · 05 May 2025

A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem · 05 May 2025

Semantic Probabilistic Control of Language Models
Kareem Ahmed, Catarina G Belém, Padhraic Smyth, Sameer Singh · 04 May 2025

Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal · 04 May 2025

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat · 28 Apr 2025 · ELM

Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun · 27 Apr 2025

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao, Shibo Hong, Xuzhao Li, Jiahao Ying, Yubo Ma, ..., Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Tianwei Zhang · 26 Apr 2025 · ALM, ELM

TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation
Gwen Yidou Weng, Benjie Wang, Mathias Niepert · 25 Apr 2025 · BDL

MODP: Multi Objective Directional Prompting
Aashutosh Nema, Samaksh Gulati, Evangelos Giakoumakis, Bipana Thapaliya · 25 Apr 2025 · LLMAG

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, H. Chen, Ningyu Zhang · 21 Apr 2025 · KELM, LLMSV

Combating Toxic Language: A Review of LLM-Based Strategies for Software Engineering
Hao Zhuo, Yicheng Yang, Kewen Peng · 21 Apr 2025

aiXamine: Simplified LLM Safety and Security
Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, M. Ahmad, Sanjay Chawla, Issa M. Khalil · 21 Apr 2025 · ELM

Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages
Alessio Buscemi, Cedric Lothritz, Sergio Morales, Marcos Gomez-Vazquez, Robert Clarisó, Jordi Cabot, German Castignani · 19 Apr 2025

Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa, Yang Zhao, H. Kanayama, Bishwaranjan Bhattacharjee · 19 Apr 2025

Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju, Weijie Shi, Chengzhong Liu, Yalan Qin, Jipeng Zhang, ..., Jia Zhu, Jiajie Xu, Yaodong Yang, Sirui Han, Yike Guo · 17 Apr 2025

Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han · 07 Apr 2025

Increasing happiness through conversations with artificial intelligence
Joseph Heffner, Chongyu Qin, Martin Chadwick, Chris Knutsen, Christopher Summerfield, Zeb Kurth-Nelson, Robb B. Rutledge · 02 Apr 2025 · AI4MH

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang, Yushen Zuo, Yuanjun Chai, Ziqiang Liu, Yichen Fu, Yichun Feng, Kin-Man Lam · 02 Apr 2025 · AAML, VLM

The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs
Zizhou Liu, Ziwei Gong, Lin Ai, Zheng Hui, Run Chen, Colin Wayne Leach, Michelle R. Greene, Julia Hirschberg · 28 Mar 2025 · LLMAG

Shared Global and Local Geometry of Language Model Embeddings
Andrew Lee, Melanie Weber, F. Viégas, Martin Wattenberg · 27 Mar 2025 · FedML

MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Wenhao You, Bryan Hooi, Yiwei Wang, Yansen Wang, Zong Ke, Ming Yang, Zi Huang, Yujun Cai · 24 Mar 2025 · AAML

CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning
Andrew Rufail, Daniel Kim, Sean O'Brien, Kevin Zhu · 24 Mar 2025 · LRM

Opportunities and Challenges of Frontier Data Governance With Synthetic Data
Madhavendra Thakur, Jason Hausenloy · 21 Mar 2025

Through the LLM Looking Glass: A Socratic Self-Assessment of Donkeys, Elephants, and Markets
Molly Kennedy, Ayyoob Imani, Timo Spinde, Hinrich Schütze · 20 Mar 2025

LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
Ying Shen, Lifu Huang · 20 Mar 2025

DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification
Cho Hyeonsu, Dooyoung Kim, Youngjoong Ko · 17 Mar 2025 · MoMe

Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych · 17 Mar 2025

No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models
Charaka Vinayak Kumar, Ashok Urlana, Gopichand Kanumolu, B. Garlapati, Pruthwik Mishra · 15 Mar 2025 · ELM

Constrained Language Generation with Discrete Diffusion Models
Michael Cardei, Jacob K Christopher, Thomas Hartvigsen, Brian Bartoldson, B. Kailkhura, Ferdinando Fioretto · 12 Mar 2025 · DiffM

Mimicking How Humans Interpret Out-of-Context Sentences Through Controlled Toxicity Decoding
Maria Mihaela Trusca, Liesbeth Allein · 11 Mar 2025

Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models
Niccolò Turcato, Matteo Iovino, Aris Synodinos, Alberto Dalla Libera, R. Carli, Pietro Falco · 06 Mar 2025 · LM&Ro

Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts
Akito Nakanishi, Yukie Sano, Geng Liu, Francesco Pierri · 03 Mar 2025

Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language
Xi Chen, Shuo Wang · 03 Mar 2025 · ReLM, LRM

A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
Lucky Susanto, M. Wijanarko, Prasetia Anugrah Pratama, Zilu Tang, Fariz Akyas, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya · 01 Mar 2025

The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang, Kehai Chen, X. Bai, Zhengyu Niu, Binghui Wang, Jie Liu, Min Zhang · 28 Feb 2025 · LLMAG

EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon · 27 Feb 2025 · DiffM

Conformal Tail Risk Control for Large Language Model Alignment
Catherine Yu-Chi Chen, Jingyan Shen, Zhun Deng, Lihua Lei · 27 Feb 2025

EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models
Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, A. R. Trivedi · 24 Feb 2025 · AAML

Encoding Inequity: Examining Demographic Bias in LLM-Driven Robot Caregiving
Raj Korpan · 24 Feb 2025

The Call for Socially Aware Language Technologies
Diyi Yang, Dirk Hovy, David Jurgens, Barbara Plank · 24 Feb 2025 · VLM

CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
Chenlu Guo, Nuo Xu, Yi-Ju Chang, Yuan Wu · 24 Feb 2025 · AI4MH, LM&MA

Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, V. Cevher · 24 Feb 2025 · AAML

Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, ..., Dezhi Ran, Tianle Gu, Yiming Li, Tao Xie, Baishakhi Ray · 23 Feb 2025

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Yuchen Wen, Keping Bi, Wei Chen, J. Guo, Xueqi Cheng · 20 Feb 2025

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández · 17 Feb 2025

What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Marco Antonio Stranisci, Christian Hardmeier · 17 Feb 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, R. Passonneau · 07 Feb 2025