arXiv: 2009.11462 (v2, latest)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models" (showing 50 of 814):
- PL-Guard: Benchmarking Language Model Safety for Polish
  Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa (19 Jun 2025)
- Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
  Zhengyang Shan, Emily Ruth Diana, Jiawei Zhou (18 Jun 2025)
- Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs
  Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin (16 Jun 2025)
- Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
  Hen Davidov, Gilad Freidkin, Shai Feldman, Yaniv Romano (16 Jun 2025)
- Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
  Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong, Wenbo Hu (14 Jun 2025)
- Hatevolution: What Static Benchmarks Don't Tell Us
  Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela (13 Jun 2025)
- WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models
  Qiyue Yin, Pei Xu, Qiaozhe Li, Shengda Liu, S. Shen, ..., Lei Cui, Chengxin Yan, Jie Sun, Xiangquan Tang, K. Huang (12 Jun 2025)
- Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints
  Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, S. Niekum, Philip S Thomas (09 Jun 2025)
- A Systematic Review of Poisoning Attacks Against Large Language Models
  Neil Fendley, Edward W. Staley, Joshua Carney, William Redman, Marie Chau, Nathan G. Drenkow (06 Jun 2025)
- Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
  Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak (06 Jun 2025)
- IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
  Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto (03 Jun 2025)
- Something Just Like TRuST: Toxicity Recognition of Span and Target
  Berk Atil, Namrata Sureddy, R. Passonneau (02 Jun 2025)
- Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model
  Yuanhe Tian, Mingjie Deng, Guoqing Jin, Yan Song (02 Jun 2025)
- IF-GUIDE: Influence Function-Guided Detoxification of LLMs
  Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong (02 Jun 2025)
- Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation
  Jacob K Christopher, Michael Cardei, Jinhao Liang, Ferdinando Fioretto (01 Jun 2025)
- SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
  Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy (31 May 2025)
- Learning Safety Constraints for Large Language Models
  Xin Chen, Yarden As, Andreas Krause (30 May 2025)
- Cascading Adversarial Bias from Injection to Distillation in Language Models
  Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea (30 May 2025)
- Large Language Models Often Know When They Are Being Evaluated
  Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn (28 May 2025)
- The Multilingual Divide and Its Impact on Global AI Safety
  Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, ..., Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker (27 May 2025)
- Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation
  Zhibo Wang, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li (27 May 2025)
- Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
  H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim (26 May 2025)
- Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation
  Tanjil Hasan Sakib, Md. Tanzib Hosain, Md. Kishor Morol (26 May 2025)
- Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
  Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li (26 May 2025)
- SGM: A Framework for Building Specification-Guided Moderation Filters
  M. Fatehkia, Enes Altinisik, Husrev Taha Sencar (26 May 2025)
- A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
  Yuzheng Hu, Fan Wu, Haotian Ye, David A. Forsyth, James Y. Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao (25 May 2025)
- Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
  Amel Muminovic (25 May 2025)
- Paying Alignment Tax with Contrastive Learning
  Buse Sibel Korkmaz, Rahul Nair, Elizabeth M. Daly, Antonio del Rio Chanona (25 May 2025)
- Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
  Reva Schwartz, Rumman Chowdhury, Akash Kundu, Heather Frase, Marzieh Fadaee, ..., Andrew Thompson, Maya Carlyle, Qinghua Lu, Matthew Holmes, Theodora Skeadas (24 May 2025)
- Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
  Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Ningyu Zhang, N. Zhang (23 May 2025)
- Relative Bias: A Comparative Framework for Quantifying Bias in LLMs
  Alireza Arbabi, Florian Kerschbaum (22 May 2025)
- MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models
  Bohan Jin, Shuhan Qi, Kehai Chen, Xinyi Guo, Xuan Wang (22 May 2025)
- Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
  Himanshu Beniwal, Y. Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen (22 May 2025)
- Revealing Language Model Trajectories via Kullback-Leibler Divergence
  Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira (21 May 2025)
- OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models
  Burak Erinç Çetin, Yıldırım Özen, Elif Naz Demiryılmaz, Kaan Engür, Cagri Toraman (21 May 2025)
- Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models
  Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz (20 May 2025)
- Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
  Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, Hari Sundaram (20 May 2025)
- Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
  Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, Animesh Mukherjee (20 May 2025)
- Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration
  Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze (18 May 2025)
- Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
  Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang (17 May 2025)
- CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
  Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, Chen-Hsiang Yu (16 May 2025)
- Retrieval Augmented Generation Evaluation for Health Documents
  Mario Ceresa, Lorenzo Bertolini, Valentin Comte, Nicholas Spadaro, Barbara Raffael, ..., Sergio Consoli, Amalia Muñoz Piñeiro, Alex Patak, Maddalena Querci, Tobias Wiesenthal (07 May 2025)
- A Survey on Progress in LLM Alignment from the Perspective of Reward Design
  Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem (05 May 2025)
- Teaching Models to Understand (but not Generate) High-risk Data
  Ryan Yixiang Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia (05 May 2025)
- Semantic Probabilistic Control of Language Models
  Kareem Ahmed, Catarina G Belém, Padhraic Smyth, Sameer Singh (04 May 2025)
- Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
  Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal (04 May 2025)
- SAGE: A Generic Framework for LLM Safety Evaluation
  Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat (28 Apr 2025)
- Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
  Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun (27 Apr 2025)
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
  Yixin Cao, Shibo Hong, Xuzhao Li, Jiahao Ying, Yubo Ma, ..., Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Tianwei Zhang (26 Apr 2025)
- MODP: Multi Objective Directional Prompting
  Aashutosh Nema, Samaksh Gulati, Evangelos Giakoumakis, Bipana Thapaliya (25 Apr 2025)