RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
arXiv:2009.11462 (v2, latest) · 24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models" (50 of 814 shown)
- Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse
  Anna Kołos, Katarzyna Lorenc, Emilia Wisnios, Agnieszka Karlinska · 23 Dec 2024
- Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
  Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, Marvin von Hagen · 22 Dec 2024
- Chained Tuning Leads to Biased Forgetting [CLL, KELM]
  Megan Ung, Alicia Sun, Samuel J. Bell, Bhaktipriya Radharapu, Levent Sagun, Adina Williams · 21 Dec 2024
- Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
  Vera Neplenbroek, Arianna Bisazza, Raquel Fernández · 18 Dec 2024
- Jailbreaking? One Step Is Enough! [AAML]
  Weixiong Zheng, Peijian Zeng, Yuchen Li, Hongyan Wu, Nankai Lin, Jianfei Chen, Aimin Yang, Yimiao Zhou · 17 Dec 2024
- ClarityEthic: Explainable Moral Judgment Utilizing Contrastive Ethical Insights from Large Language Models [ELM]
  Yuxi Sun, Wei Gao, Jing Ma, Hongzhan Lin, Ziyang Luo, Wenxuan Zhang · 17 Dec 2024
- SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
  Zhiwen Chen, Francesco Pinto, Minzhou Pan, Bo Li · 09 Dec 2024
- If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World
  Adrian de Wynter · 02 Dec 2024
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats [AAML]
  Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, ..., Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan · 26 Nov 2024
- Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
  Zeping Yu, Sophia Ananiadou · 17 Nov 2024
- Bias in Large Language Models: Origin, Evaluation, and Mitigation [AILaw]
  Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, Shuo Shuo Liu · 16 Nov 2024
- Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
  Khaoula Chehbouni, Jonathan Colaço-Carr, Yash More, Jackie CK Cheung, G. Farnadi · 12 Nov 2024
- SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
  Ruben Härle, Felix Friedrich, Manuel Brack, Bjorn Deiseroth, P. Schramowski, Kristian Kersting · 11 Nov 2024
- Minion: A Technology Probe for Resolving Value Conflicts through Expert-Driven and User-Driven Strategies in AI Companion Applications
  Xianzhe Fan, Qing Xiao, Xuhui Zhou, Yuran Su, Zhicong Lu, Maarten Sap, Hong Shen · 11 Nov 2024
- How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
  Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, Adam Mahdi · 10 Nov 2024
- CoPrompter: User-Centric Evaluation of LLM Instruction Alignment for Improved Prompt Engineering
  Ishika Joshi, Simra Shahid, Shreeya Venneti, Manushree Vasu, Yantao Zheng, Yunyao Li, Balaji Krishnamurthy, Gromit Yeuk-Yin Chan · 09 Nov 2024
- Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [MU]
  Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, ..., B. Li, Yejin Choi, Mengzhao Chen, Chaowei Xiao · 05 Nov 2024
- "It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health
  Jiawei Zhou, Amy Z. Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury · 04 Nov 2024
- UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
  Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar · 03 Nov 2024
- Identifying Implicit Social Biases in Vision-Language Models [VLM]
  Kimia Hamidieh, Haoran Zhang, Walter Gerych, Thomas Hartvigsen, Marzyeh Ghassemi · 01 Nov 2024
- Controlling Language and Diffusion Models by Transporting Activations [LLMSV]
  P. Rodríguez, Arno Blaas, Michal Klein, Luca Zappella, N. Apostoloff, Marco Cuturi, Xavier Suau · 30 Oct 2024
- CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs [ALM, ELM]
  Zhihao Liu, Chenhui Hu · 29 Oct 2024
- FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs
  Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu · 25 Oct 2024
- An Auditing Test To Detect Behavioral Shift in Language Models
  Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner · 25 Oct 2024
- Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
  Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu · 25 Oct 2024
- RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework
  Yifan Wang, Vera Demberg · 24 Oct 2024
- ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models [AILaw, ELM]
  Han Zhang, Hongfu Gao, Qiang Hu, Guanhua Chen, L. Yang, Bingyi Jing, Hongxin Wei, Bing Wang, Haifeng Bai, Lei Yang · 24 Oct 2024
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [MU]
  Jiayi Wu, Hao Sun, Hengyi Cai, Lixin Su, Shuaiqiang Wang, D. Yin, Xiang Li, Ming Gao · 23 Oct 2024
- CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models [LLMSV]
  Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann · 23 Oct 2024
- WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models [MU]
  Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, Sijia Liu · 23 Oct 2024
- Towards Reliable Evaluation of Behavior Steering Interventions in LLMs [LLMSV]
  Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David M. Krueger · 22 Oct 2024
- NetSafe: Exploring the Topological Safety of Multi-agent Networks
  Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang · 21 Oct 2024
- SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
  Aidan Wong, He Cao, Zijing Liu, Yu-Feng Li · 21 Oct 2024
- A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia
  Lance Calvin Lim Gamboa, Mark Lee · 20 Oct 2024
- SPIN: Self-Supervised Prompt INjection [AAML, SILM]
  Leon Zhou, Junfeng Yang, Chengzhi Mao · 17 Oct 2024
- Controllable Generation via Locally Constrained Resampling
  Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck · 17 Oct 2024
- A Survey on Data Synthesis and Augmentation for Large Language Models [SyDa]
  Ke Wang, Jiahui Zhu, Minjie Ren, Ziqiang Liu, Shiwei Li, ..., Yiming Lei, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang · 16 Oct 2024
- Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization
  Xingqi Wang, Xiaoyuan Yi, Xing Xie, Jia Jia · 16 Oct 2024
- With a Grain of SALT: Are LLMs Fair Across Social Dimensions?
  Samee Arif, Zohaib Khan, Agha Ali Raza, Awais Athar · 16 Oct 2024
- BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
  Anna Sokol, Elizabeth M. Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla · 16 Oct 2024
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [LLMSV]
  Weixuan Wang, J. Yang, Wei Peng · 16 Oct 2024
- POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization [AI4CE]
  Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song · 16 Oct 2024
- Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning [ALM, ELM, OffRL]
  Ruimeng Ye, Yang Xiao, Bo Hui · 16 Oct 2024
- Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence [MoMe]
  Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, ..., Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister · 15 Oct 2024
- Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs
  Divyanshu Kumar, Umang Jain, Sahil Agarwal, P. Harshangi · 13 Oct 2024
- JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles
  Dom Nasrabadi · 11 Oct 2024
- DAWN: Designing Distributed Agents in a Worldwide Network [AI4CE, LLMAG]
  Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, Mahesh Viswanathan · 11 Oct 2024
- COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act [ELM]
  Philipp Guldimann, Alexander Spiridonov, Robin Staab, Nikola Jovanović, Mark Vero, ..., Mislav Balunović, Nikola Konstantinov, Pavol Bielik, Petar Tsankov, Martin Vechev · 10 Oct 2024
- Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
  Shanshan Han · 09 Oct 2024
- Jet Expansions of Residual Computation
  Yihong Chen, Xiangxiang Xu, Yao Lu, Pontus Stenetorp, Luca Franceschi · 08 Oct 2024