arXiv:2009.11462
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models" (50 of 772 shown)
Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs
  Divyanshu Kumar, Umang Jain, Sahil Agarwal, P. Harshangi (13 Oct 2024)

DAWN: Designing Distributed Agents in a Worldwide Network
  Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, Mahesh Viswanathan (11 Oct 2024) [LLMAG, AI4CE]

JurEE not Judges: Safeguarding LLM Interactions with Small, Specialised Encoder Ensembles
  Dom Nasrabadi (11 Oct 2024)

COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act
  Philipp Guldimann, Alexander Spiridonov, Robin Staab, Nikola Jovanović, Mark Vero, ..., Mislav Balunović, Nikola Konstantinov, Pavol Bielik, Petar Tsankov, Martin Vechev (10 Oct 2024) [ELM]

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
  Shanshan Han (09 Oct 2024)

Jet Expansions of Residual Computation
  Yihong Chen, Xiangxiang Xu, Yao Lu, Pontus Stenetorp, Luca Franceschi (08 Oct 2024)

Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
  Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris (07 Oct 2024)

OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
  Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian (06 Oct 2024)

Large Language Models can be Strong Self-Detoxifiers
  Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel (04 Oct 2024)

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
  Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, ..., Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang (02 Oct 2024) [ELM]

Stars, Stripes, and Silicon: Unravelling the ChatGPT's All-American, Monochrome, Cis-centric Bias
  Federico Torrielli (02 Oct 2024)

Decoding Hate: Exploring Language Models' Reactions to Hate Speech
  Paloma Piot, Javier Parapar (01 Oct 2024)

Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
  Jiaming Li, Lei Zhang, Yunshui Li, Ziqiang Liu, Yuelin Bai, Run Luo, Longze Chen, Min Yang (27 Sep 2024) [ALM]

BeanCounter: A low-toxicity, large-scale, and open dataset of business-oriented text
  Siyan Wang, Bradford Levy (26 Sep 2024)

MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
  Giandomenico Cornacchia, Giulio Zizzo, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Mark Purcell (26 Sep 2024)

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
  Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee (26 Sep 2024) [AAML]

Decoding Large-Language Models: A Systematic Overview of Socio-Technical Impacts, Constraints, and Emerging Questions
  Zeyneb N. Kaya, Souvick Ghosh (25 Sep 2024)

XTRUST: On the Multilingual Trustworthiness of Large Language Models
  Yahan Li, Yi Wang, Yi-Ju Chang, Yuan Wu (24 Sep 2024) [HILM, LRM]

Creative Writers' Attitudes on Writing as Training Data for Large Language Models
  Katy Ilonka Gero, Meera Desai, Carly Schnitzler, Nayun Eom, Jack Cushman, Elena L. Glassman (22 Sep 2024)

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
  Robert D Morabito, Sangmitra Madhusudan, Tyler McDonald, Ali Emami (20 Sep 2024)

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models
  Peiyi Zhang, Yazhou Zhang, Bo Wang, Lu Rong, Jing Qin (19 Sep 2024) [AI4Ed, ELM]

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
  Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki (18 Sep 2024)

LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents
  Amine B. Hassouna, Hana Chaari, Ines Belhaj (17 Sep 2024) [LLMAG]

Jailbreaking Large Language Models with Symbolic Mathematics
  Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, S. Jha, Peyman Najafirad (17 Sep 2024) [AAML]

Alignment with Preference Optimization Is All You Need for LLM Safety
  Réda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, M. Seddik, Mugariya Farooq, Hakim Hacid (12 Sep 2024)

Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization
  Gentiana Rashiti, G. Karunaratne, Mrinmaya Sachan, Abu Sebastian, Abbas Rahimi (12 Sep 2024) [RALM]

Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
  Md Zarif Hossain, Ahmed Imteaj (11 Sep 2024) [AAML, VLM]

Identity-related Speech Suppression in Generative AI Content Moderation
  Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaë Metaxa, Sorelle A. Friedler (09 Sep 2024)

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
  Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan (01 Sep 2024) [AAML]

MQM-Chat: Multidimensional Quality Metrics for Chat Translation
  Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui (29 Aug 2024)

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
  Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny, Adel Bibi (27 Aug 2024)

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
  Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, ..., Yehoshua Cohen, Yonatan Belinkov, Y. Globerson, Yuval Peleg Levy, Y. Shoham (22 Aug 2024)

Can Artificial Intelligence Embody Moral Values?
  T. Swoboda, Lode Lauwaert (22 Aug 2024)

Efficient Detection of Toxic Prompts in Large Language Models
  Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu (21 Aug 2024)

Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
  Chenhan Yuan, Fei Huang, Ru Peng, Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou (20 Aug 2024) [KELM]

LeCov: Multi-level Testing Criteria for Large Language Models
  Xuan Xie, Jiayang Song, Yuheng Huang, Da Song, Fuyuan Zhang, Felix Juefei-Xu, Lei Ma (20 Aug 2024) [ELM]

Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
  Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, Noah A. Smith (12 Aug 2024)

Diffusion Guided Language Modeling
  Justin Lovelace, Varsha Kishore, Yiwei Chen, Kilian Q. Weinberger (08 Aug 2024)

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
  Shachi H. Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, R. Manuvinakurike, Nicole Beckage, Hsuan Su, Hung-yi Lee, L. Nachman (07 Aug 2024) [ELM]

Downstream bias mitigation is all you need
  Arkadeep Baksi, Rahul Singh, Tarun Joshi (01 Aug 2024) [AI4CE]

Blockchain for Large Language Model Security and Safety: A Holistic Survey
  Caleb Geren, Amanda Board, Gaby G. Dagher, Tim Andersen, Jun Zhuang (26 Jul 2024)

Understanding the Interplay of Scale, Data, and Bias in Language Models: A Case Study with BERT
  Muhammad Ali, Swetasudha Panda, Qinlan Shen, Michael Wick, Ari Kobren (25 Jul 2024) [MILM]

GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy
  Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci (25 Jul 2024)

Know Your Limits: A Survey of Abstention in Large Language Models
  Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang (25 Jul 2024)

Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
  Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata (24 Jul 2024) [MU]

Towards Aligning Language Models with Textual Feedback
  Sauc Abadal Lloret, S. Dhuliawala, K. Murugesan, Mrinmaya Sachan (24 Jul 2024) [VLM]

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
  Guang-Da Liu, Haitao Mao, Jiliang Tang, K. Johnson (21 Jul 2024) [LRM]

Consent in Crisis: The Rapid Decline of the AI Data Commons
  Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, Hamidah Oderinwale, ..., Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland (20 Jul 2024)

BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization
  Ahmed Allam (18 Jul 2024)

SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning
  Joseph Marvin Imperial, Harish Tayyar Madabushi (18 Jul 2024)