ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.
arXiv:2009.11462
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
24 September 2020
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith

Papers citing "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"

Showing 50 of 772 citing papers.
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
H. Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan
AAML · MLLM · VLM · 03 Feb 2025

"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta, David Khachaturov, Robert D. Mullins
AAML · AuLLM · 02 Feb 2025

Risk-Aware Distributional Intervention Policies for Language Models
Bao Nguyen, Binh Nguyen, Duy Nguyen, V. Nguyen
28 Jan 2025

Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol, Trilok Padhi, Agnik Saha, Ugur Kursuncu, Mehmet Emin Aktas
17 Jan 2025

Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse
Anna Kołos, Katarzyna Lorenc, Emilia Wisnios, Agnieszka Karlinska
08 Jan 2025
Clinical Insights: A Comprehensive Review of Language Models in Medicine
Nikita Neveditsin, Pawan Lingras, V. Mago
LM&MA · 08 Jan 2025

LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases
Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad
06 Jan 2025

CALM: Curiosity-Driven Auditing for Large Language Models
Xiang Zheng, Longxiang Wang, Yi Liu, Jie Zhang, Chao Shen, Cong Wang
MLAU · 06 Jan 2025

LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
Stefan Pasch
04 Jan 2025

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng
AAML · 24 Dec 2024
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho
AAML · 23 Dec 2024

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, Marvin von Hagen
22 Dec 2024

Chained Tuning Leads to Biased Forgetting
Megan Ung, Alicia Sun, Samuel J. Bell, Bhaktipriya Radharapu, Levent Sagun, Adina Williams
CLL · KELM · 21 Dec 2024

Jailbreaking? One Step Is Enough!
Weixiong Zheng, Peijian Zeng, Y. Li, Hongyan Wu, Nankai Lin, Jianfei Chen, Aimin Yang, Yue Zhou
AAML · 17 Dec 2024
ClarityEthic: Explainable Moral Judgment Utilizing Contrastive Ethical Insights from Large Language Models
Yuxi Sun, Wei Gao, Jing Ma, Hongzhan Lin, Ziyang Luo, Wenxuan Zhang
ELM · 17 Dec 2024

If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World
Adrian de Wynter
02 Dec 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, ..., Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
AAML · 26 Nov 2024

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu, Sophia Ananiadou
17 Nov 2024
Bias in Large Language Models: Origin, Evaluation, and Mitigation
Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, Shuo Shuo Liu
AILaw · 16 Nov 2024

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni, Jonathan Colaço-Carr, Yash More, Jackie CK Cheung, G. Farnadi
12 Nov 2024

SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle, Felix Friedrich, Manuel Brack, Bjorn Deiseroth, P. Schramowski, Kristian Kersting
11 Nov 2024

Minion: A Technology Probe for Resolving Value Conflicts through Expert-Driven and User-Driven Strategies in AI Companion Applications
Xianzhe Fan, Qing Xiao, Xuhui Zhou, Yuran Su, Zhicong Lu, Maarten Sap, Hong Shen
11 Nov 2024
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity Reduction
Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi
10 Nov 2024

CoPrompter: User-Centric Evaluation of LLM Instruction Alignment for Improved Prompt Engineering
Ishika Joshi, Simra Shahid, Shreeya Venneti, Manushree Vasu, Yantao Zheng, Yunyao Li, Balaji Krishnamurthy, Gromit Yeuk-Yin Chan
09 Nov 2024

Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, ..., B. Li, Yejin Choi, Mengzhao Chen, Chaowei Xiao
MU · 05 Nov 2024

"It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health
Jiawei Zhou, Amy Z. Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury
04 Nov 2024
UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar
03 Nov 2024

Identifying Implicit Social Biases in Vision-Language Models
Kimia Hamidieh, Haoran Zhang, Walter Gerych, Thomas Hartvigsen, Marzyeh Ghassemi
VLM · 01 Nov 2024

Controlling Language and Diffusion Models by Transporting Activations
P. Rodríguez, Arno Blaas, Michal Klein, Luca Zappella, N. Apostoloff, Marco Cuturi, Xavier Suau
LLMSV · 30 Oct 2024

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
Zhihao Liu, Chenhui Hu
ALM · ELM · 29 Oct 2024
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu
25 Oct 2024

RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework
Yifan Wang, Vera Demberg
24 Oct 2024

ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models
Han Zhang, Hongfu Gao, Qiang Hu, Guanhua Chen, L. Yang, Bingyi Jing, Hongxin Wei, Bing Wang, Haifeng Bai, Lei Yang
AILaw · ELM · 24 Oct 2024

CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models
Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann
LLMSV · 23 Oct 2024
Cross-model Control: Improving Multiple Large Language Models in One-time Training
Jiayi Wu, Hao Sun, Hengyi Cai, Lixin Su, S. Wang, Dawei Yin, Xiang Li, Ming Gao
MU · 23 Oct 2024

WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models
Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, Sijia Liu
MU · 23 Oct 2024

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David M. Krueger
LLMSV · 22 Oct 2024

NetSafe: Exploring the Topological Safety of Multi-agent Networks
Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang
21 Oct 2024
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Aidan Wong, He Cao, Zijing Liu, Yu Li
21 Oct 2024

A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia
Lance Calvin Lim Gamboa, Mark Lee
20 Oct 2024

SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao
AAML · SILM · 17 Oct 2024

Controllable Generation via Locally Constrained Resampling
Kareem Ahmed, Kai-Wei Chang, Mathias Niepert
17 Oct 2024
POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song
AI4CE · 16 Oct 2024

BenchmarkCards: Large Language Model and Risk Reporting
Anna Sokol, Nuno Moniz, Elizabeth M. Daly, Michael Hind, Nitesh V. Chawla
16 Oct 2024

A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang, Jiahui Zhu, Minjie Ren, Ziqiang Liu, Shiwei Li, ..., Yiming Lei, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang
SyDa · 16 Oct 2024

Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization
Xingqi Wang, Xiaoyuan Yi, Xing Xie, Jia Jia
16 Oct 2024
With a Grain of SALT: Are LLMs Fair Across Social Dimensions?
Samee Arif, Zohaib Khan, Agha Ali Raza, Awais Athar
16 Oct 2024

Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang, J. Yang, Wei Peng
LLMSV · 16 Oct 2024

Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Ruimeng Ye, Yang Xiao, Bo Hui
ALM · ELM · OffRL · 16 Oct 2024

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, ..., Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
MoMe · 15 Oct 2024