2410.16665
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
22 October 2024
Jing-Jing Li
Valentina Pyatkin
Max Kleiman-Weiner
Liwei Jiang
Nouha Dziri
Anne Collins
Jana Schaich Borg
Maarten Sap
Yejin Choi
Sydney Levine
Papers citing "SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior"
1 / 1 papers shown
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
20 Jun 2024