ResearchTrend.AI
arXiv:2410.16665
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
v3 (latest)
22 October 2024
Jing-Jing Li
Valentina Pyatkin
Max Kleiman-Weiner
Liwei Jiang
Nouha Dziri
Anne Collins
Jana Schaich Borg
Maarten Sap
Yejin Choi
Sydney Levine
arXiv (abs) · PDF · HTML

Papers citing "SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior"

1 of 1 papers shown
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
ALM · ELM
20 Jun 2024