arXiv:2501.18837

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

31 January 2025
Mrinank Sharma
Meg Tong
Jesse Mu
Jerry Wei
Jorrit Kruthoff
Scott Goodfriend
Euan Ong
Alwin Peng
Raj Agarwal
Cem Anil
Amanda Askell
Nathan Bailey
Joe Benton
Emma Bluemke
Samuel R. Bowman
Eric Christiansen
Hoagy Cunningham
Andy Dau
Anjali Gopal
Rob Gilson
Logan Graham
Logan Howard
Nimit Kalra
Taesung Lee
Kevin Lin
Peter Lofgren
Francesco Mosconi
Clare O'Hara
Catherine Olsson
Linda Petrini
Samir Rajani
Nikhil Saxena
Alex Silverstein
Tanya Singh
Theodore R. Sumers
Leonard Tang
Kevin K. Troy
Constantin Weisser
Ruiqi Zhong
Giulio Zhou
Jan Leike
Jared Kaplan
Ethan Perez
Abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
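
The two deployment-cost figures in the abstract can be made concrete with a small arithmetic sketch. It reads the "absolute 0.38% increase" as 0.38 percentage points added to the refusal rate, and the 23.7% overhead as a relative increase in inference cost. The function names, the baseline refusal rate, the request count, and the baseline cost are hypothetical values chosen for illustration only; none of them come from the paper.

```python
# Figures quoted in the abstract.
ABS_REFUSAL_INCREASE_PP = 0.38  # absolute increase in refusals, in percentage points
INFERENCE_OVERHEAD = 0.237      # 23.7% relative inference overhead


def guarded_refusal_rate(baseline_rate_pct: float) -> float:
    """Refusal rate (in %) with classifiers, given a baseline rate in %.

    An absolute increase adds percentage points; it does not scale the baseline.
    """
    return baseline_rate_pct + ABS_REFUSAL_INCREASE_PP


def extra_refusals(num_requests: int) -> float:
    """Expected number of additional refusals over num_requests requests."""
    return num_requests * ABS_REFUSAL_INCREASE_PP / 100


def guarded_cost(baseline_cost: float) -> float:
    """Inference cost with classifiers, in the same unit as baseline_cost."""
    return baseline_cost * (1 + INFERENCE_OVERHEAD)


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    baseline_rate_pct = 1.0   # assumed baseline refusal rate in %
    num_requests = 100_000    # assumed traffic volume
    baseline_cost = 100.0     # assumed cost in arbitrary units

    print(f"refusal rate with classifiers: {guarded_refusal_rate(baseline_rate_pct):.2f}%")
    print(f"additional refusals: {extra_refusals(num_requests):.0f}")
    print(f"inference cost with classifiers: {guarded_cost(baseline_cost):.1f}")
```

Because the increase is absolute, the number of additional refusals depends only on traffic volume, not on the assumed baseline refusal rate.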
