ResearchTrend.AI
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
31 January 2025
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore R. Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
arXiv:2501.18837

Papers citing "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming"

25 / 25 papers shown

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety [AAML, ELM]
Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael
17 Jun 2025

Jailbreak Strength and Model Similarity Predict Transferability
Rico Angell, Jannik Brinkmann, He He
15 Jun 2025

Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
12 Jun 2025

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors [AAML]
Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
12 Jun 2025

VerIF: Verification Engineering for Reinforcement Learning in Instruction Following [OffRL]
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
11 Jun 2025

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao
11 Jun 2025

Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
11 Jun 2025

Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George Pappas, Eric Wong, Hamed Hassani
06 Jun 2025

Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models
Bumjin Park, Jinsil Lee, Jaesik Choi
01 Jun 2025

Learning Safety Constraints for Large Language Models
Xin Chen, Yarden As, Andreas Krause
30 May 2025

A Red Teaming Roadmap Towards System-Level Safety [AAML]
Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow Primack, Julian Michael
30 May 2025

Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations
Sanjay Kariyappa, G. E. Suh
25 May 2025

An Example Safety Case for Safeguards Against Misuse
Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies
23 May 2025

Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
20 May 2025

Noise Injection Systemically Degrades Large Language Model Safety Guardrails [AAML]
Prithviraj Singh Shahani, Matthias Scheutz
16 May 2025

Access Controls Will Solve the Dual-Use Dilemma [AAML]
Evžen Wybitul
14 May 2025

JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift [AAML]
Julien Piet, Xiao Huang, Dennis Jacob, Annabella Chow, Maha Alrashed, Geng Zhao, Zhanhao Hu, Chawin Sitawarin, Basel Alomair, David Wagner
28 Apr 2025

The Structural Safety Generalization Problem [AAML]
Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
13 Apr 2025

Inference-Time Scaling for Generalist Reward Modeling [OffRL, LRM]
Zijun Liu, P. Wang, Ran Xu, Shirong Ma, Chong Ruan, Ziwei Sun, Yang Liu, Y. Wu
03 Apr 2025

Superintelligence Strategy: Expert Version
Dan Hendrycks, Eric Schmidt, Alexandr Wang
07 Mar 2025

À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
K. O. T. Erziev
28 Feb 2025

FLAME: Flexible LLM-Assisted Moderation Engine
Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, V. Shaposhnikov, Dmitry V. Dylov, Ivan Oseledets
13 Feb 2025

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate [AAML, ELM]
Javier Rando, Jie Zhang, Nicholas Carlini, F. Tramèr
04 Feb 2025

OverThink: Slowdown Attacks on Reasoning LLMs [LRM]
A. Kumar, Jaechul Roh, A. Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, Eugene Bagdasarian
04 Feb 2025

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Shing Yee Chan, Shaun Khoo
20 Nov 2024