Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

25 May 2025

Papers citing "Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations"


Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, ..., Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
209 34 0
31 Jan 2025

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, Christoph H. Lampert
117 27 0
11 Mar 2024