Towards Inference-time Category-wise Safety Steering for Large Language Models

2 October 2024

Papers citing "Towards Inference-time Category-wise Safety Steering for Large Language Models"

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper
2 November 2024

Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
Tom A. Lamb, Adam Davies, Alasdair Paren, Philip H. S. Torr, Francesco Pinto
30 October 2024