
Should LLM Safety Be More Than Refusing Harmful Instructions?

Main: 7 pages, 2 figures, 9 tables; Appendix: 4 pages; Bibliography: 3 pages
Abstract

This paper presents a systematic evaluation of Large Language Models' (LLMs') behavior on long-tail distributed (encrypted) texts and the safety implications of that behavior. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful content in generated responses. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one of the two dimensions, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding LLM safety in long-tail text scenarios and provides directions for developing robust safety mechanisms.
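
The two-dimensional framework can be read as scoring each model response on a pair of binary judgments. The sketch below is a hypothetical illustration, not the authors' implementation: the judge functions `is_refusal` and `is_harmful` are placeholders that would in practice be an LLM judge or trained classifier.

```python
# Minimal sketch (assumed, not from the paper) of scoring a response to an
# obfuscated harmful instruction along the two safety dimensions:
# (1) instruction refusal and (2) generation safety.

from dataclasses import dataclass
from typing import Callable


@dataclass
class SafetyScore:
    refused: bool         # dimension 1: did the model reject the instruction?
    harmful_output: bool  # dimension 2: does the response contain harmful content?

    @property
    def outcome(self) -> str:
        if self.refused and not self.harmful_output:
            return "safe refusal"
        if self.refused and self.harmful_output:
            return "leaky refusal"      # refuses, yet still leaks harmful content
        if not self.refused and self.harmful_output:
            return "unsafe compliance"  # mismatched-generalization failure
        return "benign compliance"


def evaluate(response: str,
             is_refusal: Callable[[str], bool],
             is_harmful: Callable[[str], bool]) -> SafetyScore:
    """Score a single response on both safety dimensions."""
    return SafetyScore(refused=is_refusal(response),
                       harmful_output=is_harmful(response))


if __name__ == "__main__":
    # Trivial keyword heuristics, purely illustrative stand-ins for real judges.
    refusal_check = lambda r: "i cannot" in r.lower() or "i can't help" in r.lower()
    harm_check = lambda r: "step 1:" in r.lower()
    score = evaluate("I cannot assist with that request.", refusal_check, harm_check)
    print(score.outcome)  # -> "safe refusal"
```

Under this framing, a safety mechanism fails whenever the outcome is not a safe refusal for a harmful instruction; over-refusal would show up separately as refusals on benign long-tail inputs.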

@article{maskey2025_2506.02442,
  title={Should LLM Safety Be More Than Refusing Harmful Instructions?},
  author={Utsav Maskey and Mark Dras and Usman Naseem},
  journal={arXiv preprint arXiv:2506.02442},
  year={2025}
}