
SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

Main: 6 pages · Appendix: 5 pages · Bibliography: 1 page · 4 figures · 3 tables
Abstract

Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM to potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
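
To make the pipeline concrete, the following is a minimal Python sketch of how a plug-and-play safety reasoning moderator might sit in front of a target LLM, based only on the description in the abstract. The moderate/respond function names, the chat-message format, and the alert-injection prompt are illustrative assumptions, not the paper's actual implementation.

# A minimal sketch of the moderator pipeline described above: the fine-tuned
# moderator reasons over the full multi-turn history, and any flagged risk is
# surfaced to the target LLM before it answers. All names here are assumptions.

from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": "user"|"assistant"|"system", "content": ...}
Generate = Callable[[List[Message]], str]   # any chat-completion backend

def moderate(dialogue: List[Message], moderator: Generate) -> str:
    """Ask the safety reasoning moderator for a verdict on the whole dialogue,
    e.g. "SAFE" or "UNSAFE: <reason>" (verdict format is an assumption)."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dialogue)
    prompt = [{"role": "user",
               "content": "Review this multi-turn dialogue for hidden malicious "
                          "intent. Reply SAFE or UNSAFE: <reason>.\n\n" + transcript}]
    return moderator(prompt).strip()

def respond(dialogue: List[Message], target: Generate, moderator: Generate) -> str:
    """Route each user turn through the moderator; on a flagged risk, inject the
    alert into the target LLM's context (the "plug-and-play" step)."""
    verdict = moderate(dialogue, moderator)
    if verdict.upper().startswith("UNSAFE"):
        # The warning lets the target model refuse or answer cautiously instead
        # of completing a harmful objective spread across earlier turns.
        dialogue = dialogue + [{"role": "system",
                                "content": f"Safety moderator alert: {verdict}"}]
    return target(dialogue)

Because the moderator only reads the conversation and writes an alert message, it can in principle be attached to any target LLM without retraining that model, which is what "plug-and-play" suggests here.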

@article{kuo2025_2506.00668,
  title={SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues},
  author={Martin Kuo and Jianyi Zhang and Aolin Ding and Louis DiValentin and Amin Hass and Benjamin F Morris and Isaac Jacobson and Randolph Linderman and James Kiessling and Nicolas Ramos and Bhavna Gopal and Maziyar Baran Pouyan and Changwei Liu and Hai Li and Yiran Chen},
  journal={arXiv preprint arXiv:2506.00668},
  year={2025}
}