STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

23 March 2025
Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
Abstract

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. Existing defense methods either remain vulnerable to adaptive attacks or require computationally expensive auxiliary models. We present STShield, a lightweight framework for real-time jailbreak judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
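
The core idea described in the abstract is to have the fine-tuned model emit one extra binary token after its response and to read that token as a safety verdict. The following is a minimal inference-time sketch of such a single-token check; the model name and the two indicator tokens are placeholders assumed for illustration, not artifacts released with the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "stshield-finetuned-llm"         # placeholder; not a released checkpoint
SAFE_TOKEN, UNSAFE_TOKEN = "safe", "unsafe"   # assumed binary indicator tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentinel_verdict(prompt: str, response: str) -> str:
    # The sentinel token is predicted at the position right after the response.
    inputs = tokenizer(prompt + response, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    safe_id = tokenizer.convert_tokens_to_ids(SAFE_TOKEN)
    unsafe_id = tokenizer.convert_tokens_to_ids(UNSAFE_TOKEN)
    return "safe" if next_token_logits[safe_id] >= next_token_logits[unsafe_id] else "unsafe"

Because the verdict is read from a single additional token, the check adds only one extra decoding step on top of normal generation.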
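On the training side, the abstract combines supervised fine-tuning with adversarial training using embedding-space perturbations. The sketch below shows one common way to realize such a perturbation step (an FGSM-style perturbation of the input embeddings); the paper's exact perturbation scheme and loss may differ.

import torch

def adversarial_step(model, input_ids, labels, epsilon=1e-3):
    # Look up the clean input embeddings and track gradients with respect to them.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)

    # Clean forward/backward pass to obtain the gradient of the loss
    # (e.g. cross-entropy on the sentinel label) in embedding space.
    model.zero_grad()
    loss = model(inputs_embeds=embeds, labels=labels).loss
    loss.backward()

    # FGSM-style step: move the embeddings in the loss-increasing direction.
    adv_embeds = (embeds + epsilon * embeds.grad.sign()).detach()

    # Train the model to still emit the correct safety token under the perturbation.
    adv_loss = model(inputs_embeds=adv_embeds, labels=labels).loss
    return adv_loss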

@article{wang2025_2503.17932,
  title={STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models},
  author={Xunguang Wang and Wenxuan Wang and Zhenlan Ji and Zongjie Li and Pingchuan Ma and Daoyuan Wu and Shuai Wang},
  journal={arXiv preprint arXiv:2503.17932},
  year={2025}
}