Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

26 May 2025
Hee-Seon Kim
Minbeom Kim
Wonjun Lee
Kihyun Kim
Changick Kim
Main: 9 pages · 14 figures · Bibliography: 4 pages · 12 tables · Appendix: 18 pages
Abstract

Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
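The contrast between the two objectives can be illustrated with a minimal optimization loop. The sketch below is only an illustration under stated assumptions, not the authors' released code: the `lvlm_nll` helper and the toy stand-in model are hypothetical placeholders for a real vision-language model's next-token cross-entropy, and the PGD-style update with an L-infinity budget is one common choice for image-space attacks. The only difference between the two settings is whether the text conditioning already carries toxic content; in B2T the conditioning is benign, so the image perturbation alone must push the model toward the toxic target.

```python
# Hedged sketch of the two optimization settings described in the abstract.
# "lvlm_nll" and the toy stand-in model are assumptions for illustration only;
# they are NOT the authors' implementation or a real LVLM API.
import torch

torch.manual_seed(0)

# Toy stand-in for an LVLM (assumption): maps (image, token ids) to a loss.
# A real attack would instead compute the next-token cross-entropy of an
# actual vision-language model on the target continuation.
VOCAB, DIM = 100, 32
img_proj = torch.nn.Linear(3 * 8 * 8, DIM)
tok_emb = torch.nn.Embedding(VOCAB, DIM)
lm_head = torch.nn.Linear(DIM, VOCAB)

def lvlm_nll(image: torch.Tensor, prompt_ids: torch.Tensor,
             target_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of target_ids given image and prompt (toy)."""
    ctx = img_proj(image.flatten()) + tok_emb(prompt_ids).mean(dim=0)
    logits = lm_head(ctx).unsqueeze(0).expand(len(target_ids), -1)
    return torch.nn.functional.cross_entropy(logits, target_ids)

def optimize_image(prompt_ids, target_ids, eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD-style image optimization under an L-infinity budget (illustrative)."""
    clean = torch.rand(3, 8, 8)
    delta = torch.zeros_like(clean, requires_grad=True)
    for _ in range(steps):
        loss = lvlm_nll(clean + delta, prompt_ids, target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # minimize NLL of the target
            delta.clamp_(-eps, eps)              # stay inside the perturbation budget
            delta.grad.zero_()
    return (clean + delta).detach()

# Toxic-Continuation (prior setting): the text conditioning is itself toxic,
# and the image is tuned to make the model continue it.
toxic_prompt = torch.randint(0, VOCAB, (6,))
toxic_target = torch.randint(0, VOCAB, (4,))
adv_tc = optimize_image(toxic_prompt, toxic_target)

# Benign-to-Toxic (B2T, this paper's setting): the conditioning is benign, so
# the adversarial image alone must steer the model toward the toxic target.
benign_prompt = torch.randint(0, VOCAB, (6,))
adv_b2t = optimize_image(benign_prompt, toxic_target)
```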

@article{kim2025_2505.21556,
  title={Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts},
  author={Hee-Seon Kim and Minbeom Kim and Wonjun Lee and Kihyun Kim and Changick Kim},
  journal={arXiv preprint arXiv:2505.21556},
  year={2025}
}