Malicious or manipulated prompts are known to exploit text-to-image models to generate unsafe images. Existing studies, however, focus on the passive exploitation of such harmful capabilities. In this paper, we investigate the proactive generation of unsafe images from benign prompts (e.g., a photo of a cat) through maliciously modified text-to-image models. Our preliminary investigation demonstrates that poisoning attacks are a viable method to achieve this goal, but uncovers significant side effects: the attack's impact unintentionally spreads to non-targeted prompts, compromising its stealthiness. Root cause analysis identifies conceptual similarity as an important contributing factor to these side effects. To address this, we propose a stealthy poisoning attack method that balances covertness and performance. Our findings highlight the potential risks of adopting text-to-image models in real-world scenarios, thereby calling for future research and safety measures in this space.
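The abstract attributes the side effects to conceptual similarity between the targeted benign prompt and other prompts. Below is a minimal sketch of one way such similarity could be quantified, using cosine similarity between CLIP text embeddings; the choice of CLIP, the example prompts, and the embed helper are illustrative assumptions, not the paper's actual method.

import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Assumed embedding model; the paper's abstract does not specify one.
MODEL = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL)

def embed(prompts):
    """Return L2-normalized CLIP text embeddings for a list of prompts."""
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        embeds = text_encoder(**inputs).text_embeds
    return embeds / embeds.norm(dim=-1, keepdim=True)

# Hypothetical targeted prompt vs. conceptually related and unrelated
# non-targeted prompts.
target = embed(["a photo of a cat"])
others = embed(["a photo of a kitten", "a photo of a dog", "a photo of a car"])

# Cosine similarity of each non-targeted prompt to the target; under the
# paper's root-cause finding, higher similarity would suggest the poisoning
# effect is more likely to spread to that benign prompt.
print((target @ others.T).squeeze(0).tolist())

Under this sketch, semantically close prompts (e.g., "a photo of a kitten") score higher than unrelated ones (e.g., "a photo of a car"), matching the intuition that conceptually similar prompts are the ones at risk of unintended spread.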
@article{wu2025_2310.16613,
  title   = {On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts},
  author  = {Yixin Wu and Ning Yu and Michael Backes and Yun Shen and Yang Zhang},
  journal = {arXiv preprint arXiv:2310.16613},
  year    = {2025}
}