Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

Abstract

Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs. In SC-DPO, we introduce a safety cost model to accurately quantify the harmfulness of images, and train it effectively using the proposed contrastive learning and cost anchoring objectives. To apply SC-DPO to T2I safety alignment, we construct SCP-10K, a safety-constrained preference dataset covering a rich set of harmful concepts, which blends safety-constrained preference pairs under both harmful and clean instructions to further mitigate the trade-off between safety and sample quality. We also propose a Dynamic Focusing Mechanism (DFM) for SC-DPO that steers learning toward difficult preference pairs. Extensive experiments demonstrate that SC-DPO outperforms existing methods, effectively defending against various NSFW content while maintaining optimal sample quality and human preference alignment. Moreover, SC-DPO exhibits resilience against adversarial prompts designed to generate harmful content.
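
The abstract does not give the exact training objective, so the snippet below is only a minimal illustrative sketch of how a safety-constrained DPO loss could combine the standard DPO preference term with a penalty derived from a learned safety cost model. The function name, the form of the penalty, and the hyperparameters `beta` and `lam` are assumptions for illustration, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def sc_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                cost_w, cost_l, beta=0.1, lam=1.0):
    """Illustrative safety-constrained DPO loss (hypothetical form).

    logp_w, logp_l         : policy log-probs of preferred / dispreferred samples
    ref_logp_w, ref_logp_l : frozen reference-model log-probs of the same samples
    cost_w, cost_l         : safety-cost scores from a cost model (higher = more harmful)
    beta                   : DPO temperature (assumed hyperparameter)
    lam                    : weight of the safety-cost penalty (assumed hyperparameter)
    """
    # Standard DPO preference term: log-sigmoid of the implicit reward margin
    # between the preferred and dispreferred samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref_loss = -F.logsigmoid(margin)

    # Safety-constraint term: penalize cases where the preferred sample is
    # scored as more harmful than the dispreferred one (illustrative choice).
    cost_penalty = torch.relu(cost_w - cost_l)

    return (pref_loss + lam * cost_penalty).mean()
```

In this sketch the cost scores would come from the safety cost model the abstract describes; a weighting or focusing scheme over hard preference pairs (as in the proposed DFM) could be layered on top by reweighting the per-pair losses before averaging.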

@article{ruan2025_2504.14290,
  title={Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization},
  author={Shouwei Ruan and Zhenyu Wu and Yao Huang and Ruochen Zhang and Yitong Sun and Caixin Kang and Xingxing Wei},
  journal={arXiv preprint arXiv:2504.14290},
  year={2025}
}