VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search

10 October 2025

MingSheng Li

Guangze Zhao

Sichen Liu


Main: 4 Pages

1 Figure

Bibliography: 2 Pages

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal perception and generation, yet their safety alignment remains a critical challenge. Existing defenses are vulnerable to multimodal jailbreaks, as visual inputs introduce new attack surfaces, reasoning chains lack safety supervision, and alignment often degrades under modality fusion. To overcome these limitations, we propose VisuoAlign, a framework for multimodal safety alignment via prompt-guided tree search. VisuoAlign embeds safety constraints into the reasoning process through visual-textual interactive prompts, employs Monte Carlo Tree Search (MCTS) to systematically construct diverse safety-critical prompt trajectories, and introduces prompt-based scaling to ensure real-time risk detection and compliant responses. Extensive experiments demonstrate that VisuoAlign proactively exposes risks, enables comprehensive dataset generation, and significantly improves the robustness of LVLMs against complex cross-modal threats.
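
For readers unfamiliar with MCTS, the following is a minimal, generic sketch of how tree search with the standard UCT rule could explore sequences of prompt edits and concentrate on high-risk ones. It is not the authors' implementation: the action names in `ACTIONS`, the depth limit `MAX_DEPTH`, and the toy `risk_score` function are all hypothetical stand-ins, since the abstract does not specify the action space or the safety judge.

```python
import math
import random

# Hypothetical prompt edits; the abstract does not give the real action space.
ACTIONS = ["add_image_caption", "rephrase_request", "inject_role_play"]
MAX_DEPTH = 3  # assumed trajectory length for this toy example


def risk_score(trajectory):
    """Stand-in for a safety judge: returns a risk value in [0, 1]."""
    # Toy rule (assumption): role-play edits are treated as the risky ones.
    return trajectory.count("inject_role_play") / MAX_DEPTH


class Node:
    def __init__(self, trajectory, parent=None):
        self.trajectory = trajectory  # tuple of actions taken so far
        self.parent = parent
        self.children = {}            # action -> Node
        self.visits = 0
        self.total = 0.0              # sum of rollout rewards

    def uct_child(self, c=1.4):
        # Standard UCT: mean reward plus an exploration bonus.
        return max(
            self.children.values(),
            key=lambda n: n.total / n.visits
            + c * math.sqrt(math.log(self.visits) / n.visits),
        )


def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while len(node.trajectory) < MAX_DEPTH and len(node.children) == len(ACTIONS):
            node = node.uct_child()
        # Expansion: add one untried action below a non-terminal node.
        if len(node.trajectory) < MAX_DEPTH:
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.trajectory + (action,), parent=node)
            node.children[action] = child
            node = child
        # Simulation: complete the trajectory with random actions.
        rollout = list(node.trajectory)
        while len(rollout) < MAX_DEPTH:
            rollout.append(rng.choice(ACTIONS))
        reward = risk_score(rollout)
        # Backpropagation: update statistics from the leaf up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return root


def most_visited_trajectory(root):
    # Follow the most-visited child at each level until reaching a leaf.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.visits)
    return node.trajectory
```

Calling `most_visited_trajectory(search())` returns the edit sequence the search visited most often. In a real pipeline the toy `risk_score` would presumably be replaced by a query to the target LVLM and a safety evaluator, which is where most of the cost would lie.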
