SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu
Tiancheng He
Huahui Yi
Kun Wang
Xi Chen
Gongli Xi
Qiankun Li
Kang Li
Yang Liu
Zhigang Zeng
Main: 8 pages · 6 figures · 8 tables · Bibliography: 4 pages · Appendix: 10 pages
Abstract

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception → Reasoning → Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT → DPO → GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 → 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 → 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 → 59.21; 7B: 66.39 → 66.81). Code is available at this https URL.
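The protocol described above can be sketched in miniature: a tool trace is a sequence of typed key-value calls, and a constrained transition graph makes the trace checkable before the final answer is emitted. The sketch below is illustrative only; all names (`TRANSITIONS`, `is_valid_trace`, the tool names) are hypothetical assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a Perception -> Reasoning -> Decision protocol check.
# Allowed stage transitions (an assumed graph, not the paper's exact one):
TRANSITIONS = {
    "START": {"perception"},
    "perception": {"perception", "reasoning"},
    "reasoning": {"reasoning", "decision"},
    "decision": {"END"},
}

def is_valid_trace(trace):
    """Return True iff the typed key-value tool trace follows the graph."""
    state = "START"
    for call in trace:
        stage = call["stage"]
        if stage not in TRANSITIONS.get(state, set()):
            return False  # illegal transition, e.g. decision before perception
        state = stage
    return "END" in TRANSITIONS.get(state, set())

# Example trace: perceive visual evidence, reason about intent, then decide.
trace = [
    {"stage": "perception", "tool": "describe_image", "output": "a locked door"},
    {"stage": "reasoning", "tool": "assess_intent", "output": "benign how-to"},
    {"stage": "decision", "tool": "final_verdict", "output": "answer_helpfully"},
]
print(is_valid_trace(trace))  # expect True
```

Because the trace is structured rather than free-form text, a checker like this can reject responses that skip perception or decide before reasoning, which is the kind of tool-level supervision the GRPO stage can reward.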
