SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

27 May 2025

Abstract

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically either focus on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or use prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety in knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models on SOSBench within a unified evaluation framework. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for DeepSeek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
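
The harmful-response rates quoted above are proportions of benchmark prompts whose responses were judged policy-violating. The abstract does not describe how the judgments are produced or aggregated, so the following is only a minimal sketch under assumptions: the function name `harmful_rates`, the `(domain, is_harmful)` input format, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# The six domains named in the abstract.
DOMAINS = ("chemistry", "biology", "medicine",
           "pharmacology", "physics", "psychology")


def harmful_rates(judgments):
    """Aggregate per-prompt judgments into harmful-response rates.

    `judgments` is an iterable of (domain, is_harmful) pairs, one per
    benchmark prompt. This input format is an assumption made for
    illustration; the paper's actual judging step is not shown here.
    Returns (per_domain_rates, overall_rate).
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for domain, is_harmful in judgments:
        totals[domain] += 1
        harmful[domain] += int(is_harmful)

    per_domain = {d: harmful[d] / totals[d] for d in totals}
    n = sum(totals.values())
    overall = sum(harmful.values()) / n if n else 0.0
    return per_domain, overall


if __name__ == "__main__":
    # Toy judgments for illustration only; these are not benchmark results.
    toy = [
        ("chemistry", True),
        ("chemistry", False),
        ("biology", True),
        ("physics", False),
    ]
    per_domain, overall = harmful_rates(toy)
    for domain in DOMAINS:
        if domain in per_domain:
            print(f"{domain}: {per_domain[domain]:.1%}")
    print(f"overall: {overall:.1%}")
```

Reporting the per-domain rates alongside the overall rate keeps a weak domain from being hidden by the aggregate, which matters when a benchmark claims coverage across several domains.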

@article{jiang2025_2505.21605,
  title={SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge},
  author={Fengqing Jiang and Fengbo Ma and Zhangchen Xu and Yuetai Li and Bhaskar Ramasubramanian and Luyao Niu and Bo Li and Xianyan Chen and Zhen Xiang and Radha Poovendran},
  journal={arXiv preprint arXiv:2505.21605},
  year={2025}
}
