PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematical and logical reasoning. However, current evaluations overlook physics-based reasoning, a complex task that requires applying physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, which incorporates an efficient answer-level evaluation and a comprehensive step-level evaluation. Top-performing models such as Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identify four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.

@article{zhang2025_2502.12054,
  title={PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning},
  author={Xinyu Zhang and Yuxuan Dong and Yanrui Wu and Jiaxing Huang and Chengyou Jia and Basura Fernando and Mike Zheng Shou and Lingling Zhang and Jun Liu},
  journal={arXiv preprint arXiv:2502.12054},
  year={2025}
}