Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence's performance on reasoning tasks, with particularly remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, strategic reasoning, a critical component of advanced human cognition that covers the ability to assess multi-agent behaviors in dynamic environments, formulate action plans, and adapt strategies, has yet to be systematically evaluated or modeled. To address this gap, this paper introduces WGSR-Bench, the first strategic reasoning benchmark for LLMs that uses wargames as its evaluation environment. Wargames, as quintessential high-complexity strategic scenarios, integrate environmental uncertainty, adversarial dynamics, and non-unique strategic choices, making them an effective testbed for assessing LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning. WGSR-Bench designs test samples around three core tasks, i.e., Environmental situation awareness, Opponent risk modeling, and Policy generation, which together form the core S-POE architecture, to systematically assess the main abilities of strategic reasoning. Finally, an LLM-based wargame agent is designed to integrate these parts for a comprehensive strategic reasoning assessment. With WGSR-Bench, we hope to assess the strengths and limitations of state-of-the-art LLMs in game-theoretic strategic reasoning and to advance research in large model-driven strategic intelligence.

@article{yin2025_2506.10264,
  title={WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models},
  author={Qiyue Yin and Pei Xu and Qiaozhe Li and Shengda Liu and Shengqi Shen and Tong Wang and Yihong Han and Xiaonan Zhao and Likun Yang and Shiyue Cao and Shiyu Qiu and Yuxuan Liu and Shizhao Yu and Lei Cui and Chengxin Yan and Jie Sun and Xiangquan Tang and Kaiqi Huang},
  journal={arXiv preprint arXiv:2506.10264},
  year={2025}
}
