
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen
Yajie Yang
Zhiheng Xi
Binze Hu
Huayu Sha
Jiazheng Zhang
Qiyuan Peng
Junlin Shang
Jixuan Huang
Yutao Fan
Jingqi Tong
Shihan Dou
Ming Zhang
Lei Bai
Zhenfei Yin
Tao Gui
Xingjun Ma
Qi Zhang
Xuanjing Huang
Yu-Gang Jiang
Main: 11 pages · 11 figures · 10 tables · Bibliography: 4 pages · Appendix: 18 pages
Abstract

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
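Below is a minimal, hypothetical sketch of the idea the abstract attributes to SciForge: treating the tool action space as a dependency graph and sampling tool-call sequences that respect those dependencies. It is not the authors' implementation; the tool names, fields, and sampling strategy are illustrative assumptions only.

```python
# Minimal sketch (NOT the paper's implementation): model a tool action space as a
# dependency graph and sample a "logic-aware" trajectory by only ever calling tools
# whose prerequisites have already been executed. All tool names are hypothetical.
import random
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    requires: set = field(default_factory=set)  # names of tools that must run first


# Hypothetical chemistry-flavoured action space; edges encode "must run before".
TOOLS = {
    "load_structure":    Tool("load_structure"),
    "optimize_geometry": Tool("optimize_geometry", {"load_structure"}),
    "compute_energy":    Tool("compute_energy", {"optimize_geometry"}),
    "plot_spectrum":     Tool("plot_spectrum", {"compute_energy"}),
}


def sample_trajectory(tools, length):
    """Sample a tool-call sequence that respects the dependency graph
    (i.e. a prefix of some topological order)."""
    done, trajectory = set(), []
    while len(trajectory) < length:
        ready = [t for t in tools.values()
                 if t.name not in done and t.requires <= done]
        if not ready:
            break  # no executable tool remains; trajectory ends early
        step = random.choice(ready)
        trajectory.append(step.name)
        done.add(step.name)
    return trajectory


if __name__ == "__main__":
    print(sample_trajectory(TOOLS, length=4))
    # e.g. ['load_structure', 'optimize_geometry', 'compute_energy', 'plot_spectrum']
```

Trajectories produced this way are valid multi-step workflows by construction, which is one plausible reading of how dependency-aware synthesis could yield training data for long-horizon tool use; the paper's actual procedure may differ.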
