
DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures

Main:12 Pages
6 Figures
Bibliography:2 Pages
27 Tables
Appendix:30 Pages
Abstract

Large language models (LLMs) are increasingly deployed for real-world tasks that fundamentally involve data manipulation. A core requirement across these tasks is the ability to perform structural reasoning, that is, to understand and reason about data relationships. For example, customer requests must be handled in temporal order, which can be represented by a data structure such as a queue. However, existing benchmarks primarily focus on high-level, application-driven evaluations without isolating this fundamental capability. To address this gap, we introduce DSR-Bench, a novel benchmark that evaluates LLMs' structural reasoning capabilities through data structures, which provide interpretable representations of data relationships. DSR-Bench includes 20 data structures, 35 operations, and 4,140 problem instances, organized hierarchically for fine-grained analysis of reasoning limitations. Our evaluation pipeline is fully automated and deterministic, eliminating subjective human or model-based judgments. Its synthetic nature also ensures scalability and minimizes data contamination risks. We benchmark nine state-of-the-art LLMs. Our analysis shows that instruction-tuned models struggle with basic multi-attribute and multi-hop reasoning. Furthermore, while reasoning-oriented models perform better, they remain fragile on complex and hybrid structures, with the best model achieving an average score of only 47% on the challenge subset. Crucially, models often perform poorly on multi-dimensional data and natural language task descriptions, highlighting a critical gap for real-world deployment.
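
To make the queue example and the idea of deterministic, synthetic evaluation concrete, the following is a minimal sketch and not the benchmark's actual generator or prompt format. The function names `make_queue_instance` and `score`, the operation mix, and the prompt wording are all assumptions introduced for illustration only.

```python
import random
from collections import deque


def make_queue_instance(seed, num_ops=8):
    """Hypothetical generator: build a queue task and its ground-truth final state."""
    rng = random.Random(seed)  # seeded, so the same seed always yields the same task
    queue = deque()
    ops = []
    for _ in range(num_ops):
        # Dequeue only when the queue is non-empty; otherwise enqueue a new value.
        if queue and rng.random() < 0.4:
            queue.popleft()
            ops.append("dequeue")
        else:
            value = rng.randint(0, 99)
            queue.append(value)
            ops.append(f"enqueue {value}")
    return ops, list(queue)


def score(model_answer, expected):
    """Exact-match scoring: 1 if the parsed answer equals the ground truth, else 0."""
    return int(model_answer == expected)


ops, expected = make_queue_instance(seed=0)
# Assumed prompt wording; the real benchmark's prompts may differ.
prompt = (
    "Start with an empty queue. Apply these operations in order: "
    + ", ".join(ops)
    + ". Report the final queue, front first."
)
```

Because the ground truth is computed by running the operations on a real queue, scoring needs no human or model-based judge, and fresh instances can be produced from new seeds.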

@article{he2025_2505.24069,
  title={DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures},
  author={Yu He and Yingxi Li and Colin White and Ellen Vitercik},
  journal={arXiv preprint arXiv:2505.24069},
  year={2025}
}