AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

24 February 2025
Qin Zhu
Fei Huang
Runyu Peng
Keming Lu
Bowen Yu
Qinyuan Cheng
Xipeng Qiu
Xuanjing Huang
Junyang Lin
    ReLM
    ELM
    LRM
Abstract

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
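To make the abstract's idea of "program-based verification" concrete, the sketch below shows what a programmatic checker for an open-ended logic puzzle answer might look like. The puzzle, the names, and the answer format are invented for illustration and are not taken from the AutoLogi codebase; the point is only that a candidate answer is parsed and checked against explicit constraints rather than matched against multiple-choice options.

from itertools import permutations

def verify(arrangement):
    """Check a candidate seating order against the puzzle's constraints.

    Invented example puzzle: Alice, Bob, and Carol sit in a row such that
    (1) Alice is not in the middle seat, and (2) Bob sits to the left of Carol.
    """
    pos = {name: i for i, name in enumerate(arrangement)}
    return pos["Alice"] != 1 and pos["Bob"] < pos["Carol"]

def score_model_answer(answer):
    """Return 1 if a model's free-form answer parses into a valid arrangement."""
    names = [tok.strip() for tok in answer.split(",")]
    if sorted(names) != ["Alice", "Bob", "Carol"]:
        return 0  # malformed answer cannot score by guessing
    return int(verify(names))

if __name__ == "__main__":
    # Enumerate all arrangements to show the verifier separates valid from invalid.
    for candidate in permutations(["Alice", "Bob", "Carol"]):
        print(candidate, verify(list(candidate)))
    print(score_model_answer("Bob, Alice, Carol"))  # 0: Alice is in the middle
    print(score_model_answer("Alice, Bob, Carol"))  # 1: satisfies both constraints

The same kind of verifier could, in principle, be plugged into a rejection-sampling loop for training data, keeping only model generations the program accepts, which is the use the abstract describes for systematically improving reasoning ability.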

@article{zhu2025_2502.16906,
  title={AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models},
  author={Qin Zhu and Fei Huang and Runyu Peng and Keming Lu and Bowen Yu and Qinyuan Cheng and Xipeng Qiu and Xuanjing Huang and Junyang Lin},
  journal={arXiv preprint arXiv:2502.16906},
  year={2025}
}