Measuring General Intelligence with Generated Games

12 May 2025
Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be produced at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: we prompt a model with the game description, the current board state, and a list of valid moves, after which the model outputs the move it wishes to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, the data generation process, and the evaluation code to support future modeling work and expansion of our benchmark.
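To make the evaluation protocol concrete, below is a minimal sketch of the win-rate loop the abstract describes, in Python. It assumes a generated Gym-style environment plus three hypothetical pieces that are not in the paper: an env.valid_moves() accessor for legal actions, a query_llm callable wrapping the model API, and an rl_agent object holding the frozen self-play opponent. Names, prompt format, and invalid-move handling are illustrative assumptions, not the authors' released code.

import random

def evaluate_episode(env, query_llm, rl_agent, game_description):
    """Play one game: the LLM moves first, the frozen RL agent replies.

    Returns True if the LLM wins. Assumes a Gym-style 5-tuple step API and
    a hypothetical env.valid_moves() listing the legal actions.
    """
    obs, info = env.reset()
    llm_turn = True
    while True:
        valid_moves = env.valid_moves()
        if llm_turn:
            # Prompt with game rules, current board state, and legal moves,
            # as described in the abstract; the model replies with a move.
            prompt = (
                f"{game_description}\n\nBoard state:\n{obs}\n"
                f"Valid moves: {valid_moves}\nYour move:"
            )
            action = query_llm(prompt)
            if action not in valid_moves:
                # Assumption: fall back to a random legal move on a bad reply.
                action = random.choice(valid_moves)
        else:
            action = rl_agent.act(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            # Assumed convention: positive reward goes to the player who
            # just moved, so the LLM wins iff it moved last with reward > 0.
            return llm_turn and reward > 0
        llm_turn = not llm_turn

def win_rate(env, query_llm, rl_agent, game_description, n_games=100):
    wins = sum(
        evaluate_episode(env, query_llm, rl_agent, game_description)
        for _ in range(n_games)
    )
    return wins / n_games

Reported scores would then average this win rate across the generated games; the exact turn order and reward conventions are assumptions here, since the paper's precise protocol lives in its released evaluation code.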

@article{verma2025_2505.07215,
  title={Measuring General Intelligence with Generated Games},
  author={Vivek Verma and David Huang and William Chen and Dan Klein and Nicholas Tomlin},
  journal={arXiv preprint arXiv:2505.07215},
  year={2025}
}