Evaluating Gemini in an arena for learning

Main: 10 Pages

1 Figure

Bibliography: 4 Pages

4 Tables

Appendix: 12 Pages

Abstract

Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" in which educators and pedagogy experts conducted blind, head-to-head, multi-turn comparisons of leading AI models. In particular, $N = 189$ educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which $N = 206$ experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of its match-ups, ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.

@article{google2025_2505.24477,
  title={Evaluating Gemini in an arena for learning},
  author={LearnLM Team Google and Abhinit Modi and Aditya Srikanth Veerubhotla and Aliya Rysbek and Andrea Huber and Ankit Anand and Avishkar Bhoopchand and Brett Wiltshire and Daniel Gillick and Daniel Kasenberg and Eleni Sgouritsa and Gal Elidan and Hengrui Liu and Holger Winnemoeller and Irina Jurenka and James Cohan and Jennifer She and Julia Wilkowski and Kaiz Alarakyia and Kevin R. McKee and Komal Singh and Lisa Wang and Markus Kunesch and Miruna Pîslar and Niv Efron and Parsa Mahmoudieh and Pierre-Alexandre Kamienny and Sara Wiltberger and Shakir Mohamed and Shashank Agarwal and Shubham Milind Phal and Sun Jae Lee and Theofilos Strinopoulos and Wei-Jen Ko and Yael Gold-Zamir and Yael Haramaty and Yannis Assael},
  journal={arXiv preprint arXiv:2505.24477},
  year={2025}
}
