
ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Abstract

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs' capabilities in conducting academic surveys, a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not redistribute copyrighted materials; instead, we provide code to construct the environment from the Semantic Scholar Open Research Corpus (S2ORC). Preliminary evaluations reveal that LLM-based approaches underperform simpler keyword-based retrieval methods, underscoring significant opportunities for advancing LLMs in autonomous research.
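
To make the comparison point concrete, below is a minimal sketch of the kind of keyword-based retrieval baseline the abstract contrasts with LLM agents, applied to the information-discovery stage. The corpus contents, field layout, and the discover() helper are illustrative assumptions, not the paper's released code; ResearchArena's actual environment is built from S2ORC using the authors' tooling.

# Minimal sketch of a BM25 keyword-retrieval baseline for stage (1),
# information discovery. Requires the rank_bm25 package.
from rank_bm25 import BM25Okapi

# Toy stand-in for the 12M-paper corpus: (paper_id, title + abstract) pairs.
corpus = [
    ("p1", "dense retrieval for open-domain question answering"),
    ("p2", "survey of large language model agents for scientific discovery"),
    ("p3", "benchmarking retrieval-augmented generation pipelines"),
]

tokenized = [text.split() for _, text in corpus]
bm25 = BM25Okapi(tokenized)

def discover(query: str, k: int = 2):
    """Rank papers by BM25 score and return the top-k (paper_id, score) pairs."""
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(corpus, scores), key=lambda x: x[1], reverse=True)
    return [(pid, score) for (pid, _), score in ranked[:k]]

print(discover("large language model research agents"))

A baseline of this shape needs no model calls at query time, which is part of why it is a strong reference point: the abstract's finding is that current LLM-based agents fail to beat even this simple lexical ranking on the discovery stage.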

@article{kang2025_2406.10291,
  title={ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents},
  author={Hao Kang and Chenyan Xiong},
  journal={arXiv preprint arXiv:2406.10291},
  year={2025}
}