DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

25 May 2025
Joao Coelho
Jingjie Ning
Jingyuan He
Kangrui Mao
Abhijay Paladugu
Pranav Setlur
Jiahe Jin
Jamie Callan
João Magalhães
Bruno Martins
Chenyan Xiong
arXiv:2505.19253

Papers citing "DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research"

21 / 21 papers shown
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Nandan Thakur
Ronak Pradeep
Shivani Upadhyay
Daniel Fernando Campos
Nick Craswell
Jimmy Lin
ELM
97
3
0
21 Apr 2025
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Salaheddin Alzubi
Creston Brooks
Purva Chiniya
Edoardo Contente
Chiara von Gerlach
...
Arda Kaz
Windsor Nguyen
Sewoong Oh
Himanshu Tyagi
Pramod Viswanath
VLM, ELM, LRM
152
12
0
26 Mar 2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin
Hansi Zeng
Zhenrui Yue
Jinsung Yoon
Sercan O. Arik
Dong Wang
Hamed Zamani
Jiawei Han
RALM, ReLM, KELM, OffRL, AI4TS, LRM
200
120
0
12 Mar 2025
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song
Jinhao Jiang
Yingqian Min
Jie Chen
Zhongfu Chen
Wayne Xin Zhao
Lei Fang
Ji-Rong Wen
AI4TS, LRM, KELM
168
43
0
07 Mar 2025
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Junde Wu
Jiayuan Zhu
Yuyuan Liu
LRM
99
25
0
07 Feb 2025
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Satyapriya Krishna
Kalpesh Krishna
Anhad Mohananey
Steven Schwarcz
Adam Stambler
Shyam Upadhyay
Manaal Faruqui
ReLM, 3DV, LRM, RALM
81
30
0
28 Jan 2025
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
Yujia Zhou
Yan Liu
Xiaoxi Li
Jiajie Jin
Hongjin Qian
Zheng Liu
Chaozhuo Li
Zhicheng Dou
Tsung-Yi Ho
Philip S. Yu
3DV, RALM
83
39
0
16 Sep 2024
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Dongyu Ru
Lin Qiu
Xiangkun Hu
Tianhang Zhang
Peng Shi
...
Tong He
Zhiguo Wang
Pengfei Liu
Yue Zhang
Zheng Zhang
95
20
0
15 Aug 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlíček
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
117
259
0
25 Jun 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu
Yuge Tu
Xu Han
Chaoqun He
Ganqu Cui
...
Chaochao Jia
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
MoE
101
342
0
09 Apr 2024
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Parishad BehnamGhader
Vaibhav Adlakha
Marius Mosbach
Dzmitry Bahdanau
Nicolas Chapados
Siva Reddy
94
238
0
09 Apr 2024
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Corby Rosset
Ho-Lam Chung
Guanghui Qin
Ethan C. Chau
Zhuo Feng
Ahmed Hassan Awadallah
Jennifer Neville
Nikhil Rao
57
14
0
27 Feb 2024
GAIA: a benchmark for General AI Assistants
Grégoire Mialon
Clémentine Fourrier
Craig Swift
Thomas Wolf
Yann LeCun
Thomas Scialom
AI4MH, ALM, ELM, RALM
81
182
0
21 Nov 2023
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon
Omar Khattab
Christopher Potts
Matei A. Zaharia
RALM
75
118
0
16 Nov 2023
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Swarnadeep Saha
Omer Levy
Asli Celikyilmaz
Mohit Bansal
Jason Weston
Xian Li
MoMe
68
77
0
23 Oct 2023
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
ELM
86
103
0
29 May 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min
Kalpesh Krishna
Xinxi Lyu
M. Lewis
Wen-tau Yih
Pang Wei Koh
Mohit Iyyer
Luke Zettlemoyer
Hannaneh Hajishirzi
HILM, ALM
132
693
0
23 May 2023
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
Lei Wang
Wanyu Xu
Yihuai Lan
Zhiqiang Hu
Yunshi Lan
Roy Ka-wei Lee
Ee-Peng Lim
ReLM, LRM
110
353
0
06 May 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM, ALM, LM&MA
176
1,205
0
29 Mar 2023
SGPT: GPT Sentence Embeddings for Semantic Search
Niklas Muennighoff
RALM
155
187
0
17 Feb 2022
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rücklé
Abhishek Srivastava
Iryna Gurevych
VLM
425
1,050
0
17 Apr 2021