2505.19253
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
25 May 2025
Joao Coelho
Jingjie Ning
Jingyuan He
Kangrui Mao
Abhijay Paladugu
Pranav Setlur
Jiahe Jin
Jamie Callan
João Magalhães
Bruno Martins
Chenyan Xiong
Papers citing
"DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research"
21 / 21 papers shown
Title
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Nandan Thakur
Ronak Pradeep
Shivani Upadhyay
Daniel Fernando Campos
Nick Craswell
Jimmy Lin
ELM
97
3
0
21 Apr 2025
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Salaheddin Alzubi
Creston Brooks
Purva Chiniya
Edoardo Contente
Chiara von Gerlach
...
Arda Kaz
Windsor Nguyen
Sewoong Oh
Himanshu Tyagi
Pramod Viswanath
VLM
ELM
LRM
152
12
0
26 Mar 2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin
Hansi Zeng
Zhenrui Yue
Jinsung Yoon
Sercan O. Arik
Dong Wang
Hamed Zamani
Jiawei Han
RALM
ReLM
KELM
OffRL
AI4TS
LRM
200
120
0
12 Mar 2025
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song
Jinhao Jiang
Yingqian Min
Jie Chen
Zhongfu Chen
Wayne Xin Zhao
Lei Fang
Ji-Rong Wen
AI4TS
LRM
KELM
168
43
0
07 Mar 2025
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Junde Wu
Jiayuan Zhu
Yuyuan Liu
LRM
99
25
0
07 Feb 2025
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Satyapriya Krishna
Kalpesh Krishna
Anhad Mohananey
Steven Schwarcz
Adam Stambler
Shyam Upadhyay
Manaal Faruqui
ReLM
3DV
LRM
RALM
81
30
0
28 Jan 2025
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
Yujia Zhou
Yan Liu
Xiaoxi Li
Jiajie Jin
Hongjin Qian
Zheng Liu
Chaozhuo Li
Zhicheng Dou
Tsung-Yi Ho
Philip S. Yu
3DV
RALM
83
39
0
16 Sep 2024
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Dongyu Ru
Lin Qiu
Xiangkun Hu
Tianhang Zhang
Peng Shi
...
Tong He
Zhiguo Wang
Pengfei Liu
Yue Zhang
Zheng Zhang
95
20
0
15 Aug 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlíček
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
117
259
0
25 Jun 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu
Yuge Tu
Xu Han
Chaoqun He
Ganqu Cui
...
Chaochao Jia
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
MoE
101
342
0
09 Apr 2024
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Parishad BehnamGhader
Vaibhav Adlakha
Marius Mosbach
Dzmitry Bahdanau
Nicolas Chapados
Siva Reddy
94
238
0
09 Apr 2024
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Corby Rosset
Ho-Lam Chung
Guanghui Qin
Ethan C. Chau
Zhuo Feng
Ahmed Hassan Awadallah
Jennifer Neville
Nikhil Rao
57
14
0
27 Feb 2024
GAIA: a benchmark for General AI Assistants
Grégoire Mialon
Clémentine Fourrier
Craig Swift
Thomas Wolf
Yann LeCun
Thomas Scialom
AI4MH
ALM
ELM
RALM
81
182
0
21 Nov 2023
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon
Omar Khattab
Christopher Potts
Matei A. Zaharia
RALM
75
118
0
16 Nov 2023
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Swarnadeep Saha
Omer Levy
Asli Celikyilmaz
Mohit Bansal
Jason Weston
Xian Li
MoMe
68
77
0
23 Oct 2023
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
ELM
86
103
0
29 May 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min
Kalpesh Krishna
Xinxi Lyu
M. Lewis
Wen-tau Yih
Pang Wei Koh
Mohit Iyyer
Luke Zettlemoyer
Hannaneh Hajishirzi
HILM
ALM
132
693
0
23 May 2023
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
Lei Wang
Wanyu Xu
Yihuai Lan
Zhiqiang Hu
Yunshi Lan
Roy Ka-wei Lee
Ee-Peng Lim
ReLM
LRM
110
353
0
06 May 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
176
1,205
0
29 Mar 2023
SGPT: GPT Sentence Embeddings for Semantic Search
Niklas Muennighoff
RALM
155
187
0
17 Feb 2022
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rucklé
Abhishek Srivastava
Iryna Gurevych
VLM
425
1,050
0
17 Apr 2021