MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work such as AI Scientist, which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: this https URL
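To make the headline number concrete, the sketch below shows one plausible way a "gap closed" figure like the 9.3% could be computed; the exact formula and score scales used by MLRC-Bench are not specified in this abstract, so the function name, signature, and example values are illustrative assumptions only.

```python
# Hypothetical sketch of a "gap closed" metric; not the paper's actual evaluation code.

def gap_closed(agent_score: float, baseline_score: float, human_score: float) -> float:
    """Fraction of the baseline-to-top-human gap closed by the agent.

    Returns 0.0 when the agent matches the baseline and 1.0 when it matches
    the top human participant. Assumes higher scores are better.
    """
    gap = human_score - baseline_score
    if gap == 0:
        raise ValueError("Baseline and top-human scores are identical; gap is undefined.")
    return (agent_score - baseline_score) / gap


# Illustrative numbers chosen so the agent closes 9.3% of the gap.
print(f"{gap_closed(agent_score=0.52325, baseline_score=0.50, human_score=0.75):.1%}")  # 9.3%
```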
@article{zhang2025_2504.09702,
  title   = {MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?},
  author  = {Yunxiang Zhang and Muhammad Khalifa and Shitanshu Bhushan and Grant D Murphy and Lajanugen Logeswaran and Jaekyeom Kim and Moontae Lee and Honglak Lee and Lu Wang},
  journal = {arXiv preprint arXiv:2504.09702},
  year    = {2025}
}