Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at:this https URL
View on arXiv@article{chen2025_2505.14963, title={ MedBrowseComp: Benchmarking Medical Deep Research and Computer Use }, author={ Shan Chen and Pedro Moreira and Yuxin Xiao and Sam Schmidgall and Jeremy Warner and Hugo Aerts and Thomas Hartvigsen and Jack Gallifant and Danielle S. Bitterman }, journal={arXiv preprint arXiv:2505.14963}, year={ 2025 } }