
Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK

Main: 15 pages
Appendix: 16 pages
Bibliography: 2 pages
Figures: 18
Tables: 4
Abstract

Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced Llama-3.3-70B-Instruct hallucination rates from 31% to 0.3%, making an open-source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and on HealthBench's realistic multi-turn medical questions. By leveraging hallucination probabilities to guide GPT-4o's refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.
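
The abstract describes using the classifier's hallucination probability to decide when an answer should be refined and when to spend extra compute. The following is a minimal Python sketch of such a detect-then-refine loop under stated assumptions: every name (generate_answer, classify_hallucination, refine_answer) and the threshold value are hypothetical placeholders, not the paper's actual implementation.

def generate_answer(question: str) -> str:
    # Placeholder for a call to the base LLM.
    return "draft answer to: " + question

def classify_hallucination(question: str, answer: str) -> float:
    # Placeholder for the information-theoretic classifier's
    # estimated hallucination probability for this answer.
    return 0.5

def refine_answer(question: str, answer: str, p: float) -> str:
    # Placeholder for a refinement pass with escalated compute,
    # conditioned on the flagged hallucination probability.
    return answer + " (refined)"

THRESHOLD = 0.05  # hypothetical acceptance threshold

def answer_with_check(question: str, max_refinements: int = 3) -> str:
    """Generate an answer, then refine it while the estimated
    hallucination probability exceeds the threshold."""
    answer = generate_answer(question)
    for _ in range(max_refinements):
        p = classify_hallucination(question, answer)
        if p < THRESHOLD:
            return answer                      # confident enough: accept
        answer = refine_answer(question, answer, p)  # escalate compute
    return answer                              # best effort after budget is spent

In this sketch, compute scales with the classifier's output: low-probability answers return immediately, while flagged answers trigger additional refinement passes up to a fixed budget.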

@article{garcia-fernandez2025_2506.11129,
  title={Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK},
  author={Carlos Garcia-Fernandez and Luis Felipe and Monique Shotande and Muntasir Zitu and Aakash Tripathi and Ghulam Rasool and Issam El Naqa and Vivek Rudrapatna and Gilmer Valdes},
  journal={arXiv preprint arXiv:2506.11129},
  year={2025}
}