ResearchTrend.AI
Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

24 March 2025
Nariman Naderi
Seyed Amir Ahmad Safavi-Naini
Thomas Savage
Zahra Atf
Peter Lewis
Girish Nadkarni
Ali Soroush
Abstract

This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.

Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
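
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how a Brier score and an AUROC could be computed from self-reported confidences. It is not the authors' code: the function names and the six-question toy data are hypothetical, and confidences are assumed to be rescaled to the 0-1 range, with 1 marking a correct answer and 0 an incorrect one.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0-1) and the 0/1 outcome.
    Lower is better; always answering 0.5 would score 0.25."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences, correct):
    """Probability that a correct answer received a higher confidence than an
    incorrect one, with ties counted as half. 0.5 means no discrimination."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


# Hypothetical toy data (not from the paper): six questions, three answered correctly.
confidences = [0.9, 0.8, 0.9, 0.7, 0.95, 0.6]
correct = [1, 0, 1, 1, 0, 0]

print(f"Brier score:        {brier_score(confidences, correct):.3f}")
print(f"AUROC:              {auroc(confidences, correct):.3f}")
print(f"Overconfidence gap: {overconfidence_gap(confidences, correct):.3f}")
```

In this toy case the stated confidence is high on every question while only half the answers are correct, so the gap is positive and the AUROC sits close to 0.5. That is the pattern the abstract calls overconfidence: the reported certainty barely separates right answers from wrong ones.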

@article{naderi2025_2503.18562,
  title={Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models},
  author={Nariman Naderi and Seyed Amir Ahmad Safavi-Naini and Thomas Savage and Zahra Atf and Peter Lewis and Girish Nadkarni and Ali Soroush},
  journal={arXiv preprint arXiv:2503.18562},
  year={2025}
}