
Calibrated Confidence Estimation for Tabular Question Answering

Lukas Voss
Main: 9 Pages
10 Figures
Bibliography: 5 Pages
31 Tables
Appendix: 13 Pages
Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
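The sketch below illustrates the Multi-Format Agreement idea as the abstract describes it: serialize the same table losslessly into several text formats, ask the model the identical question once per serialization, and treat the agreement rate among the answers as the confidence score. This is a minimal illustration, not the paper's implementation; the helper names (serialize, mfa_confidence, ask_llm) and the exact answer-normalization step are assumptions.

```python
# Hypothetical sketch of Multi-Format Agreement (MFA): confidence from
# answer agreement across lossless serializations of the same table.
# Function names and normalization are illustrative assumptions.
import json
from collections import Counter
from typing import Callable

def serialize(rows: list[dict], fmt: str) -> str:
    """Render the same table in one of several lossless text formats."""
    cols = list(rows[0].keys())
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        lines = [",".join(cols)] + [",".join(str(r[c]) for c in cols) for r in rows]
        return "\n".join(lines)
    if fmt == "markdown":
        header = "| " + " | ".join(cols) + " |"
        sep = "|" + "|".join("---" for _ in cols) + "|"
        body = ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
        return "\n".join([header, sep] + body)
    if fmt == "html":
        head = "<tr>" + "".join(f"<th>{c}</th>" for c in cols) + "</tr>"
        body = "".join(
            "<tr>" + "".join(f"<td>{r[c]}</td>" for c in cols) + "</tr>" for r in rows
        )
        return f"<table>{head}{body}</table>"
    raise ValueError(f"unknown format: {fmt}")

def mfa_confidence(
    rows: list[dict],
    question: str,
    ask_llm: Callable[[str, str], str],
    formats: tuple[str, ...] = ("markdown", "html", "json", "csv"),
) -> tuple[str, float]:
    """Return the majority answer and the fraction of serializations that agree with it."""
    answers = [ask_llm(serialize(rows, fmt), question) for fmt in formats]
    normalized = [a.strip().lower() for a in answers]
    majority_answer, votes = Counter(normalized).most_common(1)[0]
    return majority_answer, votes / len(formats)
```

Because the table content is identical under every serialization, answer disagreement here reflects the model's sensitivity to surface form rather than genuine ambiguity, which is what makes agreement usable as a confidence signal; with one query per format, four formats also cost less than typical multi-sample baselines, consistent with the cost reduction the abstract reports.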
