
Calibrated Confidence Estimation for Tabular Question Answering

Lukas Voss
Main: 9 Pages
10 Figures
Bibliography: 5 Pages
31 Tables
Appendix: 13 Pages
Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
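The sketch below illustrates the Multi-Format Agreement idea as the abstract describes it: serialize the same table losslessly into several text formats, ask the model the identical question once per serialization, and treat the agreement rate among the answers as the confidence score. This is a minimal illustration, not the paper's implementation; the helper names (serialize, mfa_confidence, ask_llm) and the exact answer-normalization step are assumptions.

```python
# Hypothetical sketch of Multi-Format Agreement (MFA): confidence from
# answer agreement across lossless serializations of the same table.
# Function names and normalization are illustrative assumptions.
import json
from collections import Counter
from typing import Callable

def serialize(rows: list[dict], fmt: str) -> str:
    """Render the same table in one of several lossless text formats."""
    cols = list(rows[0].keys())
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        lines = [",".join(cols)] + [",".join(str(r[c]) for c in cols) for r in rows]
        return "\n".join(lines)
    if fmt == "markdown":
        header = "| " + " | ".join(cols) + " |"
        sep = "|" + "|".join("---" for _ in cols) + "|"
        body = ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
        return "\n".join([header, sep] + body)
    if fmt == "html":
        head = "<tr>" + "".join(f"<th>{c}</th>" for c in cols) + "</tr>"
        body = "".join(
            "<tr>" + "".join(f"<td>{r[c]}</td>" for c in cols) + "</tr>" for r in rows
        )
        return f"<table>{head}{body}</table>"
    raise ValueError(f"unknown format: {fmt}")

def mfa_confidence(
    rows: list[dict],
    question: str,
    ask_llm: Callable[[str, str], str],
    formats: tuple[str, ...] = ("markdown", "html", "json", "csv"),
) -> tuple[str, float]:
    """Return the majority answer and the fraction of serializations that agree with it."""
    answers = [ask_llm(serialize(rows, fmt), question) for fmt in formats]
    normalized = [a.strip().lower() for a in answers]
    majority_answer, votes = Counter(normalized).most_common(1)[0]
    return majority_answer, votes / len(formats)
```

Because the table content is identical under every serialization, answer disagreement here reflects the model's sensitivity to surface form rather than genuine ambiguity, which is what makes agreement usable as a confidence signal; with one query per format, four formats also cost less than typical multi-sample baselines, consistent with the cost reduction the abstract reports.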
