How Much is Unseen Depends Chiefly on Information About the Seen

Abstract

The missing mass is the proportion of data points in an unknown population of classifier inputs that belong to classes absent from the classifier's training data, which is assumed to be a random sample from that population. We find that, in expectation, the missing mass is entirely determined by the numbers f_k of classes that appear exactly k times in the training data, up to an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem: find a distribution-specific estimator with minimal mean squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators with a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds in over 93% of runs whenever there are at least as many samples as classes, and our estimators' MSE is roughly 80% of the Good-Turing estimator's.
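For context, the Good-Turing baseline referenced in the abstract estimates the missing mass from the sample alone as f_1/n, where f_1 is the number of classes observed exactly once among the n training samples. A minimal sketch (function name and example data are illustrative, not from the paper):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass: f_1 / n,
    where f_1 is the number of classes seen exactly once."""
    n = len(sample)
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / n

# Two singleton classes ('a', 'b') among five draws -> estimate 2/5
print(good_turing_missing_mass(["a", "b", "c", "c", "c"]))  # 0.4
```

The paper's point is that estimators built from the full profile (f_1, f_2, ...) can be tuned to the sample at hand to achieve lower MSE than this fixed rule.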

View on arXiv
@article{lee2025_2402.05835,
  title={How Much is Unseen Depends Chiefly on Information About the Seen},
  author={Seongmin Lee and Marcel Böhme},
  journal={arXiv preprint arXiv:2402.05835},
  year={2025}
}