Towards a statistical theory of data selection under weak supervision

Given a sample of size N, it is often useful to select a subsample of smaller size n < N to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume we are given N unlabeled samples x_1, ..., x_N, together with access to a 'surrogate model' that can predict labels better than random guessing. Our goal is to select a subset of the samples, denoted by G, of size |G| = n < N. We then acquire labels for this set and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) data selection can be very effective, in particular beating training on the full sample in some cases; (ii) certain popular choices of data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
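To make the pipeline concrete, below is a minimal sketch of one possible instantiation (not the authors' method): a hypothetical surrogate model scores the N unlabeled points, a subset G of size n is selected (here by predicted-label uncertainty, just one of many possible scores), labels are acquired only for G, and an L2-regularized model is trained on that subset. All names and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: N unlabeled points x_i in R^d; labels are only
# acquired for the selected subset G of size n < N.
rng = np.random.default_rng(0)
N, n, d = 10_000, 1_000, 20
X = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)

def acquire_labels(X_subset):
    # Stand-in for the costly labeling step (an oracle / human annotator).
    noise = 0.5 * rng.standard_normal(len(X_subset))
    return (X_subset @ theta_true + noise > 0).astype(int)

# Surrogate model: assumed to predict labels better than chance, e.g. a weaker
# model trained on a small pilot set (simulated here as a noisy parameter vector).
theta_surrogate = theta_true + rng.standard_normal(d)
p_surrogate = 1.0 / (1.0 + np.exp(-X @ theta_surrogate))  # predicted P(y = 1 | x)

# Selection score: one common heuristic keeps the points the surrogate is
# least certain about (predicted probability closest to 1/2).
uncertainty = -np.abs(p_surrogate - 0.5)
G = np.argsort(uncertainty)[-n:]  # indices of the selected subset, |G| = n

# Acquire labels only for G, then fit a regularized ERM model
# (L2-penalized logistic regression; C is the inverse regularization strength).
y_G = acquire_labels(X[G])
model = LogisticRegression(C=1.0, max_iter=1000).fit(X[G], y_G)
```

The uncertainty score used above is only for illustration; the paper's point is precisely that the choice of selection scheme matters, and that some popular schemes (unbiased reweighted subsampling, influence function-based subsampling) can be substantially suboptimal.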