Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties

Let $\theta_0, \theta_1 \in \mathbb{R}^d$ be the population risk minimizers associated to some loss $\ell: \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$ and two distributions $P_0, P_1$ on $\mathcal{Z}$. The models are unknown, and $P_0, P_1$ can be accessed by drawing i.i.d. samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $P_0$ and $P_1$ allow one to distinguish between the two hypotheses $\theta^* = \theta_0$ and $\theta^* = \theta_1$ for a given $\theta^* \in \{\theta_0, \theta_1\}$?" Making the first steps towards answering it in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity, given by $\min\{\Delta^{-2}, \sqrt{r}/\Delta\}$ up to a constant factor; here $\Delta$ is a measure of separation between $P_0$ and $P_1$, and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) for general parametric models in the asymptotic regime; (ii) for generalized linear models in the small-sample regime ($n \lesssim r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $\theta^*$ via a certain functional of the empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow one to "resolve" the two models, that is, to recover $\theta_0, \theta_1$ up to $O(\Delta)$ prediction accuracy. These two properties allow our framework to be used in applied tasks where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot actually be recovered by the identifying agent.
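The abstract does not spell out the test beyond saying it accesses $\theta^*$ through a functional of the empirical risk. Purely as an illustrative sketch, and not the authors' actual procedure, here is one natural plug-in rule for the well-specified linear model with squared loss: under $\theta^* = \theta_0$, the population risk gradient $\mathbb{E}_{P_0}[(x^\top \theta^* - y)\,x]$ vanishes, so one can assign $\theta^*$ to whichever sample yields the smaller empirical gradient norm. All function names and the thresholding rule below are assumptions made for illustration.

```python
import numpy as np

def empirical_grad_norm(X, y, theta):
    """Norm of the empirical squared-loss risk gradient (1/n) X^T (X theta - y).

    If theta is the population risk minimizer for the distribution that
    generated (X, y), this norm is O(1/sqrt(n)) noise; otherwise it stays
    bounded away from zero, roughly at the separation level Delta.
    """
    n = X.shape[0]
    return np.linalg.norm(X.T @ (X @ theta - y) / n)

def discriminate(theta_star, X0, y0, X1, y1):
    """Illustrative discrimination rule (an assumption, not the paper's test):
    declare theta_star = theta_0 iff its empirical gradient norm is smaller
    on the sample from P_0 than on the sample from P_1.
    Returns 0 for "theta_star = theta_0", 1 for "theta_star = theta_1".
    """
    g0 = empirical_grad_norm(X0, y0, theta_star)
    g1 = empirical_grad_norm(X1, y1, theta_star)
    return 0 if g0 <= g1 else 1

# Toy usage: two linear models separated in parameter space.
rng = np.random.default_rng(0)
d, n = 20, 200
theta0 = rng.standard_normal(d)
theta1 = theta0 + 0.5          # separated models
X0 = rng.standard_normal((n, d))
X1 = rng.standard_normal((n, d))
y0 = X0 @ theta0 + 0.1 * rng.standard_normal(n)  # sample from P_0
y1 = X1 @ theta1 + 0.1 * rng.standard_normal(n)  # sample from P_1
print(discriminate(theta0, X0, y0, X1, y1))  # expected: 0
print(discriminate(theta1, X0, y0, X1, y1))  # expected: 1
```

Note that such a rule only touches $\theta^*$ through the empirical risk landscape, in the spirit of the non-disclosure property described above; the paper's actual statistic and its $\min\{\Delta^{-2}, \sqrt{r}/\Delta\}$ guarantee are more refined than this toy comparison.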