Data fission: splitting a single data point

Abstract

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X), g(X))$ is tractable? As one example, if $X = (X_1, \dots, X_n)$ and $P$ is a product distribution, then for any $m < n$ we can split the sample to define $f(X) = (X_1, \dots, X_m)$ and $g(X) = (X_{m+1}, \dots, X_n)$. Rasines and Young (2022) offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian-distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving, and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
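To make the splitting idea concrete, here is a minimal simulation of the additive-Gaussian-noise construction mentioned above, in the simplest setting of $X \sim N(\mu, \sigma^2)$ with known $\sigma$. This is an illustrative sketch, not the paper's general method: the choices of $\mu$, $\sigma$, and the symmetric split $f(X) = X + Z$, $g(X) = X - Z$ are assumptions for the example. Since $\mathrm{Cov}(X+Z, X-Z) = \mathrm{Var}(X) - \mathrm{Var}(Z) = 0$ and both parts are jointly Gaussian, $f(X)$ and $g(X)$ are independent; neither alone determines $X$, yet $(f(X) + g(X))/2 = X$ exactly.

```python
import random

random.seed(0)
mu, sigma, n = 1.5, 2.0, 100_000

# Observed data and external randomization noise, drawn independently.
X = [random.gauss(mu, sigma) for _ in range(n)]
Z = [random.gauss(0.0, sigma) for _ in range(n)]

# The two pieces of the fission: f(X) = X + Z, g(X) = X - Z.
f = [x + z for x, z in zip(X, Z)]
g = [x - z for x, z in zip(X, Z)]

# Together the two pieces reconstruct X exactly.
assert all(abs((fi + gi) / 2 - xi) < 1e-9 for fi, gi, xi in zip(f, g, X))

def corr(a, b):
    """Empirical Pearson correlation of two equal-length samples."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / m
    va = sum((x - ma) ** 2 for x in a) / m
    vb = sum((y - mb) ** 2 for y in b) / m
    return cov / (va * vb) ** 0.5

# The two pieces are theoretically independent; empirically, correlation ~ 0.
print(round(corr(f, g), 3))
```

In a post-selection workflow, $f(X)$ would be used for model selection and $g(X)$ reserved for inference, mirroring the roles the two halves play in ordinary data splitting.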
