Improving uplift model evaluation on RCT data
Estimating treatment effects is one of the most challenging and important tasks facing data analysts. Traditional statistical methods aim to estimate average treatment effects over a population. Although highly useful, such average treatment effects do not help to decide which individuals benefit most from the treatment. This is where uplift modeling becomes important: uplift models help to select the right individuals for treatment so as to maximize the overall treatment effect (uplift). A challenging problem in uplift modeling is evaluating the models. Previous literature suggests metrics such as the Qini curve and the transformed outcome mean squared error. However, these metrics suffer from high variance: their evaluations are strongly affected by random noise in the data, which makes them to a certain degree arbitrary. Recently, authors have suggested doubly-robust estimation to improve the evaluation of uplift models. However, a comprehensive theoretical analysis and empirical evidence that would justify changing current state-of-the-art uplift model evaluation procedures are missing. In this paper, we theoretically analyze the variance of uplift evaluation metrics and derive possible methods of variance reduction, one of which corresponds to the suggested doubly-robust procedure. We derive simple conditions under which the variance reduction methods improve the uplift evaluation metrics and empirically demonstrate their benefits on simulated as well as real-world data. Our paper provides strong evidence for changing the current state-of-the-art uplift evaluation routine on RCT data by using the suggested variance reduction procedures.
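To make the variance issue concrete, the following is a minimal sketch, not the paper's own procedure, comparing the standard transformed outcome with a textbook doubly-robust (AIPW-style) pseudo-outcome on simulated RCT data. The data-generating process, the function names, and the simple per-arm linear outcome models are all assumptions chosen for illustration. In practice the outcome models would be fitted on a separate fold.

```python
import numpy as np

def transformed_outcome(y, w, p):
    # Y* = Y (W - p) / (p (1 - p)); in an RCT with P(W=1) = p,
    # E[Y* | X] equals the conditional treatment effect.
    return y * (w - p) / (p * (1.0 - p))

def doubly_robust_outcome(y, w, p, mu0, mu1):
    # AIPW-style pseudo-outcome: outcome-model difference plus
    # inverse-probability-weighted residuals.
    return mu1 - mu0 + w * (y - mu1) / p - (1 - w) * (y - mu0) / (1.0 - p)

# Hypothetical simulated RCT: large baseline, small treatment effect.
rng = np.random.default_rng(0)
n, p = 20000, 0.5
x = rng.normal(size=n)
w = rng.binomial(1, p, size=n)
tau = 0.5 * x                       # true individual effect (assumed)
y = 5.0 + 2.0 * x + w * tau + rng.normal(size=n)

# Assumed outcome models: one linear fit per treatment arm
# (fitted in-sample here only to keep the sketch short).
coef0 = np.polyfit(x[w == 0], y[w == 0], 1)
coef1 = np.polyfit(x[w == 1], y[w == 1], 1)
mu0 = np.polyval(coef0, x)
mu1 = np.polyval(coef1, x)

y_to = transformed_outcome(y, w, p)
y_dr = doubly_robust_outcome(y, w, p, mu0, mu1)

# Both pseudo-outcomes target tau, but their noise around tau differs.
var_to = np.var(y_to - tau)
var_dr = np.var(y_dr - tau)
print("noise variance, transformed outcome:", var_to)
print("noise variance, doubly-robust outcome:", var_dr)
```

In this setup both pseudo-outcomes have the true effect as their mean, but the transformed outcome carries the full baseline outcome in its noise, whereas the doubly-robust version only carries the residual of the outcome models. Any evaluation metric built on these pseudo-outcomes inherits that difference in noise.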