Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

Abstract

Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on the SMOTE density. From these results, we prove that SMOTE (with its default parameter) tends to copy the original minority samples asymptotically. We confirm and illustrate this first theoretical behavior empirically on a real-world data set. Furthermore, we prove that the SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on 13 tabular data sets with 10 state-of-the-art rebalancing procedures, including deep generative and diffusion models. One of our key findings is that, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performance, whether with LightGBM, tuned random forests, or logistic regression. However, when the imbalance ratio is artificially increased, one of our two modifications of SMOTE leads to promising predictive performance compared to SMOTE and other state-of-the-art strategies.
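
For readers unfamiliar with the procedure, the following is a minimal sketch of how standard SMOTE draws one synthetic point, not the authors' code or either of their two variants. The function name `smote_sample` and the toy data are hypothetical, and the neighbor count `k=5` is assumed here because it is the usual default in common implementations; the abstract itself does not state which parameter it refers to.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Draw one synthetic point by SMOTE-style interpolation (illustrative sketch)."""
    rng = rng or random.Random()
    # Pick a central minority point uniformly at random.
    center = rng.choice(minority)
    # Rank the remaining minority points by Euclidean distance to the center.
    others = [p for p in minority if p is not center]
    others.sort(key=lambda p: math.dist(center, p))
    # Pick one of the k nearest neighbors uniformly at random.
    neighbor = rng.choice(others[:k])
    # Interpolate: the new point lies on the segment between center and neighbor.
    w = rng.random()
    return tuple(c + w * (n - c) for c, n in zip(center, neighbor))


# Hypothetical toy minority class in two dimensions.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 2.0)]
point = smote_sample(minority, k=5, rng=random.Random(0))
```

Because every synthetic point lies on a segment joining two minority samples, it always falls inside their convex hull. As the sample size grows with `k` fixed, the nearest neighbors get closer to the central point and these segments shrink, which gives some intuition for the copying behavior described above.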

@article{sakho2025_2402.03819,
  title={Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants},
  author={Abdoulaye Sakho and Emmanuel Malherbe and Erwan Scornet},
  journal={arXiv preprint arXiv:2402.03819},
  year={2025}
}