arXiv:1901.10173

Bayes Imbalance Impact Index: A Measure of Class Imbalanced Dataset for Classification Problem

29 January 2019
Yang Lu
Yiu-ming Cheung
Yuanyan Tang
Abstract

Recent studies have shown that the imbalance ratio is not the only cause of performance loss for a classifier on imbalanced data. Other data factors, such as small disjuncts, noise, and class overlap, act in tandem with the imbalance ratio and make the problem difficult. So far, empirical studies have demonstrated only the relationship between the imbalance ratio and these other data factors. To the best of our knowledge, no measure exists of the extent to which class imbalance itself influences classification performance on imbalanced data. It is also unknown, for a given dataset, which data factor is the main barrier to classification. In this paper, we focus on the Bayes optimal classifier and study the influence of class imbalance from a theoretical perspective. Accordingly, we propose an instance-level measure called the Individual Bayes Imbalance Impact Index ($IBI^3$) and a dataset-level measure called the Bayes Imbalance Impact Index ($BI^3$). $IBI^3$ and $BI^3$ reflect the extent of influence due purely to imbalance, for each minority class sample and for the whole dataset, respectively. Therefore, $IBI^3$ can be used as an instance complexity measure of imbalance, and $BI^3$ is a criterion showing how much imbalance degrades classification. As a result, $BI^3$ can be used to judge whether imbalance recovery methods, such as sampling or cost-sensitive methods, are worth applying to recover the performance loss of a classifier. Experiments on both synthetic and real benchmark datasets show that $IBI^3$ is highly consistent with the increase in prediction score produced by imbalance recovery methods, and that $BI^3$ is highly consistent with the improvement in F1 score produced by those methods.
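As a rough illustration of why the class prior alone can change what a Bayes optimal classifier predicts for a minority sample, the sketch below compares the minority-class posterior under an imbalanced prior with the posterior under a balanced prior. It is not the paper's definition of $IBI^3$ or $BI^3$, which the abstract does not state. The one-dimensional Gaussian class-conditional densities, the parameter values, and the function names (gaussian_pdf, minority_posterior, posterior_gap) are all assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; an assumed class-conditional model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def minority_posterior(x, prior_minority, mean_min=1.0, mean_maj=-1.0, std=1.0):
    """P(minority | x) by Bayes' rule for two Gaussian classes (hypothetical setup)."""
    like_min = gaussian_pdf(x, mean_min, std)
    like_maj = gaussian_pdf(x, mean_maj, std)
    weighted_min = prior_minority * like_min
    weighted_maj = (1.0 - prior_minority) * like_maj
    return weighted_min / (weighted_min + weighted_maj)


def posterior_gap(x, prior_minority):
    """Balanced-prior posterior minus imbalanced-prior posterior at x.

    Illustrative quantity only; not the IBI^3 formula from the paper.
    """
    return minority_posterior(x, 0.5) - minority_posterior(x, prior_minority)


if __name__ == "__main__":
    # Same sample, same class-conditional densities, different priors.
    for prior in (0.5, 0.25, 0.1):
        print(prior, posterior_gap(0.0, prior))
```

At x = 0 the two assumed likelihoods are equal, so the posterior equals the prior: a minority prior of 0.1 puts the sample on the majority side of the 0.5 decision threshold, while a balanced prior places it exactly on the boundary. A sample far inside the minority region would show a much smaller gap, which matches the abstract's point that imbalance affects individual minority samples to different degrees.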
