The Imbalanced Data Problem

When applying technology to real-world problems, two of the most common obstacles are noisy data and severe data imbalance, which can show up in unexpected forms. In this blog post, we would like to share our efforts to deal with data imbalance.

1 . What is imbalanced data?

1-1 . Definition

Imbalanced data is data in which the number of observations in the normal category differs greatly from the number of observations in the abnormal category.

For example, far fewer people have cancer than do not, and fraudulent credit card transactions are far rarer than normal ones.

Datasets like these are imbalanced.
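As a rough illustration of the definition, the sketch below counts the observations per category and divides the larger count by the smaller one. The labels, the 990/10 split, and the helper name imbalance_ratio are made up for this example; they do not come from a real dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (size of largest class) / (size of smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 990 normal observations and 10 abnormal ones.
labels = ["normal"] * 990 + ["abnormal"] * 10

counts = Counter(labels)
ratio = imbalance_ratio(labels)

print(counts)  # number of observations per category
print(ratio)   # how many normal observations exist per abnormal one
```

A ratio close to 1 means the categories are roughly balanced; the larger the ratio, the more severe the imbalance.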

1-2 . The point at issue

Of the two goals, classifying normal data accurately and classifying abnormal data accurately, the second is generally more important.

This is because the abnormal class is usually the target we want to detect.
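One way to see why the abnormal class deserves the attention is to compare overall accuracy with recall on the abnormal class. The sketch below uses made-up labels (0 for normal, 1 for abnormal) and a hypothetical classifier that always predicts "normal"; the helper functions are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all observations that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the positive (abnormal) observations that were found."""
    actual = sum(t == positive for t in y_true)
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / actual if actual else 0.0


# Hypothetical data: 990 normal (0) and 10 abnormal (1) observations.
y_true = [0] * 990 + [1] * 10
# A trivial classifier that labels everything as normal.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(acc)  # high, because almost everything is normal
print(rec)  # zero, because no abnormal observation was detected
```

Under these assumptions the classifier looks good on accuracy while missing every abnormal observation, which is exactly the case we care about.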

In the figure, blue marks the normal observations, red marks the abnormal observations, and gray marks the actual distribution of abnormal data.

In other words, the gray circles are abnormal data that exist but have not yet been observed.

If a model learns from only the known blue and red data, the classification boundary will be drawn as shown in the figure.

However, the gray circles on the left side of that boundary are actually abnormal data, so they are misclassified as normal.

An ideal boundary would have to be drawn between the blue circles and the gray circles.

In other words, a model trained on an imbalanced dataset may fail to identify abnormal data correctly.
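The situation described for the figure can be reproduced with a toy one-dimensional sketch. All numbers below are invented, normal data is assumed to lie to the left of abnormal data, and the boundary is simply placed halfway between the closest normal and abnormal points. That midpoint rule is a stand-in chosen for illustration, not the classifier behind the figure.

```python
def midpoint_boundary(normal, abnormal):
    """Place the boundary halfway between the classes (normal on the left)."""
    return (max(normal) + min(abnormal)) / 2


# Hypothetical 1-D data.
normal = [1.0, 1.5, 2.0, 2.5, 3.0]        # blue: observed normal data
observed_abnormal = [8.0, 9.0]            # red: the few observed abnormal data
unseen_abnormal = [4.5, 5.0, 6.0, 7.0]    # gray: abnormal data not yet observed

# Boundary learned from the observed (blue and red) data only.
learned_boundary = midpoint_boundary(normal, observed_abnormal)
# Gray points left of the learned boundary are misclassified as normal.
missed = [x for x in unseen_abnormal if x < learned_boundary]

# Boundary we would get if the gray points had been observed as well.
ideal_boundary = midpoint_boundary(normal, observed_abnormal + unseen_abnormal)
missed_ideal = [x for x in unseen_abnormal if x < ideal_boundary]

print(learned_boundary, missed)
print(ideal_boundary, missed_ideal)
```

With so few abnormal observations, the learned boundary sits too far to the right and some unseen abnormal points fall on the normal side, while the boundary between the blue and gray points misses none of them.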

Mad Scientist Yeon