Techniques and Methods of Handling Unbalanced Data

Updated: Apr 18



Author: Leonel Silima

Date of Publication: 02/02/2023



This post may contain affiliate links, which means I may receive a small commission, at no cost to you, if you make a purchase through a link.

Currently, financial management software and platforms are built on artificial intelligence algorithms. These solutions start from raw data that is pre-processed and then fed to machine learning algorithms. This raises a question, however: what problem do these applications run into, and why does it need to be handled?


Why do we need to handle unbalanced data?

One of the biggest problems with machine learning platforms and applications has been poor performance, often the result of not applying the best techniques to unbalanced data. Models trained on such data tend to be accurate on the majority class while failing on the minority class, so they carry a high risk of being right in one direction only and wrong when applied in the opposite direction. Moreover, the quality of a machine learning model starts with the quality of the data in the pre-processing stage; when this step is neglected, there is a risk of producing misleading models that make forecasting difficult. Next, we will cover the main techniques and methods for handling unbalanced data.


Unbalanced Data

Unbalanced data can be defined by the low incidence of one category within a dataset (the minority class) compared with the other categories (the majority classes), as the following image illustrates.


[Figure: illustration of class imbalance in a dataset]

However, if we develop a machine learning model without considering the disproportion between the majority and minority instances, the model falls victim to the accuracy paradox: its parameters never learn to differentiate the minority class from the other categories, so it scores well in testing but performs poorly in real situations. Below we present a predictive model of bank loan payments based on unbalanced data.


[Figure: frequency of the "pay loan" and "not pay loan" categories]

As we see in this example, the "not pay loan" category dominates, with 1,400 cases against only 200 in the "pay loan" category. A model that simply predicts "not pay loan" for every case is therefore correct for 1,400 of the 1,600 customers, an accuracy of 87.5%, while never identifying a single paying customer. Precision and recall are better measures in such cases. The underlying issue is the class imbalance between the positive class and the negative class, so the prior probabilities of these classes need to be accounted for in error analysis. Precision and recall help, but precision can also be biased by very unbalanced class priors in the test set.
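To make the accuracy paradox concrete, here is a minimal sketch using scikit-learn. Since the original loan dataset is not available, it uses synthetic data with the same proportions as an assumption. A baseline "classifier" that always predicts the majority class reaches high accuracy while its recall on the paying customers is zero:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: ~87.5% "not pay" (0), ~12.5% "pay" (1)
X, y = make_classification(n_samples=1600, weights=[0.875], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))                    # ~0.875
print("precision:", precision_score(y_test, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_test, y_pred))                      # 0.0 - no payer found
```

High accuracy here is pure illusion: the model has learned nothing about the minority class, which is exactly why precision and recall matter on unbalanced data.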


Techniques and Methods of Handling Unbalanced Data

We can subdivide the treatment of unbalanced data into two broad groups: OverSampling (increasing the minority class) and UnderSampling (reducing the majority class).


1. UnderSampling techniques

UnderSampling is a method that reduces the number of observations of the most frequent class, aiming to shrink the gap between categories. It includes the following cases (a code sketch follows the list):

  • Random UnderSampling

A technique that randomly removes data from the most frequent class, which can lead to a serious loss of information.

  • NearMiss-3

This technique adds heuristic rules based on the k-nearest-neighbors (kNN) algorithm, as implemented in the imbalanced-learn library built on top of scikit-learn. In particular, NearMiss-3 is a two-step algorithm: first, for each negative sample, its M nearest neighbors are kept; then, the positive samples selected are those whose average distance to the N nearest neighbors is the largest, as we can see in the figure below.


[Figure: minority- and majority-class samples, with average distances to nearest neighbors]

In addition, note that, unlike the previous technique, NearMiss-3 reduces the loss of information, which makes it the most recommended option for UnderSampling.
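Below is a minimal sketch of both UnderSampling techniques using the imbalanced-learn library. The synthetic dataset and parameter values are illustrative assumptions, not taken from the original article:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# Synthetic unbalanced dataset: ~1,400 majority vs ~200 minority samples
X, y = make_classification(n_samples=1600, weights=[0.875], random_state=42)
print("original:", Counter(y))

# Random UnderSampling: drop majority-class samples at random
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print("random  :", Counter(y_rus))

# NearMiss-3: select majority-class samples via the nearest-neighbor heuristic
nm3 = NearMiss(version=3, n_neighbors_ver3=3)
X_nm, y_nm = nm3.fit_resample(X, y)
print("nearmiss:", Counter(y_nm))
```

Both samplers shrink the majority class toward the size of the minority class; the difference is that NearMiss-3 chooses which majority samples to keep instead of discarding them blindly.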


2. OverSampling techniques

OverSampling consists of synthetically creating observations of the minority class, aiming to match the proportion of the categories. It gives good results when the minority class doesn't vary too widely in its parameters; otherwise, the model may become very good at identifying specific cases of the minority class rather than the category as a whole. Among its methods, the following apply (a code sketch follows the list):

  • SMOTE (Synthetic Minority Oversampling Technique) - Oversampling

SMOTE is one of the most widely used oversampling methods for the imbalance problem. It balances the class distribution by synthesizing new minority-class examples, each interpolated between an existing minority sample and one of its nearest minority neighbors, rather than simply replicating existing points. Even so, this technique has known weaknesses, such as generating noisy or overlapping samples, which motivated improved variants like the following one.

  • SMOTE + ENN

SMOTE + ENN combines a synthetic data generation technique (SMOTE) with a noisy-data cleaning technique (Edited Nearest Neighbours) to avoid class overlap. It generates synthetic data in the minority class and then cleans misclassified instances from both classes, leaving the data more standardized for machine learning algorithms.
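Here is a minimal sketch of both variants using the imbalanced-learn library, again on an assumed synthetic dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Synthetic unbalanced dataset: ~1,400 majority vs ~200 minority samples
X, y = make_classification(n_samples=1600, weights=[0.875], random_state=42)
print("original :", Counter(y))

# SMOTE: interpolate new minority samples between nearest minority neighbors
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("smote    :", Counter(y_sm))

# SMOTE + ENN: oversample, then clean misclassified points from both classes
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
print("smote+enn:", Counter(y_se))
```

Note that after SMOTE + ENN the classes are usually not exactly balanced, because the ENN cleaning step removes samples from both classes.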


Next, we observe the characteristics of the data before and after applying the SMOTE technique.


[Figure: the original dataset before oversampling]

[Figure: the same dataset after applying SMOTE (Synthetic Minority Oversampling Technique)]

Final remarks

Finally, there are many other techniques for treating unbalanced data, applicable both to UnderSampling and to OverSampling, as well as other variants of the SMOTE method. During the process of handling unbalanced data, we should consider all of the above alternatives and keep the method that gives the best results. Also note that you should not apply more than one type of transformation at the same time; this prevents loss of the original information.


In the next article we will talk about processing financial data with machine learning algorithms.


If you are looking for a secure, well-known platform to invest with, Binance is your website!



 

Reference List:

  1. Dye, S. (May 2, 2020). How to Handle SMOTE Data in Imbalanced Classification Problems. Towards Data Science. https://towardsdatascience.com/how-to-handle-smote-data-in-imbalanced-classification-problems-cf4b86e8c6a1

  2. imbalanced-learn (December 27, 2022). CondensedNearestNeighbour. Retrieved from https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.CondensedNearestNeighbour.html#imblearn.under_sampling.CondensedNearestNeighbour

  3. scikit-learn (December 30, 2022). Decision Trees. Retrieved from https://scikit-learn.org/stable/modules/tree.html

  4. Wikipedia. Accuracy paradox. https://en.wikipedia.org/wiki/Accuracy_paradox

  5. Microsoft Learn. SMOTE component reference. https://learn.microsoft.com/pt-br/azure/machine-learning/component-reference/smote

  6. Turing Talks (Medium). Dados desbalanceados: o que são e como evitá-los. https://medium.com/turing-talks/dados-desbalanceados-o-que-s%C3%A3o-e-como-evit%C3%A1-los-43df4f49732b
