Class Imbalance in a Dataset


When the response variable in a dataset contains more instances of one outcome than another, we have a class imbalance. The class with more instances is called the majority class and the class with fewer instances is called the minority class. This may not be a problem when you are training your model. If you have a good number of observations, an 80-20 split may be fine. If you have a 90-10 or worse split, you may only know whether there is a problem after you see the results of your model.

For classification problems, you need to understand how often each class appears in the response variable. One example of an imbalanced dataset is fraud detection. Your business need is to predict when a transaction is fraudulent. Your dataset may contain millions of transactions, with only a thousand of them proving to be fraudulent.
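A quick way to check those frequencies is to count the labels before training anything. The sketch below uses only the Python standard library; the `class_frequencies` helper and the toy labels (1 for fraudulent, 0 for legitimate) are made up for illustration and are not from a real dataset.

```python
from collections import Counter

def class_frequencies(labels):
    """Return {label: (count, share_of_total)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical toy labels: 1 = fraudulent, 0 = legitimate.
labels = [0] * 990 + [1] * 10

freqs = class_frequencies(labels)
for label, (n, share) in sorted(freqs.items()):
    print(f"class {label}: {n} rows ({share:.1%})")
```

If one class holds only a small share of the rows, that is the minority class to keep an eye on.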

There are two techniques we can use to help fix a class imbalance: upsampling and downsampling. In both cases we want to preserve as much of the information contained in the data as possible while removing the imbalance. Downsampling alters the majority class: you keep only part of its rows so that the split becomes more even. Upsampling is the opposite. Instead of reducing the frequency of the majority class, you artificially increase the frequency of the minority class. There are different ways to do this, ranging from randomly duplicating existing minority rows to generating synthetic ones.
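Here is a minimal sketch of the simplest form of each technique: random downsampling (dropping majority rows) and random upsampling (duplicating minority rows by sampling with replacement). The function names, the `label_of` parameter, and the toy rows are assumptions made for this example. Libraries offer more refined options, so treat this as an illustration of the idea rather than a recommended implementation.

```python
import random
from collections import Counter, defaultdict

def group_by_label(rows, label_of):
    """Split rows into lists keyed by their class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups

def downsample(rows, label_of, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced

def upsample(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until they match the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        extra = rng.choices(g, k=target - len(g))  # sampling with replacement
        balanced.extend(g + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy rows: (row_id, label), with 95 majority and 5 minority rows.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

down = downsample(rows, label_of=lambda r: r[1])
up = upsample(rows, label_of=lambda r: r[1])

print("downsampled:", Counter(label for _, label in down))
print("upsampled:  ", Counter(label for _, label in up))
```

Note the trade-off the counts make visible: downsampling throws away most of the majority rows, while upsampling keeps every original row but repeats the minority rows many times.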

Which one should you choose? Downsampling is generally more effective when working with extremely large datasets, where extremely large could mean millions of observations. Upsampling can be better when working with a small dataset, where small may mean a few thousand rows. Class balancing may require some trial and error. Building separate models on both upsampled and downsampled data, and comparing their results, will show which technique is better in a given situation.
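That trial and error might look something like the sketch below. Everything in it is an assumption chosen to keep the example self-contained: the synthetic one-feature data, the tiny k-nearest-neighbours "model", and the choice of minority-class recall and precision as the yardstick. The one design choice worth carrying over is that only the training rows are resampled, while the test rows keep their original imbalance.

```python
import heapq
import random
from collections import Counter, defaultdict

def resample(rows, mode, rng):
    """Balance (x, label) rows: 'down' trims to the smallest class, 'up' duplicates to the largest."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    sizes = [len(g) for g in groups.values()]
    out = []
    for g in groups.values():
        if mode == "down":
            out.extend(rng.sample(g, min(sizes)))
        else:
            out.extend(g + rng.choices(g, k=max(sizes) - len(g)))
    return out

def knn_predict(train, x, k=5):
    """Toy model: majority vote among the k training rows closest to x."""
    nearest = heapq.nsmallest(k, train, key=lambda row: abs(row[0] - x))
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(train, test, minority=1):
    """Return (recall, precision) for the minority class on untouched test rows."""
    tp = fp = fn = 0
    for x, label in test:
        pred = knn_predict(train, x)
        if pred == minority and label == minority:
            tp += 1
        elif pred == minority:
            fp += 1
        elif label == minority:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

rng = random.Random(42)

def make_rows(n_majority, n_minority):
    """Synthetic data: class 0 centred at 0.0, class 1 centred at 2.0."""
    majority = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_majority)]
    minority = [(rng.gauss(2.0, 1.0), 1) for _ in range(n_minority)]
    return majority + minority

train = make_rows(570, 30)
test = make_rows(570, 30)  # left imbalanced on purpose

results = {}
for mode in ("down", "up"):
    balanced_train = resample(train, mode, rng)
    results[mode] = evaluate(balanced_train, test)
    recall, precision = results[mode]
    print(f"{mode}sampled: recall={recall:.2f}, precision={precision:.2f}")
```

Which numbers come out ahead depends on the data and the model, which is exactly why it is worth running both.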
