Dealing with Imbalanced Data

27th April 2020

As we saw in the previous post, the target variable Churn is imbalanced: about 27% of the records in the Telco Customer Churn data are churners. This is not a highly imbalanced dataset; I typically only treat the imbalance when the minority class is 10% or less. But I want to use this data set to demonstrate the benefits of treating the imbalance even when it is not severe. We have several alternatives for tackling this problem:

  • Over-sampling
  • Under-sampling
  • Cost-sensitive approaches
  • SMOTE — Synthetic Minority Over-sampling Technique

There is no single best technique; generally, we need to experiment with a few before settling on one. Here we will try under-sampling, over-sampling, and SMOTE. Under-sampling, as the name suggests, removes records from the majority class. Over-sampling duplicates records from the minority class. SMOTE generates synthetic records for the minority class by interpolating between a minority example and its nearest minority-class neighbours (a minimal sketch of this idea follows).
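To make the SMOTE idea concrete, here is a minimal sketch of how one synthetic point is generated. The two points and their values are made up for illustration (they are not taken from the churn data); only the feature names echo the Telco schema:

## a minimal sketch of SMOTE's core step (illustrative values only):
## a synthetic point is placed at a random spot on the line segment
## between a minority example and one of its minority-class neighbours
x        <- c(tenure = 5, MonthlyCharges = 70)   # a minority example
neighbor <- c(tenure = 9, MonthlyCharges = 80)   # one of its k nearest minority neighbours

set.seed(1)
gap <- runif(1)                       # random value in [0, 1]
synthetic <- x + gap * (neighbor - x) # new synthetic minority point
synthetic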

We will start by loading the necessary libraries and doing some data cleaning.

library(tidyverse)
library(unbalanced)
library(caret)
library(caretEnsemble)

customer_churn_raw_tbl <- read.csv("~yourdirectory/Telco_Customer_Churn.csv")

## pre-process the data: drop rows with missing TotalCharges and
## recode Churn into a 0/1 factor for modelling
customer_processed <- customer_churn_raw_tbl %>%
  drop_na(TotalCharges) %>%
  mutate(churnind = ifelse(Churn == 'No', 0, 1)) %>%
  mutate(churnind = as.factor(churnind))
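Before balancing, it is worth confirming the roughly 27% churn rate quoted above. A quick check, assuming the pre-processing above has run:

## proportion of non-churners (0) vs. churners (1) in the cleaned data
prop.table(table(customer_processed$churnind))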

Then we proceed by creating the balanced data sets.

## predictors: drop the ID column (1) and the two churn columns (21, 22)
independent_var <- customer_processed[,-c(1,21,22)]
target_var <- customer_processed$churnind

## balance data by over-sampling the minority class
data <- ubOver(X = independent_var, Y = target_var)
os_customer <- cbind(data$X, churn = data$Y)
str(os_customer)

## balance data by under-sampling the majority class
## (method = "percPos" with perc = 40 makes positives 40% of the result)
data <- ubUnder(X = independent_var, Y = target_var, perc = 40, method = "percPos")
us_customer <- cbind(data$X, churn = data$Y)


## balance data with SMOTE (default settings)
data <- ubSMOTE(X = independent_var, Y = target_var)
smote_customer <- cbind(data$X, churn = data$Y)

We drew bar plots of the churn variable to check its distribution in each sample. The classes are much more balanced in the over-sampled, under-sampled, and SMOTE data compared with the original data.

par(mfrow=c(2,2))
barplot(table(customer_processed[,22]), main="Original Data Churn Distribution", xlab='churn')
barplot(table(os_customer[,20]), main="Over-sampled Data Churn Distribution", xlab=colnames(os_customer)[20])
barplot(table(us_customer[,20]), main="Under-sampled Data Churn Distribution", xlab=colnames(us_customer)[20])
barplot(table(smote_customer[,20]), main="SMOTE Data Churn Distribution", xlab=colnames(smote_customer)[20])

We trained Naive Bayes and Random Forest models on the over-sampled, under-sampled, and SMOTE data as well as the original data set, then used repeated 10-fold cross-validation to compare the results. The accuracy from each resample is collected, so each model has 100 accuracy values (10 repeats of 10-fold cross-validation), and we compare these accuracy distributions across the models.

### repeated 10-fold cross-validation (10 repeats)
set.seed(42)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE, savePredictions = TRUE)
metric  <- "Accuracy"
algorithmList <- c('nb', 'rf')   # Naive Bayes (klaR) and Random Forest (randomForest)

## models trained on the under-sampled data
us_models <- caretList(y = us_customer$churn, x = us_customer[,-20], metric = metric, trControl = control, methodList = algorithmList)

## models trained on the over-sampled data
os_models <- caretList(y = os_customer$churn, x = os_customer[,-20], metric = metric, trControl = control, methodList = algorithmList)

## models trained on the SMOTE data
smote_models <- caretList(y = smote_customer$churn, x = smote_customer[,-20], metric = metric, trControl = control, methodList = algorithmList)

## models trained on the original unbalanced data
original_models <- caretList(y = target_var, x = independent_var, metric = metric, trControl = control, methodList = algorithmList)

##### compare the models ######
models_compare <- resamples(list(
  us_NB = us_models$nb, os_NB = os_models$nb,
  smote_NB = smote_models$nb, unbalanced_NB = original_models$nb,
  us_RF = us_models$rf, os_RF = os_models$rf,
  smote_RF = smote_models$rf, unbalanced_RF = original_models$rf
))

# Summary of the models performances
summary(models_compare)

##### draw box plots to compare the models ######
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(models_compare, scales = scales)

The results show that the Random Forest model on the over-sampled data has the highest accuracy, around 90%. RF on the SMOTE data is a close second at about 89%, and RF on the original imbalanced data is third at about 80%.
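If you want the headline numbers without reading the boxplot, the mean accuracies can also be pulled out of the resamples summary programmatically; a small sketch:

## mean accuracy per model, extracted from the cross-validation resamples
acc_stats <- summary(models_compare)$statistics$Accuracy
sort(acc_stats[, "Mean"], decreasing = TRUE)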

End Notes

This post shows that data balancing techniques can be useful even for data sets that do not suffer badly from class imbalance. In this case the over-sampling method won, but the best-suited sampling technique varies from data set to data set, so we need to experiment. Dealing with imbalanced data is not difficult for R users, as there are many powerful packages, such as the caret package used in this post; packages such as ROSE are also available for this purpose.
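As a pointer, here is a minimal sketch of the ROSE route on the same processed data. It uses ROSE's formula interface with default settings and assumes the customer_processed data frame and churnind column from the code above, as well as that ROSE's defaults cope with this mix of numeric and factor predictors:

library(ROSE)

## ROSE draws a synthetically balanced sample (default p = 0.5)
set.seed(42)
rose_customer <- ROSE(churnind ~ ., data = customer_processed[,-c(1,21)])$data
table(rose_customer$churnind)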
