Case Studies

Customer Churn Analysis Using Statistical Data And Python Code

Mrinalini Sundar July 29, 2021

At some point, every company runs a churn analysis to understand its customer loss rate. Armed with that research, the company can then reduce customer attrition by assessing its product and how customers use it. In this post, we look at an interesting dataset from Kaggle.

Data Source

You can find the dataset here: Ecommerce Customer Churn Analysis and Prediction | Kaggle

Understanding the dataset

To understand the data better, we first copy the column descriptions from the link above. The dataset has twenty columns.

CustomerID: Unique customer ID
Churn: Churn flag
Tenure: Tenure of the customer in the organization
PreferredLoginDevice: Preferred login device of the customer
CityTier: City tier
WarehouseToHome: Distance between the warehouse and the customer's home
PreferredPaymentMode: Preferred payment method of the customer
Gender: Gender of the customer
HourSpendOnApp: Number of hours spent on the mobile application or website
NumberOfDeviceRegistered: Total number of devices registered for a particular customer
PreferedOrderCat: Preferred order category of the customer in the last month
SatisfactionScore: Customer's satisfaction score for the service
MaritalStatus: Marital status of the customer
NumberOfAddress: Total number of addresses added for a particular customer
Complain: Whether any complaint was raised in the last month
OrderAmountHikeFromlastYear: Percentage increase in order amount from last year
CouponUsed: Total number of coupons used in the last month
OrderCount: Total number of orders placed in the last month
DaySinceLastOrder: Days since the customer's last order
CashbackAmount: Average cashback in the last month

The dataset contains 5630 observations. The Churn column indicates whether a customer has churned or not.

Accordingly, Churn is the target variable, and all the other columns serve as feature variables.

After the column description and the number of observations are established, let's study the dataset.

Column Types

This dataset is slightly complex as it contains both categorical and continuous data, so we need to treat the columns differently based on their types. For the models to work, the data needs to be numeric only. Below are the columns along with their data types.

CustomerID                    int64
Churn                         int64
Tenure                        float64
PreferredLoginDevice          object
CityTier                      int64
WarehouseToHome               float64
PreferredPaymentMode          object
Gender                        object
HourSpendOnApp                float64
NumberOfDeviceRegistered      int64
PreferedOrderCat              object
SatisfactionScore             int64
MaritalStatus                 object
NumberOfAddress               int64
Complain                      int64
OrderAmountHikeFromlastYear   float64
CouponUsed                    float64
OrderCount                    float64
DaySinceLastOrder             float64
CashbackAmount                int64
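
Here is a minimal sketch of loading the data and checking the types. The file name and sheet name are assumptions based on the Kaggle download and may need adjusting to match your local copy.

    # A minimal loading sketch; the file and sheet names are assumptions.
    import pandas as pd

    df = pd.read_excel("E Commerce Dataset.xlsx", sheet_name="E Comm")
    print(df.shape)   # (rows, columns)
    print(df.dtypes)  # should match the list above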

Feature Engineering

Feature engineering refers to the manipulation of features for statistical purposes. Here, a feature is any column in the observations other than the target variable. Every machine learning model uses features for its predictions; however, not all of the available features need to be used. A few columns, for instance, have no statistical impact on the target variable. Removing unnecessary features helps improve model performance and saves time and memory when training the model.

In this case, the easiest step is to remove the CustomerID column: it is a unique identifier, so it carries no predictive information about any customer's churn.
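
Dropping the column is a one-liner, assuming the DataFrame df from the loading sketch above:

    # Remove the identifier column; it carries no predictive signal.
    df = df.drop(columns=["CustomerID"])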

The next thing to do is to handle the missing values and remove the non-impacting columns.

Handling Missing Values

This particular dataset has missing values. Fortunately, all of them fall in the numeric/continuous columns. If that were not the case, the sensible thing to do would be to drop those observations. But since the affected columns are continuous, one can fill the missing values with the mean of each column. Here is the list of columns with their missing-value counts.

Column Name                   Column Type   No. of Missing Values
Churn                         int64         0
Tenure                        float64       264
PreferredLoginDevice          object        0
CityTier                      int64         0
WarehouseToHome               float64       251
PreferredPaymentMode          object        0
Gender                        object        0
HourSpendOnApp                float64       255
NumberOfDeviceRegistered      int64         0
PreferedOrderCat              object        0
SatisfactionScore             int64         0
MaritalStatus                 object        0
NumberOfAddress               int64         0
Complain                      int64         0
OrderAmountHikeFromlastYear   float64       265
CouponUsed                    float64       256
OrderCount                    float64       258
DaySinceLastOrder             float64       307
CashbackAmount                int64         0
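
A minimal sketch of the mean imputation for the numeric columns:

    # Fill missing values in the numeric columns with the column mean.
    numeric_cols = df.select_dtypes(include=["number"]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    print(df.isnull().sum())  # every count should now be zero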

Data In A Visual Format 

To visualize the data a bit more, one can plot all the categorical variables together, as in the sketch below.
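
One way to do this, sketched here with seaborn rather than taken from the original code, is a row of count plots, one per categorical column:

    import matplotlib.pyplot as plt
    import seaborn as sns

    cat_cols = ["PreferredLoginDevice", "PreferredPaymentMode",
                "Gender", "PreferedOrderCat", "MaritalStatus"]

    # One count plot per categorical column, side by side.
    fig, axes = plt.subplots(1, len(cat_cols), figsize=(20, 4))
    for ax, col in zip(axes, cat_cols):
        sns.countplot(data=df, x=col, ax=ax)
        ax.tick_params(axis="x", rotation=45)
    plt.tight_layout()
    plt.show()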

Next, identify the outliers. Outliers are observations that deviate markedly from the majority of the data, and they can arise for various reasons. For detecting them, box plots of each continuous variable work well; for removing them, one can use the interquartile-range (IQR) rule, as sketched below.
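
A sketch of both steps follows. The list of continuous columns and the 1.5 * IQR cutoff are the usual conventions, not details confirmed by the original code:

    import matplotlib.pyplot as plt

    cont_cols = ["Tenure", "WarehouseToHome", "HourSpendOnApp",
                 "OrderAmountHikeFromlastYear", "CouponUsed",
                 "OrderCount", "DaySinceLastOrder", "CashbackAmount"]

    # One box plot per continuous column to spot outliers.
    df[cont_cols].plot(kind="box", subplots=True, layout=(2, 4),
                       figsize=(16, 8), sharex=False, sharey=False)
    plt.tight_layout()
    plt.show()

    # Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column.
    for col in cont_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]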

Standard Scaling

All the continuous variable columns need to be scaled: they sit on very different scales, and comparing or correlating them directly can be misleading. Use the StandardScaler to bring each column to zero mean and unit variance.
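
A minimal sketch, reusing the cont_cols list from the outlier step:

    from sklearn.preprocessing import StandardScaler

    # Scale each continuous column to mean 0 and standard deviation 1.
    scaler = StandardScaler()
    df[cont_cols] = scaler.fit_transform(df[cont_cols])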

Correlation in Continuous Variables

Finally, we plot a correlation heatmap of the continuous variables.

Inferences based on the above heatmap:

CouponUsed and OrderCount are, interestingly, the most strongly correlated pair, which makes sense: a user with more coupons can place more orders. However, the correlation is only 0.66, so both columns can be kept.
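
A sketch of the heatmap with seaborn:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlation heatmap of the continuous variables.
    plt.figure(figsize=(10, 8))
    sns.heatmap(df[cont_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Correlation between continuous variables")
    plt.show()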

Correlation between Categorical Values

To check how the categorical values impact the Churn variable, one can apply the chi-squared test of independence. The results reported below behave like p-values: the lower the value, the stronger the evidence that the column is associated with the target variable.

Below are the chi-squared test results:

Column                  Chi Squared Result
PreferredLoginDevice    0.982755
PreferredPaymentMode    0.999967
Gender                  0.063260
PreferedOrderCat        0.996104
MaritalStatus           0.956507

The results show that, except for Gender, every categorical column appears independent of the Churn variable, so those columns are removed from the features. Only Gender is kept in the dataset.
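
The post does not reproduce the test code, but one way to obtain such p-values is scipy's chi2_contingency on a crosstab of each column against Churn; treat this as a sketch rather than the exact procedure used:

    import pandas as pd
    from scipy.stats import chi2_contingency

    cat_cols = ["PreferredLoginDevice", "PreferredPaymentMode",
                "Gender", "PreferedOrderCat", "MaritalStatus"]

    # Chi-squared test of independence between each categorical column and Churn.
    for col in cat_cols:
        contingency = pd.crosstab(df[col], df["Churn"])
        _, p_value, _, _ = chi2_contingency(contingency)
        print(f"{col}: p-value = {p_value:.6f}")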

Labels Encoding

Label encoding is an essential step for machine learning algorithms. Machine learning models are mathematical, so they work on numbers, yet a dataset may contain non-numeric data. In this dataset, Gender is the remaining categorical variable, and one can encode it with label encoding. There are various methods for converting categorical data to numeric data.

However, the easiest here is the LabelEncoder, since the column holds only binary values. The two Gender values, Female and Male, are converted to 0 and 1.
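
A minimal sketch with scikit-learn's LabelEncoder:

    from sklearn.preprocessing import LabelEncoder

    # LabelEncoder assigns integers alphabetically: Female -> 0, Male -> 1.
    encoder = LabelEncoder()
    df["Gender"] = encoder.fit_transform(df["Gender"])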

Test Train Split

Now that the dataset is ready, it's time to split it into training and testing sets. The training data is what the model learns from: the model fits its parameters to these observations.

The test data is used to check whether the model's performance is acceptable. Once the model is trained, the test data shows how accurately it performs on unseen observations.

A 70/30 split is used here: seventy percent of the data goes to training, while thirty percent is held out for testing.
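
A sketch of the split; the random_state is an arbitrary choice for reproducibility, not a value taken from the original code:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["Churn"])  # feature variables
    y = df["Churn"]                 # target variable

    # 70% for training, 30% for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)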

Random Forest Classifier

Now that we have the train and test data, it is time to build the model, and for that one can use the Random Forest Classifier. What is this classifier?

What we are facing is a supervised classification problem, meaning every observation carries a label. There are various supervised classification algorithms, such as Naive Bayes, Decision Trees, Random Forests, Logistic Regression, and Support Vector Machines.

For accurate results, a good choice is the Random Forest Classifier, which is built from decision trees. A decision tree is a standard method of reaching a conclusion by recursively splitting the observations.

(Figure: an example of a decision tree for reaching a decision.)

Individual decision trees tend to overfit. However, when a large number of decision trees are combined, averaging their results tends to give good accuracy. The individual trees are built from multiple samples drawn from the dataset.

In this case, the Random Forest Classifier is used to train the model, with the number of estimators set to 100 via the built-in scikit-learn library. When the model is run, the result is about 95% accurate.
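
A minimal sketch of the training and evaluation step; the random_state is again an arbitrary assumption:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # 100 trees, as stated above.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")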

Can we make it better?

Three vital pieces of data are missing from our e-commerce dataset.

  1. Age of the Customer: This can add further value to the dataset as age can partition the dataset into different customer classes.
  2. Duration of Engagement: One needs to know how long customers browse the e-commerce site. 
  3. Reason for Churn: Why does a person leave the website?

Conclusion

In the above post, we analyzed data from an e-commerce website. We saw how we could engineer features for our purposes. We filled the missing values using the most straightforward method, mean imputation; however, it is worth studying more carefully how missing values should be filled in a given dataset.

The dataset is missing age and engagement duration, which are critical pieces of information. E-commerce is a relatively new business model, and older people are underrepresented on it, which makes it all the more important to know a customer's age when they leave the website. If we had age groups, we could probably create clusters of customers using the K-Means algorithm.

Similarly, duration is an important factor. Knowing how long people browse an e-commerce website can shed light on the quality of service, and duration and reason for churn are connected to each other. The reason for churn can help an e-commerce website focus on where to improve, while information about the duration of engagement can reveal whether people leave the website quickly. Many businesses are built on this model, so duration can help identify where people spend the most time and whether they are loyal to the brand.

How can MindTrades help?

This case study is only a starting point for such in-depth analysis, insights, and solutions. MindTrades Consulting Services, a leading marketing agency, specialises in such case studies for the global IT sector, including leading data integration brands such as Diyotta. From cloud migration, big data, digital transformation, agile delivery, and cyber security to analytics, MindTrades provides breakthrough ideas and prompt content delivery. For more information, check https://www.mindtrades.com

Code

All the code used for this analysis is available on GitHub: https://github.com/Mindtrades-Consulting/Customer-Churn-Analysis-Using-Statistical-Data-And-Python-Code/upload
