Predicting CLV (demo Python)
Customer Lifetime Value prediction

The problem

In this notebook we look at a Kaggle dataset (CreditCard_dataset) that covers car-insurance customer lifetime value.

Customer Lifetime Value (CLV) refers to the net profit attributed to the entire future relationship with a customer. A bank will use different predictive-analytics approaches to estimate the revenue that any customer can generate in the future. This helps banks segment their customers into specific groups based on their CLV.

Identifying customers with high future value enables the organization to maintain good relationships with them, by investing more time and resources in them: better prices, offers, discounts, customer-care services, and so on.

Finding and engaging reliable and profitable customers has always been a great challenge for banks. With increasing competition, banks need to track customer activity closely to use their resources effectively.

To solve this problem, data science is used in banking to extract actionable insights about customer behavior and expectations. Models that predict a customer's CLV help a bank make sound decisions about growth and profit.


Import the important libraries / packages

These packages are needed to load and work with the dataset.

In [1]:

import pandas as pd #to load, inspect and transform the dataset
import numpy as np #for numerical operations
import matplotlib.pyplot as plt #to visualize the dataset
import seaborn as sns #for statistical plots such as countplots and heatmaps
import sklearn.metrics as sklm #to evaluate the models

Load and explore the dataset

The data is all in one CSV file. In this next step I will load the data to see what it looks like.

In [2]:

#here we load the data
data = pd.read_csv('/kaggle/input/credit-card-data/Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
#and immediately take a look at what the dataset looks like
data.head()

Out[2]:

Customer State Customer Lifetime Value Response Coverage Education Effective To Date EmploymentStatus Gender Income Months Since Policy Inception Number of Open Complaints Number of Policies Policy Type Policy Renew Offer Type Sales Channel Total Claim Amount Vehicle Class Vehicle Size
0 BU79786 Washington 2763.519279 No Basic Bachelor 2/24/11 Employed F 56274 5 0 1 Corporate Auto Corporate L3 Offer1 Agent 384.811147 Two-Door Car Medsize
1 QZ44356 Arizona 6979.535903 No Extended Bachelor 1/31/11 Unemployed F 0 42 0 8 Personal Auto Personal L3 Offer3 Agent 1131.464935 Four-Door Car Medsize
2 AI49188 Nevada 12887.431650 No Premium Bachelor 2/19/11 Employed F 48767 38 0 2 Personal Auto Personal L3 Offer1 Agent 566.472247 Two-Door Car Medsize
3 WW63253 California 7645.861827 No Basic Bachelor 1/20/11 Unemployed M 0 65 0 7 Corporate Auto Corporate L2 Offer1 Call Center 529.881344 SUV Medsize
4 HB64268 Washington 2813.692575 No Basic Bachelor 2/3/11 Employed M 43836 44 0 1 Personal Auto Personal L1 Offer1 Agent 138.130879 Four-Door Car Medsize

5 rows × 24 columns (pandas elides the middle columns in this display)
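As an aside (not part of the original notebook), pandas hides the middle columns by default; to display all 24:

#show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)
data.head()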

In [3]:

#now let's take a closer look at the dataset

data.info()

[Output: data.info() — column names, non-null counts, and dtypes]

It seems that we have a lot of text / category information (these are of the Dtype ‘object’) and a few numerical columns (Dtypes ‘int64’ and ‘float64’).

The column ‘Customer Lifetime Value’ is the column we would like to predict.

In [4]:

data.shape

Out[4]:

(9134, 24)

The dataset consists of 9134 rows and 24 columns.

In [5]:

data.describe()

Out[5]:

Customer Lifetime Value Income Monthly Premium Auto Months Since Last Claim Months Since Policy Inception Number of Open Complaints Number of Policies Total Claim Amount
count 9134.000000 9134.000000 9134.000000 9134.000000 9134.000000 9134.000000 9134.000000 9134.000000
mean 8004.940475 37657.380009 93.219291 15.097000 48.064594 0.384388 2.966170 434.088794
std 6870.967608 30379.904734 34.407967 10.073257 27.905991 0.910384 2.390182 290.500092
min 1898.007675 0.000000 61.000000 0.000000 0.000000 0.000000 1.000000 0.099007
25% 3994.251794 0.000000 68.000000 6.000000 24.000000 0.000000 1.000000 272.258244
50% 5780.182197 33889.500000 83.000000 14.000000 48.000000 0.000000 2.000000 383.945434
75% 8962.167041 62320.000000 109.000000 23.000000 71.000000 0.000000 4.000000 547.514839
max 83325.381190 99981.000000 298.000000 35.000000 99.000000 5.000000 9.000000 2893.239678

It seems that we have some extreme outliers in the CLV and claim amounts. We will look at and handle these later on.
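As an aside (a sketch, not part of the original notebook), a common rule of thumb flags values more than 1.5 times the interquartile range above the third quartile:

#count how many CLV values exceed the usual 1.5*IQR outlier threshold
q1, q3 = data['Customer Lifetime Value'].quantile([0.25, 0.75])
iqr = q3 - q1
print((data['Customer Lifetime Value'] > q3 + 1.5 * iqr).sum())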

In [6]:

data.describe(include='O')

Out[6]:

Customer State Response Coverage Education Effective To Date EmploymentStatus Gender Location Code Marital Status Policy Type Policy Renew Offer Type Sales Channel Vehicle Class Vehicle Size
count 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134 9134
unique 9134 5 2 3 5 59 5 2 3 3 3 9 4 4 6 3
top YD27780 California No Basic Bachelor 1/10/11 Employed F Suburban Married Personal Auto Personal L3 Offer1 Agent Four-Door Car Medsize
freq 1 3150 7826 5568 2748 195 5698 4658 5779 5298 6788 3426 3752 3477 4621 6424

In [7]:

#Let's see what the options are in the text columns with two or three options (the objects)
print('Response: '+ str(data['Response'].unique()))
print('Coverage: '+ str(data['Coverage'].unique()))
print('Education: '+ str(data['Education'].unique()))
print('Employment Status: '+ str(data['EmploymentStatus'].unique()))
print('Gender: ' + str(data['Gender'].unique()))
print('Location Code: ' + str(data['Location Code'].unique()))
print('Married: ' + str(data['Marital Status'].unique()))
print('Policy Type: ' + str(data['Policy Type'].unique()))
print('Vehicle Size: ' + str(data['Vehicle Size'].unique()))

[Output: the unique values of each selected categorical column]

Customer Lifetime Value

As Customer Lifetime Value is the column we want to predict, let's explore this column in the dataset.

The formula to calculate the CLV:

[Figure: CLV formula]
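The original figure is not available here. A commonly used retention-based form (an assumption, not necessarily the exact formula shown in the figure) is:

\mathrm{CLV} = \sum_{t=1}^{T} \frac{m \cdot r^{t}}{(1+d)^{t}}

where m is the average margin per period, r the retention rate, d the discount rate, and T the prediction horizon.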

In [8]:

#As this is a numeric, thus continuous, variable, I will use a histogram to see how it is distributed.
plt.hist(data['Customer Lifetime Value'], bins = 10)
plt.title("Customer Lifetime Value") #Assign title
plt.xlabel("Value") #Assign x label
plt.ylabel("Customers") #Assign y label
plt.show()

[Output: histogram of Customer Lifetime Value]

In [9]:

plt.boxplot(data['Customer Lifetime Value'])

Out[9]:

[Output: boxplot of Customer Lifetime Value]

In [10]:

#We see some large outliers here.

#let's take a closer look at the outliers over 50000

outliers = data[data['Customer Lifetime Value'] > 50000]

outliers.head(25)

Out[10]:

Customer State Customer Lifetime Value Response Coverage Education Effective To Date EmploymentStatus Gender Income Months Since Policy Inception Number of Open Complaints Number of Policies Policy Type Policy Renew Offer Type Sales Channel Total Claim Amount Vehicle Class Vehicle Size
79 OM82309 California 58166.55351 No Basic Bachelor 2/27/11 Employed M 61321 30 1 2 Personal Auto Personal L3 Offer2 Branch 427.631210 Luxury Car Small
1974 YC54142 Washington 74228.51604 No Extended High School or Below 1/26/11 Unemployed M 0 34 0 2 Personal Auto Personal L1 Offer1 Branch 1742.400000 Luxury Car Medsize
2190 KI58952 California 51337.90677 No Premium College 2/24/11 Employed F 72794 47 1 2 Personal Auto Personal L2 Offer1 Web 50.454459 SUV Large
2908 EN65835 Arizona 58753.88046 No Premium Bachelor 1/6/11 Employed F 24964 84 0 2 Personal Auto Personal L2 Offer2 Agent 888.000000 SUV Medsize
3145 CL79250 Nevada 52811.49112 No Basic Bachelor 1/8/11 Unemployed M 0 70 0 2 Corporate Auto Corporate L2 Offer2 Agent 873.600000 Luxury Car Small
3760 AZ84403 Oregon 61850.18803 No Extended College 2/4/11 Unemployed F 0 29 0 2 Personal Auto Personal L1 Offer3 Branch 1142.400000 Luxury SUV Medsize
4126 JT47995 Arizona 60556.19213 No Extended College 1/1/11 Unemployed F 0 45 0 2 Personal Auto Personal L3 Offer1 Web 979.200000 Luxury SUV Large
4915 DU50092 Oregon 56675.93768 No Premium College 1/24/11 Employed F 77237 93 0 2 Personal Auto Personal L1 Offer4 Web 1358.400000 Luxury SUV Medsize
5279 SK66747 Washington 66025.75407 No Basic Bachelor 2/22/11 Employed M 33481 46 0 2 Personal Auto Personal L3 Offer1 Agent 1194.892002 Luxury SUV Medsize
5716 FQ61281 Oregon 83325.38119 No Extended High School or Below 1/31/11 Employed M 58958 74 0 2 Personal Auto Personal L3 Offer1 Call Center 1108.800000 Luxury Car Small
6252 BP23267 California 73225.95652 No Extended Bachelor 2/9/11 Employed F 39547 21 0 2 Personal Auto Personal L3 Offer1 Branch 969.600000 Luxury SUV Medsize
6461 OY68395 Oregon 55277.44589 No Basic High School or Below 1/30/11 Employed F 40740 60 0 2 Personal Auto Personal L2 Offer1 Web 950.400000 Luxury SUV Large
6554 AH58807 Arizona 51426.24815 No Basic College 1/9/11 Employed F 84650 39 3 2 Personal Auto Personal L2 Offer1 Agent 660.474274 Luxury Car Medsize
6569 LW64678 California 51016.06704 No Premium Master 2/19/11 Employed F 25167 76 0 2 Personal Auto Personal L3 Offer2 Agent 422.494292 SUV Small
6584 XF89906 Arizona 58207.12842 No Extended High School or Below 1/13/11 Disabled M 29295 50 0 2 Personal Auto Personal L3 Offer1 Agent 1328.839129 Luxury SUV Large
7283 KH55886 Oregon 67907.27050 No Premium Bachelor 2/5/11 Employed M 78310 18 1 2 Personal Auto Personal L1 Offer1 Agent 151.711475 Sports Car Medsize
7303 FB95288 California 64618.75715 No Extended High School or Below 1/17/11 Unemployed M 0 40 1 2 Personal Auto Personal L3 Offer1 Branch 1562.400000 Luxury Car Small
7556 JZ23377 Oregon 57520.50151 No Premium College 1/20/11 Employed F 48367 34 0 2 Personal Auto Personal L3 Offer2 Branch 772.800000 SUV Medsize
7835 QT84069 Oregon 50568.25912 No Extended Master 2/28/11 Employed M 82081 62 0 2 Personal Auto Personal L1 Offer2 Branch 753.760098 Luxury SUV Small
8825 US30122 California 61134.68307 No Basic College 2/28/11 Unemployed M 0 75 0 2 Corporate Auto Corporate L3 Offer2 Branch 2275.265075 Luxury Car Medsize

20 rows × 24 columns

In [11]:

outliers.info()

[Output: outliers.info() summary]

Only 20 of the 9134 rows have a lifetime value of more than 50000. We will leave these as-is for now.

Handling missing values

Let's continue by handling the missing values in this dataset. First, let's see where and how many missing values there are.

In [12]:

#check which columns contain missing values and how many

data.isnull().sum().sort_values(ascending = False)

Out[12]:

[Output: per-column missing-value counts, all zero]

There seem to be no missing values in this dataset.

Making the text columns Numeric

We first need to convert all columns to numeric values before we can use them in a model. This is what I will do now.

In [13]:

#First we drop the Customer column, as this is a unique identifier with no predictive value

data = data.drop(labels = ['Customer'], axis = 1)

In [14]:

#let's load the required packages

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

In [15]:

# Label-encode the categorical variables into integer codes
column_names = ['Response', 'Coverage', 'Education', 'Effective To Date', 'EmploymentStatus',
'Gender', 'Location Code', 'Marital Status',
'Policy Type', 'Policy', 'Renew Offer Type',
'Sales Channel', 'Vehicle Class', 'Vehicle Size', 'State']
for col in column_names:
    data[col] = le.fit_transform(data[col])
data.head()

Out[15]:

State Customer Lifetime Value Response Coverage Education Effective To Date EmploymentStatus Gender Income Location Code Months Since Policy Inception Number of Open Complaints Number of Policies Policy Type Policy Renew Offer Type Sales Channel Total Claim Amount Vehicle Class Vehicle Size
0 4 2763.519279 0 0 0 47 1 0 56274 1 5 0 1 0 2 0 0 384.811147 5 1
1 0 6979.535903 0 1 0 24 4 0 0 1 42 0 8 1 5 2 0 1131.464935 0 1
2 2 12887.431650 0 2 0 41 1 0 48767 1 38 0 2 1 5 0 0 566.472247 5 1
3 1 7645.861827 0 0 0 12 4 1 0 1 65 0 7 0 1 0 2 529.881344 3 1
4 4 2813.692575 0 0 0 52 1 1 43836 0 44 0 1 1 3 0 0 138.130879 0 1

5 rows × 23 columns
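As an aside (a sketch, not part of the original notebook): reusing one LabelEncoder for every column works, but each fit overwrites the previous mapping. Keeping one fitted encoder per column, as an alternative to the loop above, makes the integer codes reversible:

#keep one fitted encoder per column so the original labels can be recovered
encoders = {}
for col in column_names:
    enc = LabelEncoder()
    data[col] = enc.fit_transform(data[col])
    encoders[col] = enc
#e.g. recover the original labels for 'Coverage'
#print(encoders['Coverage'].inverse_transform(data['Coverage'].head()))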

In [16]:

data.dtypes

Out[16]:

[Output: data.dtypes listing]

As my model cannot handle floats, we will change these columns to integers.

In [17]:

data['Customer Lifetime Value'] = data['Customer Lifetime Value'].astype(int)
data['Total Claim Amount'] = data['Total Claim Amount'].astype(int)
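As an aside, astype(int) truncates the decimals toward zero; rounding first avoids a small downward bias. A one-line alternative (a sketch, not part of the original notebook):

data['Customer Lifetime Value'] = data['Customer Lifetime Value'].round().astype(int)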

Most important features

Let's continue by looking at the most important features according to two different tests. Then we will use the top ones to train and test our first model.

In [18]:

#First we need to split the dataset into the target column (y) and the independent columns (X).
#We will use the X columns to predict y in the model.
y = data['Customer Lifetime Value'] #the column we want to predict
X = data.drop(labels = ['Customer Lifetime Value'], axis = 1) #independent columns


In [19]:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#apply the SelectKBest class to score all features; the top 10 are printed below
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Name of the column','Score'] #naming the dataframe columns
print(featureScores.nlargest(10,'Score')) #print 10 best features

[Output: the 10 highest chi2 feature scores]
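As an aside (a sketch, not part of the original notebook): chi2 scoring is designed for classification targets; for a continuous target such as CLV, f_regression is the usual scoring function:

from sklearn.feature_selection import f_regression
fr = SelectKBest(score_func=f_regression, k='all').fit(X, y)
print(pd.Series(fr.scores_, index=X.columns).nlargest(10))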

In [20]:

#get the correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,10))
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

[Output: correlation heatmap of all features]

What pops out when looking at the correlations with the CLV are the columns 'Monthly Premium Auto' and 'Total Claim Amount'. These might be the best features to use.

The feature-selection methods differ a bit in which features they rank as most important. For the first test I will keep:

  • Total Claim Amount (high in both tests)
  • Monthly Premium Auto (high in both tests and the highest in the correlation)
  • Income (high in both tests)
  • Months Since Policy Inception (high in the best-features test)
  • Coverage (high in the correlation)

Machine learning Model

We want to predict a continuous number, therefore we need a linear regression model.

In [21]:

from sklearn.linear_model import LinearRegression

Split the dataset in train and test

Before we use the chosen model, we will first split the dataset into a train and a test set. This is because we want to train the model on one part of the data and check its accuracy on the held-out part.

In [22]:

from sklearn.model_selection import train_test_split
#First try with the 5 most important features
X_5 = data[['Total Claim Amount', 'Monthly Premium Auto', 'Income', 'Coverage', 'Months Since Policy Inception']] #independent columns chosen
y = data['Customer Lifetime Value'] #target column
#I want to withhold 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X_5, y, test_size=0.3, random_state=25)

In [23]:

print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)

[Output: shapes of the train and test splits]

In [24]:

#To evaluate the models, I build a helper that prints the usual regression metrics:
import math
def print_metrics(y_true, y_predicted, n_parameters):
    ## First compute R^2 and the adjusted R^2
    r2 = sklm.r2_score(y_true, y_predicted)
    r2_adj = r2 - (n_parameters - 1)/(y_true.shape[0] - n_parameters) * (1 - r2)
    ## Print the usual metrics and the R^2 values
    print('Mean Square Error = ' + str(sklm.mean_squared_error(y_true, y_predicted)))
    print('Root Mean Square Error = ' + str(math.sqrt(sklm.mean_squared_error(y_true, y_predicted))))
    print('Mean Absolute Error = ' + str(sklm.mean_absolute_error(y_true, y_predicted)))
    print('Median Absolute Error = ' + str(sklm.median_absolute_error(y_true, y_predicted)))
    print('R^2 = ' + str(r2))
    print('Adjusted R^2 = ' + str(r2_adj))
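For reference, the adjusted R² computed above corresponds to

R^2_{\mathrm{adj}} = R^2 - \frac{p-1}{n-p}\,(1-R^2)

with n the number of test samples and p the number of model parameters.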

Linear Regression on 5 features

Let’s try the model

In [25]:

# Linear regression model

model_5 = LinearRegression()

model_5.fit(X_train, y_train)

Out[25]:

LinearRegression()

In [26]:

Predictions = model_5.predict(X_test)

print_metrics(y_test, Predictions, 5)

[Output: regression metrics for the 5-feature model]

Hmmm, that is not a good result: an R² of just over 14%…

Linear Regression on all

Let's try the model on all features to see if this improves the result.

In [27]:

#I want to withhold 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=25)
print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)

[Output: shapes of the train and test splits]

In [28]:

# Linear regression model

model = LinearRegression()

model.fit(X_train, y_train)

Out[28]:

LinearRegression()

In [29]:

Predictions = model.predict(X_test)

print_metrics(y_test, Predictions, 22)

[Output: regression metrics for the all-features model]

This is even worse.

Conclusion

This model does not perform well at predicting the CLV, as the CLV data is highly skewed. To improve the prediction, we can try to normalize the distribution of the CLV column. I will try this below using Box Cox and log transforms (two different methods).

In [30]:

#to see the CLV data as is (without having the extremes removed)

data.hist('Customer Lifetime Value', bins = 10)

plt.show()

[Output: histogram of Customer Lifetime Value]

In [31]:

#Check normality with the Shapiro-Wilk test; if p < 0.05, the data is not normally distributed

clv = data['Customer Lifetime Value']

from scipy.stats import shapiro

shapiro(clv)[1]

/opt/conda/lib/python3.7/site-packages/scipy/stats/morestats.py:1676: UserWarning: p-value may not be accurate for N > 5000.

warnings.warn("p-value may not be accurate for N > 5000.")

Out[31]:

0.0

In [32]:

#the test confirms the distribution is not normal, so let's continue with the log transform

log_clv = np.log(clv)

sns.distplot(log_clv) #seaborn was already imported above

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff0ec64ca90>

[Output: distribution plot of the log-transformed CLV]
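The log transform is plotted but not evaluated here; a minimal sketch of fitting the same 5-feature linear model on the log target (an assumption, mirroring the Box Cox run below) would be:

#fit and score the 5-feature model on the log-transformed target
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X_5, log_clv, test_size=0.3, random_state=25)
model_log = LinearRegression().fit(X_train_l, y_train_l)
print_metrics(y_test_l, model_log.predict(X_test_l), 5)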

In [33]:

#apply the Box Cox transformation to the CLV column and plot the result
from scipy.stats import boxcox

boxcox_clv = boxcox(clv)[0]

sns.distplot(boxcox_clv)

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff0ec597ed0>

[Output: distribution plot of the Box Cox-transformed CLV]

The Box Cox transformation brought the distribution a bit closer to normal. Let's try our linear regression now.

In [34]:

#I want to withhold 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X_5, boxcox_clv, test_size=0.3, random_state=25)

In [35]:

model_5.fit(X_train, y_train)

Out[35]:

LinearRegression()

In [36]:

Predictions_box = model_5.predict(X_test)
print_metrics(y_test, Predictions_box, 5)

[Output: regression metrics for the 5-feature model on the Box Cox target]

We can see a slight improvement, to about 18.5% now, but we need further feature engineering to improve the result.
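Note that metrics computed on the Box Cox scale are not directly comparable with the earlier raw-scale runs. A minimal sketch (not part of the original notebook) of scoring on the original scale, using scipy's inverse transform:

#invert the Box Cox transform before scoring, so the metrics are on the original scale
from scipy.special import inv_boxcox
from scipy.stats import boxcox

_, lam = boxcox(clv) #recover the lambda fitted by the transform
print_metrics(inv_boxcox(y_test, lam), inv_boxcox(Predictions_box, lam), 5)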

Source: https://www.kaggle.com/renatevankempen/predicting-clv
