{"id":1673,"date":"2021-12-28T10:23:54","date_gmt":"2021-12-28T03:23:54","guid":{"rendered":"https:\/\/mintea.blog\/?p=1673"},"modified":"2021-12-28T11:23:44","modified_gmt":"2021-12-28T04:23:44","slug":"1673","status":"publish","type":"post","link":"https:\/\/mintea.blog\/?p=1673","title":{"rendered":"Predicting CLV (demo Python)"},"content":{"rendered":"<p><strong>Predicting CLV (demo Python)<\/strong><\/p>\n<p>Customer Lifetime Value prediction<\/p>\n<p>The problem<\/p>\n<p>In this notebook we look at the data from this\u00a0<a href=\"https:\/\/drive.google.com\/drive\/folders\/12CuCcULCRP3Y59nBjGDjkNOQogjj9EQb?usp=sharing\">Kaggle dataset (CreditCard_dataset)<\/a>. It covers the customer lifetime value of car insurance customers.<\/p>\n<p>Customer Lifetime Value (CLV) refers to the net profit attributed to the entire future relationship with a customer. A bank will use different predictive analytics approaches to predict the revenue that can be generated from any customer in the future. This helps banks segment their customers into specific groups based on their CLV.<\/p>\n<p>Identifying customers with a high future value enables the organization to maintain good relationships with them, for example by investing more time and resources in them through better prices, offers, discounts, customer care services, etc.<\/p>\n<p>Finding and engaging reliable and profitable customers has always been a great challenge for banks. With increasing competition, banks need to keep track of their customers' activity in order to use their resources effectively.<\/p>\n<p>To solve this problem, Data Science in banking is being used to extract actionable insights into customer behaviors and expectations. 
Using Data Science models for predicting the CLV of a customer will help a bank to take some suitable decisions for their growth and profit.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"461\" height=\"310\" class=\"wp-image-1675\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-1.png\" alt=\"CLV\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-1.png 461w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-1-300x202.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-1-150x100.png 150w\" sizes=\"auto, (max-width: 461px) 100vw, 461px\" \/><\/p>\n<p>Import the important libraries \/ packages<\/p>\n<p>These packages are needed to load and use the dataset<\/p>\n<p><strong>In\u00a0[1]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pandas<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pd<\/span> <span style=\"color: #888888;\">#we use this to load, read and transform the dataset<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">numpy<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">np<\/span> <span style=\"color: #888888;\">#we use this for statistical analysis<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">matplotlib.pyplot<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">plt<\/span> <span style=\"color: #888888;\">#we use this to visualize the 
dataset<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">seaborn<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sns<\/span> <span style=\"color: #888888;\">#we use this to make countplots<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.metrics<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklm<\/span> <span style=\"color: #888888;\">#we use this to evaluate the models<\/span>\r\n<\/pre>\n<\/div>\n<p>Load and explore the dataset<\/p>\n<p>The data is all in one CSV file. In this next step I will first load the data to see what it looks like.<\/p>\n<p><strong>In\u00a0[2]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#here we load the data<\/span>\r\ndata <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>read_csv(<span style=\"background-color: #fff0f0;\">'\/kaggle\/input\/credit-card-data\/Fn-UseC_-Marketing-Customer-Value-Analysis.csv'<\/span>)\r\n<span style=\"color: #888888;\">#and immediately I would like to see what this dataset looks like<\/span>\r\ndata<span style=\"color: #333333;\">.<\/span>head()\r\n<\/pre>\n<\/div>\n<p><strong>Out[2]:<\/strong><\/p>\n<div style=\"overflow-x: auto;\">\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Customer<\/th>\n<th>State<\/th>\n<th>Customer Lifetime Value<\/th>\n<th>Response<\/th>\n<th>Coverage<\/th>\n<th>Education<\/th>\n<th>Effective To Date<\/th>\n<th>EmploymentStatus<\/th>\n<th>Gender<\/th>\n<th>Income<\/th>\n<th>&#8230;<\/th>\n<th>Months Since Policy Inception<\/th>\n<th>Number of Open 
Complaints<\/th>\n<th>Number of Policies<\/th>\n<th>Policy Type<\/th>\n<th>Policy<\/th>\n<th>Renew Offer Type<\/th>\n<th>Sales Channel<\/th>\n<th>Total Claim Amount<\/th>\n<th>Vehicle Class<\/th>\n<th>Vehicle Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>BU79786<\/td>\n<td>Washington<\/td>\n<td>2763.519279<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>2\/24\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>56274<\/td>\n<td>&#8230;<\/td>\n<td>5<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>Corporate Auto<\/td>\n<td>Corporate L3<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>384.811147<\/td>\n<td>Two-Door Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>QZ44356<\/td>\n<td>Arizona<\/td>\n<td>6979.535903<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>Bachelor<\/td>\n<td>1\/31\/11<\/td>\n<td>Unemployed<\/td>\n<td>F<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>42<\/td>\n<td>0<\/td>\n<td>8<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer3<\/td>\n<td>Agent<\/td>\n<td>1131.464935<\/td>\n<td>Four-Door Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>AI49188<\/td>\n<td>Nevada<\/td>\n<td>12887.431650<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>Bachelor<\/td>\n<td>2\/19\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>48767<\/td>\n<td>&#8230;<\/td>\n<td>38<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>566.472247<\/td>\n<td>Two-Door Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>WW63253<\/td>\n<td>California<\/td>\n<td>7645.861827<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>1\/20\/11<\/td>\n<td>Unemployed<\/td>\n<td>M<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>65<\/td>\n<td>0<\/td>\n<td>7<\/td>\n<td>Corporate Auto<\/td>\n<td>Corporate L2<\/td>\n<td>Offer1<\/td>\n<td>Call 
Center<\/td>\n<td>529.881344<\/td>\n<td>SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>HB64268<\/td>\n<td>Washington<\/td>\n<td>2813.692575<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>2\/3\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>43836<\/td>\n<td>&#8230;<\/td>\n<td>44<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>138.130879<\/td>\n<td>Four-Door Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>5 rows \u00d7 24 columns<\/p>\n<p><strong>In\u00a0[3]:<\/strong><\/p>\n<p><em>#now let&#8217;s look closer at the dataset we got<\/em><\/p>\n<p><a id=\"post-1673-kln-12\"><\/a> data.info()<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"753\" height=\"753\" class=\"wp-image-1706\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-2.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-2.png 753w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-2-300x300.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-2-150x150.png 150w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-2-100x100.png 100w\" sizes=\"auto, (max-width: 753px) 100vw, 753px\" \/><\/p>\n<p>It seems that we have a lot of text \/ category information (these are of the Dtype &#8216;object&#8217;) and a few numerical columns (Dtypes &#8216;int64&#8217; and &#8216;float64&#8217;).<\/p>\n<p>The column &#8216;Customer Lifetime Value&#8217; is the column we would like to predict.<\/p>\n<p><strong>In\u00a0[4]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-13\"><\/a> data.shape<\/p>\n<p><strong>Out[4]:<\/strong><\/p>\n<p>(9134, 24)<\/p>\n<p>The dataset consists of 9134 rows and 24 
columns.<\/p>\n<p><strong>In\u00a0[5]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-14\"><\/a> data.describe()<\/p>\n<p><strong>Out[5]:<\/strong><\/p>\n<div style=\"overflow-x: auto;\">\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Customer Lifetime Value<\/th>\n<th>Income<\/th>\n<th>Monthly Premium Auto<\/th>\n<th>Months Since Last Claim<\/th>\n<th>Months Since Policy Inception<\/th>\n<th>Number of Open Complaints<\/th>\n<th>Number of Policies<\/th>\n<th>Total Claim Amount<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>count<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<td>9134.000000<\/td>\n<\/tr>\n<tr>\n<td>mean<\/td>\n<td>8004.940475<\/td>\n<td>37657.380009<\/td>\n<td>93.219291<\/td>\n<td>15.097000<\/td>\n<td>48.064594<\/td>\n<td>0.384388<\/td>\n<td>2.966170<\/td>\n<td>434.088794<\/td>\n<\/tr>\n<tr>\n<td>std<\/td>\n<td>6870.967608<\/td>\n<td>30379.904734<\/td>\n<td>34.407967<\/td>\n<td>10.073257<\/td>\n<td>27.905991<\/td>\n<td>0.910384<\/td>\n<td>2.390182<\/td>\n<td>290.500092<\/td>\n<\/tr>\n<tr>\n<td>min<\/td>\n<td>1898.007675<\/td>\n<td>0.000000<\/td>\n<td>61.000000<\/td>\n<td>0.000000<\/td>\n<td>0.000000<\/td>\n<td>0.000000<\/td>\n<td>1.000000<\/td>\n<td>0.099007<\/td>\n<\/tr>\n<tr>\n<td>25%<\/td>\n<td>3994.251794<\/td>\n<td>0.000000<\/td>\n<td>68.000000<\/td>\n<td>6.000000<\/td>\n<td>24.000000<\/td>\n<td>0.000000<\/td>\n<td>1.000000<\/td>\n<td>272.258244<\/td>\n<\/tr>\n<tr>\n<td>50%<\/td>\n<td>5780.182197<\/td>\n<td>33889.500000<\/td>\n<td>83.000000<\/td>\n<td>14.000000<\/td>\n<td>48.000000<\/td>\n<td>0.000000<\/td>\n<td>2.000000<\/td>\n<td>383.945434<\/td>\n<\/tr>\n<tr>\n<td>75%<\/td>\n<td>8962.167041<\/td>\n<td>62320.000000<\/td>\n<td>109.000000<\/td>\n<td>23.000000<\/td>\n<td>71.000000<\/td>\n<td>0.000000<\/td>\n<td>4.000000<\/td>\n<td>547.514839<\/td>\n<\/tr>\n<tr>\n<td>max<\/td>\n<td>83325.381190<\/td>\n<td>99981.000000<\/td>\n<td>298.000000<\
/td>\n<td>35.000000<\/td>\n<td>99.000000<\/td>\n<td>5.000000<\/td>\n<td>9.000000<\/td>\n<td>2893.239678<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>It seems that we have some extreme outliers for the CLV and claim amounts. We will look at these and handle them later on.<\/p>\n<p><strong>In\u00a0[6]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-15\"><\/a> data.describe(include='O')<\/p>\n<p><strong>Out[6]:<\/strong><\/p>\n<div style=\"overflow-x: auto;\">\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Customer<\/th>\n<th>State<\/th>\n<th>Response<\/th>\n<th>Coverage<\/th>\n<th>Education<\/th>\n<th>Effective To Date<\/th>\n<th>EmploymentStatus<\/th>\n<th>Gender<\/th>\n<th>Location Code<\/th>\n<th>Marital Status<\/th>\n<th>Policy Type<\/th>\n<th>Policy<\/th>\n<th>Renew Offer Type<\/th>\n<th>Sales Channel<\/th>\n<th>Vehicle Class<\/th>\n<th>Vehicle Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>count<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<td>9134<\/td>\n<\/tr>\n<tr>\n<td>unique<\/td>\n<td>9134<\/td>\n<td>5<\/td>\n<td>2<\/td>\n<td>3<\/td>\n<td>5<\/td>\n<td>59<\/td>\n<td>5<\/td>\n<td>2<\/td>\n<td>3<\/td>\n<td>3<\/td>\n<td>3<\/td>\n<td>9<\/td>\n<td>4<\/td>\n<td>4<\/td>\n<td>6<\/td>\n<td>3<\/td>\n<\/tr>\n<tr>\n<td>top<\/td>\n<td>YD27780<\/td>\n<td>California<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>1\/10\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>Suburban<\/td>\n<td>Married<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>Four-Door 
Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>freq<\/td>\n<td>1<\/td>\n<td>3150<\/td>\n<td>7826<\/td>\n<td>5568<\/td>\n<td>2748<\/td>\n<td>195<\/td>\n<td>5698<\/td>\n<td>4658<\/td>\n<td>5779<\/td>\n<td>5298<\/td>\n<td>6788<\/td>\n<td>3426<\/td>\n<td>3752<\/td>\n<td>3477<\/td>\n<td>4621<\/td>\n<td>6424<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p><strong>In\u00a0[7]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#Let's see what the options are in the text columns with two or three options (the objects)<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Response: '<\/span><span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Response'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Coverage: '<\/span><span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Coverage'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Education: '<\/span><span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Education'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Employment Status: '<\/span><span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: 
#fff0f0;\">'EmploymentStatus'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Gender: '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Gender'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Location Code: '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Location Code'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Married: '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Marital Status'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Policy Type: '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Policy Type'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Vehicle Size: '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(data[<span style=\"background-color: #fff0f0;\">'Vehicle Size'<\/span>]<span style=\"color: #333333;\">.<\/span>unique()))\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"791\" height=\"235\" class=\"wp-image-1708\" 
src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-3.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-3.png 791w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-3-300x89.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-3-768x228.png 768w\" sizes=\"auto, (max-width: 791px) 100vw, 791px\" \/><\/p>\n<p>Customer Lifetime Value<\/p>\n<p>As Customer Lifetime Value is the column we want to predict, let&#8217;s explore this column in the dataset.<\/p>\n<p>The formula to calculate the CLV:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1430\" height=\"881\" class=\"wp-image-1676\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula.jpeg\" alt=\"CLV formula\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula.jpeg 1430w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula-300x185.jpeg 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula-1024x631.jpeg 1024w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula-768x473.jpeg 768w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/clv-formula-180x110.jpeg 180w\" sizes=\"auto, (max-width: 1430px) 100vw, 1430px\" \/><\/p>\n<p><strong>In\u00a0[8]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#As this is a numeric, thus continuous, column, I will use a histogram to look at its distribution.<\/span>\r\nplt<span style=\"color: #333333;\">.<\/span>hist(data[<span style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>], bins <span style=\"color: 
#333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">10<\/span>)\r\nplt<span style=\"color: #333333;\">.<\/span>title(<span style=\"background-color: #fff0f0;\">\"Customer Lifetime Value\"<\/span>) <span style=\"color: #888888;\">#Assign title<\/span>\r\nplt<span style=\"color: #333333;\">.<\/span>xlabel(<span style=\"background-color: #fff0f0;\">\"Value\"<\/span>) <span style=\"color: #888888;\">#Assign x label<\/span>\r\nplt<span style=\"color: #333333;\">.<\/span>ylabel(<span style=\"background-color: #fff0f0;\">\"Customers\"<\/span>) <span style=\"color: #888888;\">#Assign y label<\/span>\r\nplt<span style=\"color: #333333;\">.<\/span>show()\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"395\" height=\"278\" class=\"wp-image-1677\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-6.png\" alt=\"Chart, histogram Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-6.png 395w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-6-300x211.png 300w\" sizes=\"auto, (max-width: 395px) 100vw, 395px\" \/><\/p>\n<p><strong>In\u00a0[9]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-32\"><\/a> plt.boxplot(data['Customer Lifetime Value'])<\/p>\n<p><strong>Out[9]:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"219\" class=\"wp-image-1709\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-4.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-4.png 771w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-4-300x85.png 300w, 
https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-4-768x218.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"388\" height=\"248\" class=\"wp-image-1678\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-box-and-whisker-chart-descri.png\" alt=\"A picture containing box and whisker chart Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-box-and-whisker-chart-descri.png 388w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-box-and-whisker-chart-descri-300x192.png 300w\" sizes=\"auto, (max-width: 388px) 100vw, 388px\" \/><\/p>\n<p><strong>In\u00a0[10]:<\/strong><\/p>\n<p><em>#We see that there are some large outliers here.<\/em><\/p>\n<p><em>#let's take a closer look at the outliers above 50000<\/em><\/p>\n<p><a id=\"post-1673-kln-35\"><\/a> outliers = data[data['Customer Lifetime Value'] &gt; 50000]<\/p>\n<p><a id=\"post-1673-kln-36\"><\/a> outliers.head(25)<\/p>\n<p><strong>Out[10]:<\/strong><\/p>\n<div style=\"overflow-x: auto;\">\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Customer<\/th>\n<th>State<\/th>\n<th>Customer Lifetime Value<\/th>\n<th>Response<\/th>\n<th>Coverage<\/th>\n<th>Education<\/th>\n<th>Effective To Date<\/th>\n<th>EmploymentStatus<\/th>\n<th>Gender<\/th>\n<th>Income<\/th>\n<th>&#8230;<\/th>\n<th>Months Since Policy Inception<\/th>\n<th>Number of Open Complaints<\/th>\n<th>Number of Policies<\/th>\n<th>Policy Type<\/th>\n<th>Policy<\/th>\n<th>Renew Offer Type<\/th>\n<th>Sales Channel<\/th>\n<th>Total Claim Amount<\/th>\n<th>Vehicle Class<\/th>\n<th>Vehicle 
Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>79<\/td>\n<td>OM82309<\/td>\n<td>California<\/td>\n<td>58166.55351<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>2\/27\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>61321<\/td>\n<td>&#8230;<\/td>\n<td>30<\/td>\n<td>1<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer2<\/td>\n<td>Branch<\/td>\n<td>427.631210<\/td>\n<td>Luxury Car<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>1974<\/td>\n<td>YC54142<\/td>\n<td>Washington<\/td>\n<td>74228.51604<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>High School or Below<\/td>\n<td>1\/26\/11<\/td>\n<td>Unemployed<\/td>\n<td>M<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>34<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer1<\/td>\n<td>Branch<\/td>\n<td>1742.400000<\/td>\n<td>Luxury Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>2190<\/td>\n<td>KI58952<\/td>\n<td>California<\/td>\n<td>51337.90677<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>College<\/td>\n<td>2\/24\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>72794<\/td>\n<td>&#8230;<\/td>\n<td>47<\/td>\n<td>1<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L2<\/td>\n<td>Offer1<\/td>\n<td>Web<\/td>\n<td>50.454459<\/td>\n<td>SUV<\/td>\n<td>Large<\/td>\n<\/tr>\n<tr>\n<td>2908<\/td>\n<td>EN65835<\/td>\n<td>Arizona<\/td>\n<td>58753.88046<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>Bachelor<\/td>\n<td>1\/6\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>24964<\/td>\n<td>&#8230;<\/td>\n<td>84<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L2<\/td>\n<td>Offer2<\/td>\n<td>Agent<\/td>\n<td>888.000000<\/td>\n<td>SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>3145<\/td>\n<td>CL79250<\/td>\n<td>Nevada<\/td>\n<td>52811.49112<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>1\/8\/11<\/td>\n<td>Unemployed<\/td>\n<td>M<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>70<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Corporate 
Auto<\/td>\n<td>Corporate L2<\/td>\n<td>Offer2<\/td>\n<td>Agent<\/td>\n<td>873.600000<\/td>\n<td>Luxury Car<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>3760<\/td>\n<td>AZ84403<\/td>\n<td>Oregon<\/td>\n<td>61850.18803<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>College<\/td>\n<td>2\/4\/11<\/td>\n<td>Unemployed<\/td>\n<td>F<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>29<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer3<\/td>\n<td>Branch<\/td>\n<td>1142.400000<\/td>\n<td>Luxury SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>4126<\/td>\n<td>JT47995<\/td>\n<td>Arizona<\/td>\n<td>60556.19213<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>College<\/td>\n<td>1\/1\/11<\/td>\n<td>Unemployed<\/td>\n<td>F<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>45<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Web<\/td>\n<td>979.200000<\/td>\n<td>Luxury SUV<\/td>\n<td>Large<\/td>\n<\/tr>\n<tr>\n<td>4915<\/td>\n<td>DU50092<\/td>\n<td>Oregon<\/td>\n<td>56675.93768<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>College<\/td>\n<td>1\/24\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>77237<\/td>\n<td>&#8230;<\/td>\n<td>93<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer4<\/td>\n<td>Web<\/td>\n<td>1358.400000<\/td>\n<td>Luxury SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>5279<\/td>\n<td>SK66747<\/td>\n<td>Washington<\/td>\n<td>66025.75407<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>Bachelor<\/td>\n<td>2\/22\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>33481<\/td>\n<td>&#8230;<\/td>\n<td>46<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>1194.892002<\/td>\n<td>Luxury SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>5716<\/td>\n<td>FQ61281<\/td>\n<td>Oregon<\/td>\n<td>83325.38119<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>High School or 
Below<\/td>\n<td>1\/31\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>58958<\/td>\n<td>&#8230;<\/td>\n<td>74<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Call Center<\/td>\n<td>1108.800000<\/td>\n<td>Luxury Car<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>6252<\/td>\n<td>BP23267<\/td>\n<td>California<\/td>\n<td>73225.95652<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>Bachelor<\/td>\n<td>2\/9\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>39547<\/td>\n<td>&#8230;<\/td>\n<td>21<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Branch<\/td>\n<td>969.600000<\/td>\n<td>Luxury SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>6461<\/td>\n<td>OY68395<\/td>\n<td>Oregon<\/td>\n<td>55277.44589<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>High School or Below<\/td>\n<td>1\/30\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>40740<\/td>\n<td>&#8230;<\/td>\n<td>60<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L2<\/td>\n<td>Offer1<\/td>\n<td>Web<\/td>\n<td>950.400000<\/td>\n<td>Luxury SUV<\/td>\n<td>Large<\/td>\n<\/tr>\n<tr>\n<td>6554<\/td>\n<td>AH58807<\/td>\n<td>Arizona<\/td>\n<td>51426.24815<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>College<\/td>\n<td>1\/9\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>84650<\/td>\n<td>&#8230;<\/td>\n<td>39<\/td>\n<td>3<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L2<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>660.474274<\/td>\n<td>Luxury Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>6569<\/td>\n<td>LW64678<\/td>\n<td>California<\/td>\n<td>51016.06704<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>Master<\/td>\n<td>2\/19\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>25167<\/td>\n<td>&#8230;<\/td>\n<td>76<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal 
L3<\/td>\n<td>Offer2<\/td>\n<td>Agent<\/td>\n<td>422.494292<\/td>\n<td>SUV<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>6584<\/td>\n<td>XF89906<\/td>\n<td>Arizona<\/td>\n<td>58207.12842<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>High School or Below<\/td>\n<td>1\/13\/11<\/td>\n<td>Disabled<\/td>\n<td>M<\/td>\n<td>29295<\/td>\n<td>&#8230;<\/td>\n<td>50<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>1328.839129<\/td>\n<td>Luxury SUV<\/td>\n<td>Large<\/td>\n<\/tr>\n<tr>\n<td>7283<\/td>\n<td>KH55886<\/td>\n<td>Oregon<\/td>\n<td>67907.27050<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>Bachelor<\/td>\n<td>2\/5\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>78310<\/td>\n<td>&#8230;<\/td>\n<td>18<\/td>\n<td>1<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer1<\/td>\n<td>Agent<\/td>\n<td>151.711475<\/td>\n<td>Sports Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>7303<\/td>\n<td>FB95288<\/td>\n<td>California<\/td>\n<td>64618.75715<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>High School or Below<\/td>\n<td>1\/17\/11<\/td>\n<td>Unemployed<\/td>\n<td>M<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>40<\/td>\n<td>1<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L3<\/td>\n<td>Offer1<\/td>\n<td>Branch<\/td>\n<td>1562.400000<\/td>\n<td>Luxury Car<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>7556<\/td>\n<td>JZ23377<\/td>\n<td>Oregon<\/td>\n<td>57520.50151<\/td>\n<td>No<\/td>\n<td>Premium<\/td>\n<td>College<\/td>\n<td>1\/20\/11<\/td>\n<td>Employed<\/td>\n<td>F<\/td>\n<td>48367<\/td>\n<td>&#8230;<\/td>\n<td>34<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal 
L3<\/td>\n<td>Offer2<\/td>\n<td>Branch<\/td>\n<td>772.800000<\/td>\n<td>SUV<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<tr>\n<td>7835<\/td>\n<td>QT84069<\/td>\n<td>Oregon<\/td>\n<td>50568.25912<\/td>\n<td>No<\/td>\n<td>Extended<\/td>\n<td>Master<\/td>\n<td>2\/28\/11<\/td>\n<td>Employed<\/td>\n<td>M<\/td>\n<td>82081<\/td>\n<td>&#8230;<\/td>\n<td>62<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Personal Auto<\/td>\n<td>Personal L1<\/td>\n<td>Offer2<\/td>\n<td>Branch<\/td>\n<td>753.760098<\/td>\n<td>Luxury SUV<\/td>\n<td>Small<\/td>\n<\/tr>\n<tr>\n<td>8825<\/td>\n<td>US30122<\/td>\n<td>California<\/td>\n<td>61134.68307<\/td>\n<td>No<\/td>\n<td>Basic<\/td>\n<td>College<\/td>\n<td>2\/28\/11<\/td>\n<td>Unemployed<\/td>\n<td>M<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>75<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>Corporate Auto<\/td>\n<td>Corporate L3<\/td>\n<td>Offer2<\/td>\n<td>Branch<\/td>\n<td>2275.265075<\/td>\n<td>Luxury Car<\/td>\n<td>Medsize<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>20 rows \u00d7 24 columns<\/p>\n<p><strong>In\u00a0[11]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-37\"><\/a> outliers.info()<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"780\" height=\"749\" class=\"wp-image-1710\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-5.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-5.png 780w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-5-300x288.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-5-768x737.png 768w\" sizes=\"auto, (max-width: 780px) 100vw, 780px\" \/><\/p>\n<p>Looks like there are only 20 rows of the 9134 rows that have a lifetime value of more than 50000. 
We will leave this as is for now<\/p>\n<p>Handling missing values<\/p>\n<p>Let&#8217;s continue with handling the missing values in this dataset. Let&#8217;s see where and how many missing values there are in this dataset.<\/p>\n<p><strong>In\u00a0[12]:<\/strong><\/p>\n<p><em>#let&#8217;s look in what columns there are missing values<\/em><\/p>\n<p><a id=\"post-1673-kln-39\"><\/a> data.isnull().sum().sort_values(ascending = False)<\/p>\n<p><strong>Out[12]:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"796\" height=\"609\" class=\"wp-image-1711\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-6.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-6.png 796w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-6-300x230.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-6-768x588.png 768w\" sizes=\"auto, (max-width: 796px) 100vw, 796px\" \/><\/p>\n<p>There seem to be no missing values in this dataset.<\/p>\n<p>Making the text columns Numeric<\/p>\n<p>We first need to make all column input numeric to use them further on. 
This is what I will do now.<\/p>\n<p><strong>In\u00a0[13]:<\/strong><\/p>\n<p><em>#First we drop the Customer column, as it is a unique identifier and has no predictive value<\/em><\/p>\n<p><a id=\"post-1673-kln-41\"><\/a> data = data.drop(labels = ['Customer'], axis = 1)<\/p>\n<p><strong>In\u00a0[14]:<\/strong><\/p>\n<p><em>#let&#8217;s load the required packages<\/em><\/p>\n<p>from sklearn.preprocessing import <a id=\"post-1673-kln-43\"><\/a>LabelEncoder<\/p>\n<p><a id=\"post-1673-kln-44\"><\/a> le = LabelEncoder()<\/p>\n<p><strong>In\u00a0[15]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Let's encode the categorical variables as integer codes<\/span>\r\ncolumn_names <span style=\"color: #333333;\">=<\/span> [<span style=\"background-color: #fff0f0;\">'Response'<\/span>, <span style=\"background-color: #fff0f0;\">'Coverage'<\/span>, <span style=\"background-color: #fff0f0;\">'Education'<\/span>, <span style=\"background-color: #fff0f0;\">'Effective To Date'<\/span>, <span style=\"background-color: #fff0f0;\">'EmploymentStatus'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'Gender'<\/span>, <span style=\"background-color: #fff0f0;\">'Location Code'<\/span>, <span style=\"background-color: #fff0f0;\">'Marital Status'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'Policy Type'<\/span>, <span style=\"background-color: #fff0f0;\">'Policy'<\/span>, <span style=\"background-color: #fff0f0;\">'Renew Offer Type'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'Sales Channel'<\/span>, <span style=\"background-color: #fff0f0;\">'Vehicle Class'<\/span>, <span style=\"background-color: #fff0f0;\">'Vehicle Size'<\/span>, <span style=\"background-color: #fff0f0;\">'State'<\/span>]\r\n<span style=\"color: #008800; font-weight: bold;\">for<\/span> 
col <span style=\"color: #000000; font-weight: bold;\">in<\/span> column_names:\r\n     data[col] <span style=\"color: #333333;\">=<\/span> le<span style=\"color: #333333;\">.<\/span>fit_transform(data[col])\r\ndata<span style=\"color: #333333;\">.<\/span>head()\r\n<\/pre>\n<\/div>\n<p><strong>Out[15]:<\/strong><\/p>\n<div style=\"overflow-x: auto;\">\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>State<\/th>\n<th>Customer Lifetime Value<\/th>\n<th>Response<\/th>\n<th>Coverage<\/th>\n<th>Education<\/th>\n<th>Effective To Date<\/th>\n<th>EmploymentStatus<\/th>\n<th>Gender<\/th>\n<th>Income<\/th>\n<th>Location Code<\/th>\n<th>&#8230;<\/th>\n<th>Months Since Policy Inception<\/th>\n<th>Number of Open Complaints<\/th>\n<th>Number of Policies<\/th>\n<th>Policy Type<\/th>\n<th>Policy<\/th>\n<th>Renew Offer Type<\/th>\n<th>Sales Channel<\/th>\n<th>Total Claim Amount<\/th>\n<th>Vehicle Class<\/th>\n<th>Vehicle Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>4<\/td>\n<td>2763.519279<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>47<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>56274<\/td>\n<td>1<\/td>\n<td>&#8230;<\/td>\n<td>5<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>384.811147<\/td>\n<td>5<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>0<\/td>\n<td>6979.535903<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>24<\/td>\n<td>4<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>&#8230;<\/td>\n<td>42<\/td>\n<td>0<\/td>\n<td>8<\/td>\n<td>1<\/td>\n<td>5<\/td>\n<td>2<\/td>\n<td>0<\/td>\n<td>1131.464935<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>2<\/td>\n<td>12887.431650<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>0<\/td>\n<td>41<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>48767<\/td>\n<td>1<\/td>\n<td>&#8230;<\/td>\n<td>38<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>1<\/td>\n<td>5<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>566.472247<\/td>\n<td>5<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>1<\/td>\n<td>7645.861827<\/td>\n<td>0
<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>12<\/td>\n<td>4<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>&#8230;<\/td>\n<td>65<\/td>\n<td>0<\/td>\n<td>7<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>2<\/td>\n<td>529.881344<\/td>\n<td>3<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>4<\/td>\n<td>2813.692575<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>52<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>43836<\/td>\n<td>0<\/td>\n<td>&#8230;<\/td>\n<td>44<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>3<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>138.130879<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>5 rows \u00d7 23 columns<\/p>\n<p><strong>In\u00a0[16]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-56\"><\/a> data.dtypes<\/p>\n<p><strong>Out[16]:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"585\" class=\"wp-image-1712\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-7.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-7.png 785w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-7-300x224.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-7-768x572.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/p>\n<p>As my model can not handle floats, we will change these to integers.<\/p>\n<p><strong>In\u00a0[17]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\">data[<span style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>] <span style=\"color: #333333;\">=<\/span> data[<span style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>]<span style=\"color: 
#333333;\">.<\/span>astype(<span style=\"color: #007020;\">int<\/span>)\r\ndata[<span style=\"background-color: #fff0f0;\">'Total Claim Amount'<\/span>] <span style=\"color: #333333;\">=<\/span> data[<span style=\"background-color: #fff0f0;\">'Total Claim Amount'<\/span>]<span style=\"color: #333333;\">.<\/span>astype(<span style=\"color: #007020;\">int<\/span>)\r\n<\/pre>\n<\/div>\n<p>Most important features<\/p>\n<p>Let&#8217;s continue by looking at the most important features according to two different tests. Then we will use the top ones to train and test our first model.<\/p>\n<p><strong>In\u00a0[18]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#First we need to split the dataset into the y-column (the target) and the independent columns (X).<\/span>\r\n<span style=\"color: #888888;\">#This is needed because the model uses the X columns to predict y.<\/span>\r\ny <span style=\"color: #333333;\">=<\/span> data[<span style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>] <span style=\"color: #888888;\">#the column we want to predict<\/span>\r\nX <span style=\"color: #333333;\">=<\/span> data<span style=\"color: #333333;\">.<\/span>drop(labels <span style=\"color: #333333;\">=<\/span> [<span style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>], axis <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span>) <span style=\"color: #888888;\">#independent columns<\/span>\r\n<\/pre>\n<\/div>\n<p><strong><br \/>\nIn\u00a0[19]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: 
bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.feature_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> SelectKBest\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.feature_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> chi2\r\n<span style=\"color: #888888;\">#apply SelectKBest class to extract top 10 best features<\/span>\r\nbestfeatures <span style=\"color: #333333;\">=<\/span> SelectKBest(score_func<span style=\"color: #333333;\">=<\/span>chi2, k<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'all'<\/span>)\r\nfit <span style=\"color: #333333;\">=<\/span> bestfeatures<span style=\"color: #333333;\">.<\/span>fit(X,y)\r\ndfscores <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>DataFrame(fit<span style=\"color: #333333;\">.<\/span>scores_)\r\ndfcolumns <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>DataFrame(X<span style=\"color: #333333;\">.<\/span>columns)\r\n<span style=\"color: #888888;\">#concat two dataframes for better visualization<\/span>\r\nfeatureScores <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>concat([dfcolumns,dfscores],axis<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>)\r\nfeatureScores<span style=\"color: #333333;\">.<\/span>columns <span style=\"color: #333333;\">=<\/span> [<span style=\"background-color: #fff0f0;\">'Name of the column'<\/span>,<span style=\"background-color: #fff0f0;\">'Score'<\/span>] <span style=\"color: #888888;\">#naming the dataframe columns<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(featureScores<span style=\"color: #333333;\">.<\/span>nlargest(<span style=\"color: #0000dd; font-weight: 
bold;\">10<\/span>,<span style=\"background-color: #fff0f0;\">'Score'<\/span>)) <span style=\"color: #888888;\">#print 10 best features<\/span>\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"806\" height=\"292\" class=\"wp-image-1713\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-8.png\" alt=\"Text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-8.png 806w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-8-300x109.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-8-768x278.png 768w\" sizes=\"auto, (max-width: 806px) 100vw, 806px\" \/><\/p>\n<p><strong>In\u00a0[20]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#get correlations of each features in dataset<\/span>\r\ncorrmat <span style=\"color: #333333;\">=<\/span> data<span style=\"color: #333333;\">.<\/span>corr()\r\ntop_corr_features <span style=\"color: #333333;\">=<\/span> corrmat<span style=\"color: #333333;\">.<\/span>index\r\nplt<span style=\"color: #333333;\">.<\/span>figure(figsize<span style=\"color: #333333;\">=<\/span>(<span style=\"color: #0000dd; font-weight: bold;\">20<\/span>,<span style=\"color: #0000dd; font-weight: bold;\">10<\/span>))\r\n<span style=\"color: #888888;\">#plot heat map<\/span>\r\ng<span style=\"color: #333333;\">=<\/span>sns<span style=\"color: #333333;\">.<\/span>heatmap(data[top_corr_features]<span style=\"color: #333333;\">.<\/span>corr(),annot<span style=\"color: #333333;\">=<\/span><span style=\"color: #007020;\">True<\/span>,cmap<span style=\"color: #333333;\">=<\/span><span style=\"background-color: 
#fff0f0;\">\"RdYlGn\"<\/span>)\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1176\" height=\"717\" class=\"wp-image-1679\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3.png\" alt=\"Timeline Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3.png 1176w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3-300x183.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3-1024x624.png 1024w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3-768x468.png 768w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/timeline-description-automatically-generated-3-180x110.png 180w\" sizes=\"auto, (max-width: 1176px) 100vw, 1176px\" \/><\/p>\n<p>What stands out in the correlations with the CLV are the columns &#8216;Monthly Premium Auto&#8217; and &#8216;Total Claim Amount&#8217;. These might be the best features to use.<\/p>\n<p>It seems that the two feature selection methods differ a bit in which feature is the most important. 
For the first test I will keep:<\/p>\n<ul>\n<li>Total Claim Amount (high in both tests)<\/li>\n<li>Monthly Premium Auto (high in both tests and the highest in the correlation)<\/li>\n<li>Income (high in both tests)<\/li>\n<li>Months Since Policy Inception (high in the best features test)<\/li>\n<li>Coverage (high in the correlation)<\/li>\n<\/ul>\n<p>Machine learning Model<\/p>\n<p>We want to predict a continuous number, so we need a regression model. We start with linear regression.<\/p>\n<p><strong>In\u00a0[21]:<\/strong><\/p>\n<p>from sklearn.linear_model import <a id=\"post-1673-kln-84\"><\/a>LinearRegression<\/p>\n<p>Split the dataset in train and test<\/p>\n<p>Before we use the chosen model, we first split the dataset into a train and a test set. We do this because we want to fit the model on the training set and then check its accuracy on the test set, which the model has not seen.<\/p>\n<p><strong>In\u00a0[22]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.model_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> train_test_split\r\n<span style=\"color: #888888;\">#First try with the 5 most important features<\/span>\r\nX_5 <span style=\"color: #333333;\">=<\/span> data[[<span style=\"background-color: #fff0f0;\">'Total Claim Amount'<\/span>, <span style=\"background-color: #fff0f0;\">'Monthly Premium Auto'<\/span>, <span style=\"background-color: #fff0f0;\">'Income'<\/span>, <span style=\"background-color: #fff0f0;\">'Coverage'<\/span>, <span style=\"background-color: #fff0f0;\">'Months Since Policy Inception'<\/span>]] <span style=\"color: #888888;\">#independent columns chosen<\/span>\r\ny <span style=\"color: #333333;\">=<\/span> data[<span 
style=\"background-color: #fff0f0;\">'Customer Lifetime Value'<\/span>] <span style=\"color: #888888;\">#target column<\/span>\r\n<span style=\"color: #888888;\">#I want to withhold 30 % of the trainset to perform the tests<\/span>\r\nX_train, X_test, y_train, y_test<span style=\"color: #333333;\">=<\/span> train_test_split(X_5,y, test_size<span style=\"color: #333333;\">=<\/span><span style=\"color: #6600ee; font-weight: bold;\">0.3<\/span> , random_state <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">25<\/span>)\r\n<\/pre>\n<\/div>\n<p><strong>In\u00a0[23]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of X_train is: '<\/span>, X_train<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of X_test is: '<\/span>, X_test<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of Y_train is: '<\/span>, y_train<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of y_test is: '<\/span>, y_test<span style=\"color: #333333;\">.<\/span>shape)\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"807\" height=\"127\" class=\"wp-image-1714\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me.png\" alt=\"Shape Description automatically generated with medium confidence\" 
srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me.png 807w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-300x47.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-768x121.png 768w\" sizes=\"auto, (max-width: 807px) 100vw, 807px\" \/><\/p>\n<p><strong>In\u00a0[24]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#To evaluate the model, I define a helper that prints the usual regression metrics:<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">math<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.metrics<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklm<\/span> <span style=\"color: #888888;\">#provides r2_score and the error metrics used below<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">def<\/span> <span style=\"color: #0066bb; font-weight: bold;\">print_metrics<\/span>(y_true, y_predicted, n_parameters):\r\n    <span style=\"color: #888888;\">## First compute R^2 and the adjusted R^2<\/span>\r\n    r2 <span style=\"color: #333333;\">=<\/span> sklm<span style=\"color: #333333;\">.<\/span>r2_score(y_true, y_predicted)\r\n    r2_adj <span style=\"color: #333333;\">=<\/span> r2 <span style=\"color: #333333;\">-<\/span> (n_parameters <span style=\"color: #333333;\">-<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span>)<span style=\"color: #333333;\">\/<\/span>(y_true<span style=\"color: #333333;\">.<\/span>shape[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>] <span style=\"color: #333333;\">-<\/span> n_parameters) <span style=\"color: #333333;\">*<\/span> (<span style=\"color: #0000dd; font-weight: bold;\">1<\/span> <span style=\"color: #333333;\">-<\/span> r2)\r\n    <span style=\"color: #888888;\">## Print the usual metrics and the R^2 values<\/span>\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span 
style=\"background-color: #fff0f0;\">'Mean Square Error = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(sklm<span style=\"color: #333333;\">.<\/span>mean_squared_error(y_true, y_predicted)))\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Root Mean Square Error = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(math<span style=\"color: #333333;\">.<\/span>sqrt(sklm<span style=\"color: #333333;\">.<\/span>mean_squared_error(y_true, y_predicted))))\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Mean Absolute Error = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(sklm<span style=\"color: #333333;\">.<\/span>mean_absolute_error(y_true, y_predicted)))\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Median Absolute Error = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(sklm<span style=\"color: #333333;\">.<\/span>median_absolute_error(y_true, y_predicted)))\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'R^2 = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(r2))\r\n    <span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Adjusted R^2 = '<\/span> <span style=\"color: #333333;\">+<\/span> <span style=\"color: #007020;\">str<\/span>(r2_adj))\r\n<\/pre>\n<\/div>\n<p>Linear Regression on 5 features<\/p>\n<p>Let&#8217;s try the model<\/p>\n<p><strong>In\u00a0[25]:<\/strong><\/p>\n<p><em># Linear regression model<\/em><\/p>\n<p><a id=\"post-1673-kln-113\"><\/a> model_5 = LinearRegression()<\/p>\n<p><a id=\"post-1673-kln-114\"><\/a> 
model_5.fit(X_train, y_train)<\/p>\n<p><strong>Out[25]:<\/strong><\/p>\n<p>LinearRegression()<\/p>\n<p><strong>In\u00a0[26]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-115\"><\/a> Predictions = model_5.predict(X_test)<\/p>\n<p><a id=\"post-1673-kln-116\"><\/a> print_metrics(y_test, Predictions, 5)<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"804\" height=\"162\" class=\"wp-image-1716\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa.png\" alt=\"Graphical user interface, text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa.png 804w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-300x60.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-768x155.png 768w\" sizes=\"auto, (max-width: 804px) 100vw, 804px\" \/><\/p>\n<p>Hmm, that is not a good result: with an R^2 of just over 14%, the model explains only a small part of the variation in the CLV.<\/p>\n<p>Linear Regression on all features<\/p>\n<p>Let&#8217;s try the model on all features to see if this improves the result.<\/p>\n<p><strong>In\u00a0[27]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#I want to withhold 30 % of the dataset to perform the tests<\/span>\r\nX_train, X_test, y_train, y_test<span style=\"color: #333333;\">=<\/span> train_test_split(X,y, test_size<span style=\"color: #333333;\">=<\/span><span style=\"color: #6600ee; font-weight: bold;\">0.3<\/span> , random_state <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">25<\/span>)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of X_train is: '<\/span>, 
X_train<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of X_test is: '<\/span>, X_test<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of Y_train is: '<\/span>, y_train<span style=\"color: #333333;\">.<\/span>shape)\r\n<span style=\"color: #008800; font-weight: bold;\">print<\/span>(<span style=\"background-color: #fff0f0;\">'Shape of y_test is: '<\/span>, y_test<span style=\"color: #333333;\">.<\/span>shape)\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"125\" class=\"wp-image-1717\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-1.png\" alt=\"Shape Description automatically generated with medium confidence\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-1.png 800w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-1-300x47.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/shape-description-automatically-generated-with-me-1-768x120.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/p>\n<p><strong>In\u00a0[28]:<\/strong><\/p>\n<p><em># Linear regression model<\/em><\/p>\n<p><a id=\"post-1673-kln-125\"><\/a> model = LinearRegression()<\/p>\n<p><a id=\"post-1673-kln-126\"><\/a> model.fit(X_train, y_train)<\/p>\n<p><strong>Out[28]:<\/strong><\/p>\n<p>LinearRegression()<\/p>\n<p><strong>In\u00a0[29]:<\/strong><\/p>\n<p><a id=\"post-1673-kln-127\"><\/a> Predictions = model.predict(X_test)<\/p>\n<p><a id=\"post-1673-kln-128\"><\/a> print_metrics(y_test, Predictions, 22)<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"802\" height=\"171\" class=\"wp-image-1719\" 
src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-1.png\" alt=\"Graphical user interface, text Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-1.png 802w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-1-300x64.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-1-768x164.png 768w\" sizes=\"auto, (max-width: 802px) 100vw, 802px\" \/><\/p>\n<p>This is even worse.<\/p>\n<p>Conclusion<\/p>\n<p>This model does not predict the CLV well, probably because the CLV data is highly skewed. To improve the prediction, we can try to bring the distribution of the CLV column closer to normal. I will try this below with two different methods: a log transformation and Box-Cox.<\/p>\n<p><strong>In\u00a0[30]:<\/strong><\/p>\n<p><em>#to see the CLV data as is (without having the extremes removed)<\/em><\/p>\n<p><a id=\"post-1673-kln-130\"><\/a> data.hist('Customer Lifetime Value', bins = 10)<\/p>\n<p><a id=\"post-1673-kln-131\"><\/a> plt.show()<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"381\" height=\"264\" class=\"wp-image-1680\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-7.png\" alt=\"Chart, histogram Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-7.png 381w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-7-300x208.png 300w\" sizes=\"auto, (max-width: 381px) 100vw, 381px\" \/><\/p>\n<p><strong>In\u00a0[31]:<\/strong><\/p>\n<p><em>#Check normality with the Shapiro-Wilk test; if p &lt; 0.05 the data is not normally distributed<\/em><\/p>\n<p><a id=\"post-1673-kln-133\"><\/a> clv = data['Customer Lifetime Value']<\/p>\n<p>from 
scipy.stats import <a id=\"post-1673-kln-134\"><\/a>shapiro<\/p>\n<p><a id=\"post-1673-kln-135\"><\/a> shapiro(clv)[1]<\/p>\n<p>\/opt\/conda\/lib\/python3.7\/site-packages\/scipy\/stats\/morestats.py:1676: UserWarning: p-value may not be accurate for N &gt; 5000.<\/p>\n<p>warnings.warn(&#8220;p-value may not be accurate for N &gt; 5000.&#8221;)<\/p>\n<p><strong>Out[31]:<\/strong><\/p>\n<p>0.0<\/p>\n<p><strong>In\u00a0[32]:<\/strong><\/p>\n<p><em>#as the CLV is not normally distributed, let&#8217;s first try the log transformation<\/em><\/p>\n<p><a id=\"post-1673-kln-137\"><\/a> log_clv = np.log(clv)<\/p>\n<p>import seaborn as <a id=\"post-1673-kln-138\"><\/a>sns<\/p>\n<p><a id=\"post-1673-kln-139\"><\/a> sns.distplot(log_clv)<\/p>\n<p><strong>Out[32]:<\/strong><\/p>\n<p>&lt;matplotlib.axes._subplots.AxesSubplot at 0x7ff0ec64ca90&gt;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"372\" height=\"262\" class=\"wp-image-1681\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-8.png\" alt=\"Chart, histogram Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-8.png 372w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-8-300x211.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-8-370x262.png 370w\" sizes=\"auto, (max-width: 372px) 100vw, 372px\" \/><\/p>\n<p><strong>In\u00a0[33]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#apply the Box-Cox transformation to the CLV; this defines boxcox_clv, which is used below<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">scipy.stats<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> boxcox\r\nboxcox_clv <span style=\"color: #333333;\">=<\/span> boxcox(clv)[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>] <span style=\"color: #888888;\">#[0] is the transformed data, [1] the fitted lambda<\/span>\r\nsns<span style=\"color: #333333;\">.<\/span>distplot(boxcox_clv)\r\n<\/pre>\n<\/div>\n<p><strong>Out[33]:<\/strong><\/p>\n<p>&lt;matplotlib.axes._subplots.AxesSubplot at 0x7ff0ec597ed0&gt;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"379\" height=\"248\" class=\"wp-image-1682\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-9.png\" alt=\"Chart, histogram Description automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-9.png 379w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-9-300x196.png 300w\" sizes=\"auto, (max-width: 379px) 100vw, 379px\" \/><\/p>\n<p>Box-Cox brings the distribution a bit closer to normal than the log transformation. 
Let&#8217;s try our linear regression now.<\/p>\n<p><strong>In\u00a0[34]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#I want to withhold 30 % of the dataset to perform the tests<\/span>\r\nX_train, X_test, y_train, y_test<span style=\"color: #333333;\">=<\/span> train_test_split(X_5,boxcox_clv, test_size<span style=\"color: #333333;\">=<\/span><span style=\"color: #6600ee; font-weight: bold;\">0.3<\/span> , random_state <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">25<\/span>)\r\n<\/pre>\n<\/div>\n<p><strong>In\u00a0[35]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\">model_5<span style=\"color: #333333;\">.<\/span>fit(X_train, y_train)\r\n<\/pre>\n<\/div>\n<p><strong>Out[35]:<\/strong><\/p>\n<p>LinearRegression()<\/p>\n<p><strong>In\u00a0[36]:<\/strong><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\">Predictions_box <span style=\"color: #333333;\">=<\/span> model_5<span style=\"color: #333333;\">.<\/span>predict(X_test)\r\nprint_metrics(y_test, Predictions_box, <span style=\"color: #0000dd; font-weight: bold;\">5<\/span>)\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"161\" class=\"wp-image-1722\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-2.png\" alt=\"Graphical user interface, text Description automatically generated\" 
srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-2.png 786w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-2-300x61.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-text-description-automa-2-768x157.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/p>\n<p>We can see a slight improvement, to 18.5%, but further feature engineering is needed to improve the result.<\/p>\n<p>Source: https:\/\/www.kaggle.com\/renatevankempen\/predicting-clv<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Predicting CLV (demo Python) Customer Lifetime Value prediction The problem In this notebook we look at the data we got via this\u00a0Kaggle dataset (CreditCard_dataset). It involves the car insurance customer lifetime value. Customer Lifetime Value Prediction( CLV ) value refers to net profit attributed to the entire future relationship with a customer. 
A bank will &hellip; <a href=\"https:\/\/mintea.blog\/?p=1673\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Predicting CLV (demo Python)<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":1674,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[32,63,62,55,56,26,54,64],"class_list":["post-1673","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bookmarked-articles","tag-analytic","tag-clv","tag-crm","tag-customer-analytic","tag-customer-lifecycle","tag-data","tag-data-mining","tag-python"],"_links":{"self":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1673","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1673"}],"version-history":[{"count":17,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1673\/revisions"}],"predecessor-version":[{"id":1726,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1673\/revisions\/1726"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/media\/1674"}],"wp:attachment":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1673"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1673"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1673"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}