Predict if a client will subscribe to a term deposit using decision trees

A term deposit is a deposit with a specified period of maturity and earns interest. It is a money deposit at a banking institution that cannot be withdrawn for a specific term or period of time (unless a penalty is paid).  [Source: Wikipedia]

Predicting if a client will subscribe to a term deposit can help increase the efficiency of a marketing campaign and help us understand the factors that influence a successful outcome (subscription) from a client.

In this blog post, I will be using the bank marketing data set from the UCI Machine Learning Repository that can be downloaded from here. Let’s get started by invoking the necessary packages in R.

#Call necessary libraries
> library(rpart)
> library(caret)
> library(AppliedPredictiveModeling)

Now, let’s read the input file in CSV format specifying the path to the file. We also specify the header and separators using the read.csv function.

#Read the input file
> getwd()
> setwd("/Users/<path-to-csv>")
> bank<-read.csv("bank.csv",header=TRUE, sep=";")

The first thing to do is to explore the dataset.

#Understand the structure of the dataset
> names(bank)
 [1] "age" "job" "marital" "education" "default" "balance" "housing" 
 [8] "loan" "contact" "day" "month" "duration" "campaign" "pdays" 
[15] "previous" "poutcome" "y" 
> nrow(bank)
[1] 4521
> ncol(bank)
[1] 17
> str(bank)
'data.frame': 4521 obs. of 17 variables: 
$ age      : int  30 33 35 30 59 35 36 39 41 43 ... 
$ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ... 
$ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ... 
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ... 
$ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 
$ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ... 
$ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ... 
$ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ... 
$ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ... 
$ day      : int  19 11 16 3 5 23 14 6 14 17 ... 
$ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ... 
$ duration : int  79 220 185 199 226 141 341 151 57 313 ... 
$ campaign : int  1 1 1 4 1 2 1 2 2 1 ... 
$ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ... 
$ previous : int  0 4 1 0 0 3 2 0 0 2 ... 
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ... 
$ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, the dataset has 17 variables and 4521 records. The target variable y is binary, recording whether the client subscribed to a term deposit (“yes” or “no”), so we will use a classification algorithm. Full descriptions of the input variables can be found on the UCI website.

On initial examination, all the input variables seem important to the client’s decision, but how far can human intuition take us? Let’s test it out with some exploratory visualisations.

> transparentTheme(trans = .4)
> featurePlot(x = bank[, c(1,6,12,13,14,15)], 
 y = bank$y, 
 plot = "pairs",
 auto.key = list(columns = 2))

[Figure: pairwise feature plot of age, balance, duration, campaign, pdays and previous, coloured by subscription outcome]

From the above plot we can see how some of the variables relate to each other. Age and balance appear related: the lower the age, the higher the balance tends to be. Campaign and age show a pattern as well, but how do all these variables affect the outcome? Looking at the same graph by colour (red being a “no” and blue being a “yes” to subscription), we can also see that younger people have more “no” outcomes than “yes”. We can also plot the target variable by age to understand the distribution.

> boxplot(bank$age ~ bank$y, ylab = "Age", xlab = "y")

[Figure: box plot of client age grouped by subscription outcome y]

Now let’s focus on the target variable.

> table(bank$y)

no yes 
4000 521

> table(bank$y)/nrow(bank)

no yes 
0.88476 0.11524

From the above we can see that the class variable is imbalanced: roughly 88% of clients did not subscribe. For the sake of simplicity we do not address this issue in this post, although one common option is sketched below for reference.
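One way to handle the imbalance would be to downsample the majority class, for example with caret's downSample() function. This is only a sketch and is not applied in the rest of this post.

#Sketch only: downsample the majority "no" class with caret (not used below)
> balanced <- downSample(x = bank[, names(bank) != "y"], y = bank$y, yname = "y")
> table(balanced$y)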

Now, let us begin the machine learning analysis. First we set the seed so that we get the same results for the trained model every time we run it. Then we split the dataset into training and test sets, allocating 60% of the data to the training set and the remaining 40% to the test set.

> set.seed(999)
> trainIndex <- createDataPartition(bank$y, p = .6, 
 list = FALSE, 
 times = 1)
> train <- bank[ trainIndex,]
> test <- bank[-trainIndex,]

We set up 10-fold cross-validation, repeated 5 times, to control how the model is trained. This reduces over-fitting on the training set and helps the model generalise to unseen data: the model is trained and evaluated several times on different subsets of the training data.

> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

The data is now ready for training. For the purpose of this blog we use a recursive partitioning and regression tree (CART) method.

Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. [Source: Wikipedia]

> fit <- train(y ~ ., data = train, 
 method = "rpart", 
 trControl = fitControl)

Let us now look at the results of the trained model.

> fit
CART

2713 samples
 16 predictor
 2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 2442, 2441, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.02236422  0.8951085  0.3912704
  0.05111821  0.8860340  0.2088233
  0.05271565  0.8850749  0.1745803

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02236422.
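It can also help to visualise the chosen tree. A minimal sketch, assuming the rpart.plot package is installed (caret stores the underlying rpart object in fit$finalModel):

#Sketch: visualise the final decision tree (assumes the rpart.plot package)
> library(rpart.plot)
> rpart.plot(fit$finalModel)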

The next step is to use the trained machine learning model to predict the test dataset.

> predictions <- predict(fit, test)

The results of our predictions can be viewed using a confusion matrix, a table of the actual versus predicted values, from which we compute the accuracy.

> conf.mat <- table(test$y,predictions)
> acc <- sum(diag(conf.mat))/sum(conf.mat)
> acc

[1] 0.8904867

Our model reaches an accuracy of about 89% on the test set, slightly above the majority-class baseline of roughly 88.5% that we saw earlier. An ROC plot can also illustrate our results better.
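As a rough sketch of how that could be done (assuming the pROC package is installed), we can ask the model for class probabilities and plot the resulting ROC curve:

#Sketch: ROC curve from predicted class probabilities (assumes the pROC package)
> library(pROC)
> probs <- predict(fit, test, type = "prob")
> roc.curve <- roc(response = test$y, predictor = probs$yes)
> plot(roc.curve)
> auc(roc.curve)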

Possible extensions to our project include fitting different models and comparing their performance against each other.
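For example, a random forest could be trained with the same cross-validation setup and its resampling results compared against the CART model. A sketch, assuming the randomForest package is available:

#Sketch: fit a second model and compare resampling results (assumes the randomForest package)
#(for a strictly paired comparison, the same resampling indices should be used for both models)
> fit.rf <- train(y ~ ., data = train, 
   method = "rf", 
   trControl = fitControl)
> resamps <- resamples(list(CART = fit, RF = fit.rf))
> summary(resamps)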

References: 

http://www.columbia.edu/~jc4133/ADA-Project.pdf

https://rpubs.com/nadi27oct/259406


Linear Regression in R

Linear regression is a fundamental method in regression analysis. It is a quick and easy way to understand how predictive algorithms work, both in general and in machine learning.

To give a simple overview of how the algorithm works: a linear regression fits a linear relationship for a given target variable (Y) using one or more explanatory variables (X).

For example, consider the age, height and weight of children. Intuitively we can say that these variables are correlated: as height and age increase, weight increases as well. A linear regression fits a model, or an equation, for the target variable assuming the relation between X and Y is linear. In many practical applications a linear regression model works effectively.

[Figure: scatter plot of data points with a fitted straight regression line in red]

The figure above shows an example of a linear regression model. The red line indicates the fitted regression of a target variable (Y axis) on one explanatory variable (X axis).

Let’s perform a linear regression analysis using the trees dataset in R.

First, let’s load the dataset into the R workspace and take a sneak peek at how the data looks.

> data(trees)
> head(trees)
  Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
> str(trees)
'data.frame': 31 obs. of 3 variables:
 $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  

The head() function gives us the top few rows of the dataset. We can see the dataset has 3 variables, namely Girth, Height and Volume. From the str() function we get a quick idea of what the data in each variable looks like. For a more detailed picture, use summary() to see the mean, median and quartiles of each variable.

Let us now do some exploratory analysis of these variables by plotting them out.

> plot(trees)

[Figure: pairs plot of Girth, Height and Volume]

We can see from the above plot that Volume and Girth have a roughly linear relationship. Let’s use that knowledge to build a linear model that predicts the volume of a tree from its girth.

> lmod<-lm(Volume~Girth, data = trees)
> summary(lmod)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
 Min 1Q Median 3Q Max 
-8.065 -3.107 0.152 3.495 9.587 

Coefficients:
 Estimate Std. Error t value Pr(>|t|) 
(Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
Girth 5.0659 0.2474 20.48 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331 
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16

The lm() function fits a linear model; the call shows the target and explanatory variables the model uses. Using summary() we can get the details of the fitted model.

The model has an intercept of -36.9435 and a slope of 5.0659. The intercept indicates the point where the fitted line crosses the Y axis, and the slope is the steepness of the line. So for our linear equation Y = a + bX, the model gives a = -36.9435 and b = 5.0659, where Y is Volume and X is Girth.
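To make this concrete, for a hypothetical tree with a girth of 15 the fitted equation gives a volume of about -36.9435 + 5.0659 × 15 ≈ 39.0, which we can check with predict():

#Check the fitted equation for an illustrative girth of 15
> coef(lmod)
> predict(lmod, newdata = data.frame(Girth = 15))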

From the significance codes we can see that Girth is statistically significant in our model. We can plot the model for a better understanding.

[Figure: scatter plot of Volume against Girth with the fitted regression line in red]
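A plot like the one above can be reproduced with base R; a minimal sketch (the exact styling of the original figure may differ):

#Sketch: plot the data and overlay the fitted regression line
> plot(trees$Girth, trees$Volume, xlab = "Girth", ylab = "Volume")
> abline(lmod, col = "red")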

That’s it! We have built our first linear model. The red line shows that we can draw a line that matches most of our data points. Some points lie further from the line and some are quite close; the distance between the predicted and actual values is the residual. In the following blogs we will see how we can improve the model fit further.

Data Visualisation: Open Sourcing Mental Illness

This viz won the Tableau Viz of the Day on 31/03/2017 with over 2000 views. It also won third place in the worldwide #DataForACause competition.


Data For a Cause is an exciting challenge in which participants contribute their data science and visualisation skills for a good cause. A not-for-profit organisation brings a social issue and a relevant data set, and volunteers analyse the data for a week to come up with interesting insights or visualisation pieces. You can see their website here.

This time I had the chance to contribute to Open Sourcing Mental Illness (OSMI), using survey data they had collected about mental health in tech jobs. I created a data visualisation to raise awareness of this issue among tech organisations. The most significant finding was that, of all the respondents who had mental health issues, nearly half did not seek treatment. My main goal was to make people aware of this issue and help them reach out to OSMI for support.

The interactive version is here

[Figure: static image of the OSMI visualisation]