Predict if a client will subscribe to a term deposit using decision trees

A term deposit is a deposit with a specified period of maturity and earns interest. It is a money deposit at a banking institution that cannot be withdrawn for a specific term or period of time (unless a penalty is paid).  [Source: Wikipedia]

Predicting if a client will subscribe to a term deposit can help increase the efficiency of a marketing campaign and help us understand the factors that influence a successful outcome (subscription) from a client.

In this blog post, I will be using the bank marketing data set from the UCI Machine Learning Repository. Let’s get started by loading the necessary packages in R.

#Call necessary libraries
> library(rpart)
> library(caret)
> library(AppliedPredictiveModeling)

Now, let’s read the input file in CSV format, specifying the path to the file. We also specify the header and the semicolon separator in the read.csv call.

#Read the input file
> getwd()
> setwd("/Users/<path-to-csv>")
> bank <- read.csv("bank.csv", header = TRUE, sep = ";")

The first thing to do is to explore the dataset.

#Understand the structure of the dataset
> names(bank)
 [1] "age" "job" "marital" "education" "default" "balance" "housing" 
 [8] "loan" "contact" "day" "month" "duration" "campaign" "pdays" 
[15] "previous" "poutcome" "y" 
> nrow(bank)
[1] 4521
> ncol(bank)
[1] 17
> str(bank)
'data.frame': 4521 obs. of  17 variables: 
$ age      : int  30 33 35 30 59 35 36 39 41 43 ... 
$ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ... 
$ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ... 
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ... 
$ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 
$ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ... 
$ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ... 
$ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ... 
$ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ... 
$ day      : int  19 11 16 3 5 23 14 6 14 17 ... 
$ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ... 
$ duration : int  79 220 185 199 226 141 341 151 57 313 ... 
$ campaign : int  1 1 1 4 1 2 1 2 2 1 ... 
$ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ... 
$ previous : int  0 4 1 0 0 3 2 0 0 2 ... 
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ... 
$ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, the dataset has 17 variables and 4521 records. The target variable y is a binary outcome for the client’s subscription status (“yes” or “no”), so we will use a classification algorithm. Details of all the input variables can be found on the UCI website.

On initial examination, all the input variables seem important to the client’s decision, but how far can human intuition take us? Let’s test it with some initial exploratory visualisations.

> transparentTheme(trans = .4)
> featurePlot(x = bank[, c(1,6,12,13,14,15)],
              y = bank$y,
              plot = "pairs",
              auto.key = list(columns = 2))

[Figure: pairs plot of age, balance, duration, campaign, pdays and previous, coloured by the outcome y]

From the plot above we can see how some variables relate to one another. Age and balance appear related: lower ages tend to go with higher balances. Campaign and age show a pattern as well. But how do all these variables affect the outcome? Looking at the same plot colour-wise (red for a “no” and blue for a “yes” to subscription), we can also see that younger people have more “no” outcomes than “yes”. We can also plot the target variable by age to understand the distribution.

> boxplot(bank$age ~ bank$y, ylab = "Age", xlab = "y")

[Figure: boxplot of age by subscription outcome y]

Now let’s focus on the target variable.

> table(bank$y)

no yes 
4000 521

> table(bank$y)/nrow(bank)

no yes 
0.88476 0.11524

From the above we can also see that the target classes are imbalanced: roughly 88% of clients said “no” and only 12% said “yes”. For the sake of simplicity in this blog, we do not address this issue here.
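Although we leave the imbalance unaddressed, a minimal sketch of one common remedy, down-sampling the majority class with caret’s downSample function (not used in the rest of this post), would look like this:

```r
# Sketch only: balance the classes by down-sampling the majority class.
# Column 17 is the target y; the remaining columns are predictors.
balanced <- downSample(x = bank[, -17],
                       y = bank$y,
                       yname = "y")
table(balanced$y)  # both classes now match the minority class size
```

Down-sampling throws away data, so alternatives such as up-sampling or class weights may be preferable on small datasets.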

Now, let us begin the machine learning analysis. First, we set a seed value so that we get the same results every time we train the model. Then we split the dataset into training and test sets, allocating 60% of the data to the training set and the remaining 40% to the test set.

> set.seed(999)
> trainIndex <- createDataPartition(bank$y, p = .6, 
 list = FALSE, 
 times = 1)
> train <- bank[ trainIndex,]
> test <- bank[-trainIndex,]

We use 10-fold cross-validation, repeated 5 times, to train the model. This reduces over-fitting to the training set and helps the model generalise to unseen data: the model is trained and evaluated several times on different subsets of the training data.

> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

The data is now ready for training. For the purpose of this blog, we use rpart, a recursive partitioning and regression tree method.

Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. [Source: Wikipedia]

> fit <- train(y ~ ., data = train, 
 method = "rpart", 
 trControl = fitControl)

Let us now look at the results of the trained model.

> fit
CART

2713 samples
 16 predictor
 2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 2442, 2441, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

cp Accuracy Kappa 
 0.02236422 0.8951085 0.3912704
 0.05111821 0.8860340 0.2088233
 0.05271565 0.8850749 0.1745803

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02236422.
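Before predicting, it can also help to look at the tree itself. A quick sketch, assuming the rpart.plot package is installed (caret stores the fitted rpart object in fit$finalModel):

```r
# Sketch: visualise the final decision tree selected by caret
library(rpart.plot)
rpart.plot(fit$finalModel)
```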

The next step is to use the trained machine learning model to predict the test dataset.

> predictions <- predict(fit, test)

The results of our predictions can be viewed using a confusion matrix, a table of the actual versus predicted values.

> conf.mat <- table(test$y,predictions)
> acc <- sum(diag(conf.mat))/sum(conf.mat)
> acc

[1] 0.8904867
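For a richer summary than raw accuracy, caret’s confusionMatrix function reports the same table along with per-class statistics such as sensitivity and specificity, which are worth watching with imbalanced classes:

```r
# Same predictions, fuller summary (accuracy, kappa, sensitivity, specificity, ...)
confusionMatrix(data = predictions, reference = test$y, positive = "yes")
```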

Our model reaches an accuracy of about 89%. Keep in mind that, with these imbalanced classes, always predicting “no” would already score roughly 88%, so accuracy alone flatters the model; an ROC plot can illustrate our results better.
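As a sketch, such an ROC plot could be produced with the pROC package (assumed installed), using the model’s class probabilities rather than hard class predictions:

```r
# Sketch: ROC curve from predicted probabilities using the pROC package
library(pROC)
probs <- predict(fit, test, type = "prob")        # columns "no" and "yes"
roc.obj <- roc(response = test$y, predictor = probs$yes)
plot(roc.obj)
auc(roc.obj)  # area under the ROC curve
```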

Possible extensions to our project include fitting different models and comparing individual accuracies against each other.
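For instance, a second model could be trained with the same fitControl and compared on the same resamples via caret’s resamples function; the sketch below uses a random forest (method = "rf", which needs the randomForest package installed):

```r
# Sketch: train a second model and compare it with the decision tree
fit.rf <- train(y ~ ., data = train,
                method = "rf",
                trControl = fitControl)
results <- resamples(list(CART = fit, RF = fit.rf))
summary(results)  # accuracy and kappa distributions for both models
```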

