Predictive modelling with caret

The Iris dataset was used in Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. It includes three iris species with 50 samples each, along with four measurements for each flower (sepal and petal length and width). In this blog, I will use the caret package in R to predict the species class of various Iris flowers. Let’s jump into the code.

Calling/Invoking all the necessary packages in R

#Calling libraries
 library(AppliedPredictiveModeling)
 library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
 ## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
 ## 
 ##     cov, smooth, var

The Iris dataframe is already included in R, and we can attach it using the data() command. A quick summary() and head() also give us a nice introduction to the dataset. We can also set a seed value to produce reproducible results.

#Prepare the dataset
 data(iris)
 summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
 ##        Species  
 ##  setosa    :50  
 ##  versicolor:50  
 ##  virginica :50  
 ##                 
 ##                 
 ##

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 ## 1          5.1         3.5          1.4         0.2  setosa
 ## 2          4.9         3.0          1.4         0.2  setosa
 ## 3          4.7         3.2          1.3         0.2  setosa
 ## 4          4.6         3.1          1.5         0.2  setosa
 ## 5          5.0         3.6          1.4         0.2  setosa
 ## 6          5.4         3.9          1.7         0.4  setosa

nrow(iris)

## [1] 150

set.seed(9999)

The featurePlot command from the caret package shows us how the Species values are distributed in a pairwise plot. One observation from this plot is that Versicolor and Virginica have similar patterns while Setosa is quite distinct. The intuition is that we can predict Setosa easily but might have some difficulty separating Versicolor from Virginica.

#Visualising the dataset
 transparentTheme(trans = .4)
 featurePlot(x = iris[, 1:4], 
             y = iris$Species, 
             plot = "ellipse",
             auto.key = list(columns = 3))

[Figure: pairwise ellipse plot of the four iris features, grouped by species]

Now we split the dataset into train and test partitions, using 80% of the data for training and the remaining 20% for testing. We also define a 10-fold cross-validation method, repeated 5 times. This process decreases over-fitting in the training set and helps the model work on an unknown or new dataset. The model will be tested and trained several times on subsets of the training data to increase the accuracy in the test data.

#Split into train and test dataset
 trainIndex <- createDataPartition(iris$Species, p = .8,
 list = FALSE,
 times = 1)
 train <- iris[ trainIndex,]
 test  <- iris[-trainIndex,]

nrow(train)

## [1] 120

nrow(test)

## [1] 30

#Cross validation
 fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

This is a classification problem, since the output variable is a class label. Many machine learning algorithms handle class prediction; we implement three of them here and compare them to find the best fit.

First we fit a decision tree model (rpart) to the train dataset. A decision tree is a model with an “if this, then that” approach, and it is easy and fast to interpret by visualising the tree.

We also preprocess (centre and scale) the dataset so that variables with very different ranges of values do not dominate the outcome.

dt.fit <- train(Species ~ ., data = train,
 method = "rpart",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: rpart

dt.fit

## CART
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   cp    Accuracy   Kappa
 ##   0.00  0.9550000  0.9325
 ##   0.45  0.7683333  0.6525
 ##   0.50  0.3333333  0.0000
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was cp = 0.
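As noted above, one appeal of a decision tree is that it can be inspected visually. Here is a minimal sketch using base graphics, fitting rpart directly on iris (separate from the caret object above, so it is self-contained):

```r
library(rpart)

# Fit a small classification tree on the iris data and draw it.
tree <- rpart(Species ~ ., data = iris, method = "class")
plot(tree, margin = 0.1)   # draw the tree skeleton
text(tree, use.n = TRUE)   # label splits and per-leaf class counts
```

The first split the tree finds is on the petal measurements, which matches the intuition from the feature plot that Setosa separates cleanly.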

Accuracy is the percentage of correctly classified instances out of all instances. From the above results we can see that the model has a good accuracy value. Kappa is similar to accuracy, but it is normalised against the agreement expected by chance, which makes it more useful for class-imbalanced problems. cp is the complexity parameter, which controls the decision tree’s size and is used to choose the optimal tree size. We can see that the model has chosen the tree size with the best accuracy.
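To make accuracy and kappa concrete, here is the arithmetic done by hand in base R on a small 3-class confusion matrix (the numbers mirror the test-set matrix reported further down):

```r
# Accuracy and Cohen's kappa computed by hand from a confusion matrix
# (rows = predicted, columns = actual).
cm <- matrix(c(10,  0, 0,
                0, 10, 2,
                0,  0, 8), nrow = 3, byrow = TRUE)
n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement by chance
kappa    <- (accuracy - expected) / (1 - expected)
round(c(accuracy = accuracy, kappa = kappa), 4)
# -> accuracy 0.9333, kappa 0.9
```

Because the chance-agreement term appears in the denominator, kappa stays informative even when one class dominates and raw accuracy is inflated.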

Next, we predict the test dataset using the trained model. A confusion matrix is used to understand the performance of the model: it is a table-wise comparison of actual and predicted values. A variable importance plot shows that petal width is the most important variable for predicting the Species class.

predictions <- predict(dt.fit, test)

confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

plot(varImp(dt.fit))

[Figure: variable importance plot for the decision tree; Petal.Width ranks highest]

The decision tree predicts the test set with an accuracy of 93%.

The second algorithm we use is k-nearest neighbours (KNN). This is a great method for classifying the iris dataset. In simple words, it classifies each test point by looking at the classes of its nearest training points. k is the number of neighbours consulted for each prediction, and caret chooses the best k value based on accuracy.
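To make the idea concrete, here is a minimal nearest-neighbour vote in base R (an illustrative sketch, not the implementation caret uses; knn_one is a hypothetical helper name):

```r
# Classify one flower by majority vote among its k closest
# (Euclidean) training points.
knn_one <- function(train_x, train_y, new_x, k = 7) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # distance to each row
  votes <- train_y[order(d)[1:k]]                 # labels of k nearest
  names(which.max(table(votes)))                  # majority class
}

knn_one(as.matrix(iris[, 1:4]), iris$Species, c(5.0, 3.5, 1.4, 0.2))
# -> "setosa"
```

The caret version below does the same thing at scale, with centring/scaling applied first so that no single measurement dominates the distance.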

knn.fit <- train(Species ~ ., data = train,
 method = "knn",
 trControl = fitControl,
 preProcess=c("center", "scale"))

knn.fit

## k-Nearest Neighbors
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   k  Accuracy  Kappa
 ##   5  0.955     0.9325
 ##   7  0.955     0.9325
 ##   9  0.945     0.9175
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was k = 7.

predictions <- predict(knn.fit, test)

confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

plot(varImp(knn.fit))

[Figure: variable importance plot for the KNN model]

KNN also predicts the test set with an accuracy of 93%.

The final method we use is random forest. This method aggregates the results of many decision trees, which reduces the error that any individual tree would make. The mtry value is the number of variables randomly sampled as split candidates at each tree node. Here again the optimal model is selected based on accuracy.

rf.fit <- train(Species ~ ., data = train,
 method = "rf",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

##
 ## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
 ##
 ##     margin

rf.fit

## Random Forest
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   mtry  Accuracy   Kappa
 ##   2     0.9533333  0.9300
 ##   3     0.9616667  0.9425
 ##   4     0.9616667  0.9425
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was mtry = 3.
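Above, caret chose mtry from its default grid. If you want to control the candidate values yourself, you can pass an explicit tuneGrid (a sketch reusing the train data and fitControl defined earlier; rf.grid and rf.fit2 are illustrative names):

```r
# Try every mtry from 2 to 4 explicitly instead of caret's default grid.
rf.grid <- expand.grid(mtry = 2:4)
rf.fit2 <- train(Species ~ ., data = train,
                 method    = "rf",
                 trControl = fitControl,
                 tuneGrid  = rf.grid)
rf.fit2$bestTune   # the mtry value with the best resampled accuracy
```

With only four predictors the grid is tiny, but on wider datasets an explicit grid keeps the tuning budget predictable.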

predictions <- predict(rf.fit, test)

confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

plot(varImp(rf.fit))

[Figure: variable importance plot for the random forest model]

The random forest also predicts the test set with an accuracy of 93%.

From the above analysis we can see that all models have performed very well, and therefore we can use the predictions from any of them.


Customer Segmentation using K-Means Clustering

Market segmentation is the process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers (known as segments) based on some type of shared characteristics. In dividing or segmenting markets, researchers typically look for common characteristics such as shared needs, common interests, similar lifestyles or even similar demographic profiles. The overall aim of segmentation is to identify high yield segments – that is, those segments that are likely to be the most profitable or that have growth potential – so that these can be selected for special attention (i.e. become target markets). [Source: Wikipedia]

First, a little introduction to the K-Means algorithm that I am going to implement here. The following steps describe how the method works.

1. k initial “means” are randomly generated within the data domain.

2. k clusters are created by associating every observation with the nearest mean.

3. The centroid of each of the k clusters becomes the new mean.

4. Steps 2 and 3 are repeated until convergence has been reached.

[Source: Wikipedia]
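The four steps above can be sketched as a bare-bones k-means in base R (illustrative only; the built-in stats::kmeans used later is the real implementation):

```r
kmeans_sketch <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centres <- x[sample(nrow(x), k), , drop = FALSE]  # step 1: k random means
  cluster <- integer(nrow(x))
  for (it in seq_len(iters)) {
    # step 2: assign every observation to its nearest mean
    for (i in seq_len(nrow(x)))
      cluster[i] <- which.min(colSums((t(centres) - x[i, ])^2))
    # step 3: each cluster's centroid becomes the new mean
    new_centres <- centres
    for (j in seq_len(k)) {
      pts <- x[cluster == j, , drop = FALSE]
      if (nrow(pts) > 0) new_centres[j, ] <- colMeans(pts)
    }
    # step 4: repeat until the means stop moving
    if (all(abs(new_centres - centres) < 1e-8)) break
    centres <- new_centres
  }
  list(cluster = cluster, centers = centres)
}

set.seed(1)
res <- kmeans_sketch(iris[, 1:4], k = 3)
table(res$cluster)
```

Note that because step 1 is random, different seeds can converge to different local optima, which is why stats::kmeans offers an nstart argument to rerun from several random starts.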

In this blog I implement a K-Means clustering algorithm on a dataset from the trusted UCI Machine Learning Repository. The Wholesale Customers dataset consists of 8 attributes and 440 instances. It can be downloaded from here.

Attribute Information:

1) FRESH: annual spending (m.u.) on fresh products (Continuous);
2) MILK: annual spending (m.u.) on milk products (Continuous);
3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
7) CHANNEL: customers’ channel – Horeca (Hotel/Restaurant/Café) or Retail (Nominal);
8) REGION: customers’ region – Lisbon, Oporto or Other (Nominal)
Descriptive Statistics:

(Minimum, Maximum, Mean, Std. Deviation)
FRESH ( 3, 112151, 12000.30, 12647.329)
MILK (55, 73498, 5796.27, 7380.377)
GROCERY (3, 92780, 7951.28, 9503.163)
FROZEN (25, 60869, 3071.93, 4854.673)
DETERGENTS_PAPER (3, 40827, 2881.49, 4767.854)
DELICATESSEN (3, 47943, 1524.87, 2820.106)

REGION Frequency
Lisbon 77
Oporto 47
Other Region 316
Total 440

CHANNEL Frequency
Horeca 298
Retail 142
Total 440

Now let us jump straight into the code.

#Call necessary libraries
library(corrplot)
#Read the input file
getwd()
setwd("<path to file>")
wholecust<-read.csv("Wholesalecustomersdata.csv",header=TRUE, sep=",")

Now, let’s understand the dataset by taking a peek at the data.

> head(wholecust)
  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451

We can also get a useful summary from the dataset

> summary(wholecust)
    Channel          Region          Fresh             Milk          Grocery     
 Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55   Min.   :    3  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153  
 Median :1.000   Median :3.000   Median :  8504   Median : 3627   Median : 4756  
 Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796   Mean   : 7951  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656  
 Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498   Max.   :92780  
     Frozen        Detergents_Paper    Delicassen     
 Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
 1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
 Median : 1526.0   Median :  816.5   Median :  965.5  
 Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
 3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
 Max.   :60869.0   Max.   :40827.0   Max.   :47943.0  

Let us now explore the dataset more and find out strong correlations among variables.

#Identify strong correlations
>w <- cor(wholecust)
>corrplot(w, method="number")

[Figure: correlation matrix plot of the wholesale customer variables]

From the above diagram we can see that there is a strong correlation between Detergents_Paper and Grocery.

Now we need to find the optimal number of clusters k. Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. [Source: Wikipedia]

>set.seed(123)
>k.max <- 15
>data <- as.matrix(scale(wholecust[,(3:8)]))

>wss <- sapply(1:k.max, 
 function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})

>plot(1:k.max, wss,
 type="b", pch = 19, frame = FALSE, 
 xlab="Number of clusters",
 ylab="Sum of squares")

[Figure: elbow plot of within-cluster sum of squares against the number of clusters]

From the above plot we can see that 3 or 5 is the optimal number of clusters, since beyond these values the curve flattens out.
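The elbow is ultimately a judgement call. A complementary check is the average silhouette width, which peaks at the best-separated k (a sketch assuming the cluster package, which ships with R; wholecust as loaded above):

```r
library(cluster)  # provides silhouette()

# Average silhouette width for k = 2..6 on the scaled spending columns.
sdata <- scale(wholecust[, 3:8])
sil <- sapply(2:6, function(k) {
  km <- kmeans(sdata, k, nstart = 25)
  mean(silhouette(km$cluster, dist(sdata))[, "sil_width"])
})
names(sil) <- 2:6
sil   # larger is better
```

If the silhouette and elbow methods disagree, that is itself useful information: the clusters are probably not strongly separated at any k.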

Let us plot out the clusters for both the scenarios

>kmm <- kmeans(wholecust[,(3:8)], 3)
>kmm


K-means clustering with 3 clusters of sizes 60, 330, 50

Cluster means:
 Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
2 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
3 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020

Clustering vector:
 [1] 2 2 2 2 1 2 2 2 2 3 2 2 1 2 1 2 2 2 2 2 2 2 1 3 1 2 2 2 3 1 2 2 2 1 2 2 1 2 3 1 1 2
 [43] 2 3 2 3 3 3 2 3 2 2 1 2 1 2 3 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
 [85] 2 3 3 1 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 1 1
[127] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 3 2 2 2 1 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 3 2 2
[169] 2 2 2 3 2 3 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 3 1 2 2 3 2 2 2 3
[211] 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 3
[253] 2 1 2 1 2 2 1 1 2 2 1 2 2 3 3 2 3 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2
[295] 2 2 2 2 2 2 2 3 2 2 3 2 3 2 2 3 2 1 3 2 2 2 2 2 2 3 2 2 2 2 1 1 2 2 2 2 2 3 2 3 2 1
[337] 2 2 2 2 2 2 2 3 2 2 2 1 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1
[379] 2 2 1 2 1 2 3 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 1 3 2 2 2 2 2 2 2 2 2 2 3 2
[421] 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 1 3 2 2

Within cluster sum of squares by cluster:
[1] 25765310312 28184318853 26382784678
 (between_SS / total_SS = 49.0 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

Now let’s implement K means for k=5

> kmm <- kmeans(wholecust[,(3:8)], 5)
> kmm
K-means clustering with 5 clusters of sizes 223, 23, 104, 80, 10

Cluster means:
 Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 6052.812 3266.314 4092.054 2459.682 1214.1300 990.6099
2 49296.087 4983.783 5590.304 8285.783 962.2609 2543.6957
3 21200.058 3886.423 5138.933 4119.856 1131.5192 1690.3365
4 4738.762 11609.013 18198.775 1515.388 8003.7750 1603.8000
5 21263.700 37443.300 46710.600 6287.200 21699.4000 8743.3000

Clustering vector:
 [1] 1 1 1 1 3 1 1 1 1 4 4 1 3 3 3 1 4 1 3 1 3 1 3 5 3 3 1 3 4 2 3 1 3 3 1 1 3 4 4 2 3 3
 [43] 4 4 1 4 4 5 1 4 1 1 2 4 3 1 4 4 1 1 1 5 1 4 1 5 1 3 1 1 3 3 1 3 1 3 1 4 1 1 1 4 4 3
 [85] 1 5 5 2 1 3 1 1 5 3 4 1 1 1 1 1 4 4 1 2 3 3 4 4 1 4 1 4 3 3 3 1 1 1 3 1 3 1 1 1 2 2
[127] 3 3 1 2 1 1 3 1 1 1 1 1 1 1 3 3 2 1 3 4 1 1 1 3 3 1 3 1 1 4 4 3 1 4 1 1 3 4 1 4 1 1
[169] 1 1 4 4 1 4 1 4 2 1 1 1 1 2 4 5 1 1 1 1 4 4 3 1 1 4 1 3 3 1 1 1 4 4 3 1 1 4 1 1 1 4
[211] 3 5 1 1 4 4 4 3 4 1 3 1 1 1 1 1 3 1 1 1 1 1 3 1 3 1 1 3 1 2 3 3 3 1 1 4 1 1 3 1 1 4
[253] 1 3 1 3 1 1 2 2 1 1 3 1 4 4 4 3 4 3 1 1 1 2 1 1 3 1 1 3 1 1 2 3 2 2 1 3 3 2 1 1 1 4
[295] 3 1 3 1 1 1 3 4 1 4 4 4 4 3 1 4 1 3 4 1 1 4 1 1 1 4 1 1 3 1 3 2 1 1 3 1 1 4 3 5 3 3
[337] 1 1 1 1 1 1 1 4 1 1 4 3 1 4 1 4 1 4 3 1 3 4 1 1 3 1 1 1 1 1 1 1 3 1 2 3 1 3 1 1 4 2
[379] 1 1 3 3 3 1 4 1 1 3 1 1 1 1 1 3 1 1 4 1 1 1 1 3 3 3 3 1 3 4 1 1 1 1 1 1 1 1 4 1 4 1
[421] 4 3 3 3 3 1 4 3 1 1 4 1 3 1 3 3 2 4 1 1

Within cluster sum of squares by cluster:
[1] 10060038988 11679101316 8521349738 8835879467 14108802241
 (between_SS / total_SS = 66.2 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

Let us closely look at the cluster means of both scenarios:

Scenario 1: k = 3

[Table: cluster means for k = 3, as printed above]

  • Cluster 1 – highest fresh-products.
  • Cluster 2 – low spenders.
  • Cluster 3 – highest milk, grocery, detergents_papers spenders.

Scenario 2: k = 5

[Table: cluster means for k = 5, as printed above]

  • Cluster 1 – low spenders
  • Cluster 2 – highest Fresh spenders
  • Cluster 3 – mediocre spenders
  • Cluster 4 – low spenders
  • Cluster 5 – mediocre Fresh, highest milk, Grocery, detergents_papers

From the above two analyses we can see that three clusters serve as a good baseline for quickly understanding the customer segmentation.

A possible extension to the project is to use Principal Component Analysis for dimensionality reduction.
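A sketch of that extension with base R’s prcomp, clustering on the first two principal components of the spending columns (wholecust as loaded above; kmm.pca is an illustrative name):

```r
# Project the six spending variables onto their principal components.
pca <- prcomp(wholecust[, 3:8], center = TRUE, scale. = TRUE)
summary(pca)              # proportion of variance per component

# Cluster on the first two component scores instead of the raw spend.
reduced <- pca$x[, 1:2]
kmm.pca <- kmeans(reduced, 3, nstart = 25)
table(kmm.pca$cluster)
```

Working in two components also makes the clusters easy to plot, at the cost of whatever variance the discarded components carried.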

Predict if a client will subscribe to a term deposit using decision trees

A term deposit is a deposit with a specified period of maturity and earns interest. It is a money deposit at a banking institution that cannot be withdrawn for a specific term or period of time (unless a penalty is paid).  [Source: Wikipedia]

Predicting if a client will subscribe to a term deposit can help increase the efficiency of a marketing campaign and help us understand the factors that influence a successful outcome (subscription) from a client.

In this blog post, I will be using the bank marketing data set from the UCI Machine Learning Repository that can be downloaded from here. Let’s get started by invoking the necessary packages in R.

#Call necessary libraries
> library(rpart)
> library(caret)
> library(AppliedPredictiveModeling)

Now, let’s read the input file in CSV format specifying the path to the file. We also specify the header and separators using the read.csv function.

#Read the input file
> getwd()
> setwd("/Users/<path-to-csv>")
> bank<-read.csv("bank.csv",header=TRUE, sep=";")

The first thing to do is to explore the dataset

#Understand the structure of the dataset
> names(bank)
 [1] "age" "job" "marital" "education" "default" "balance" "housing" 
 [8] "loan" "contact" "day" "month" "duration" "campaign" "pdays" 
[15] "previous" "poutcome" "y" 
> nrow(bank)
[1] 4521
> ncol(bank)
[1] 17
> str(bank)
'data.frame': 4521 obs. of  17 variables:
$ age      : int  30 33 35 30 59 35 36 39 41 43 ... 
$ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ... 
$ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ... 
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ... 
$ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 
$ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ... 
$ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ... 
$ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ... 
$ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ... 
$ day      : int  19 11 16 3 5 23 14 6 14 17 ... 
$ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ... 
$ duration : int  79 220 185 199 226 141 341 151 57 313 ... 
$ campaign : int  1 1 1 4 1 2 1 2 2 1 ... 
$ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ... 
$ previous : int  0 4 1 0 0 3 2 0 0 2 ... 
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ... 
$ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, the dataset has 17 variables and 4521 records. The target variable y has a binary outcome for the subscription status of the client (“Yes” or “No”), so we will use a classification algorithm. All other details of the input variables can be found on the UCI website.

On initial examination, all the input variables seem important to the client’s decision, but how far can human intuition take us? Let’s test it with some initial exploratory visualisations.

> transparentTheme(trans = .4)
> featurePlot(x = bank[, c(1,6,12,13,14,15)], 
 y = bank$y, 
 plot = "pairs",
 auto.key = list(columns = 2))

[Figure: pairwise feature plot of the selected bank variables, coloured by outcome]

From the above plot we can see how some variables relate to one another. Age and balance appear related: the lower the age, the higher the balance. Campaign and age show a pattern as well, but how do all these variables affect the outcome? Looking at the same graph colour-wise (red being a “no” and blue a “yes” to subscription), we can also see that younger people have more “no” outcomes than “yes”. We can also plot the target variable by age to understand the distribution.

>boxplot(bank$age~bank$y, ylab="Age",xlab="y")

[Figure: boxplot of age by subscription outcome]

Now let’s focus on the target variable.

> table(bank$y)

no yes 
4000 521

> table(bank$y)/nrow(bank)

no yes 
0.88476 0.11524

From the above we can also see that the class variable is heavily unbalanced, with about 88.5% “no” outcomes. For the sake of simplicity in this blog, we do not address this issue here.
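For completeness, one common remedy can be sketched with caret’s downSample helper, which samples the majority class down to the minority count (the balanced copy is not used further in this post; column 17 is the target y, per str above):

```r
# Balance the classes by down-sampling the majority "no" class.
down <- downSample(x = bank[, -17], y = bank$y, yname = "y")
table(down$y)   # both classes now at the minority count
```

The trade-off is that down-sampling discards data; up-sampling or class weights are alternatives when the dataset is small.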

Now, let us begin the machine learning analysis. First we set the seed value so that we get the same results for the trained model every time we run it. Then, we split the dataset into training and test sets, allocating 60% of the data to the training set and the remaining 40% to the test set.

> set.seed(999)
> trainIndex <- createDataPartition(bank$y, p = .6, 
 list = FALSE, 
 times = 1)
> train <- bank[ trainIndex,]
> test <- bank[-trainIndex,]

We create a 10 fold cross-validation method to help train the dataset. It will be repeated 5 times. This process decreases over-fitting in the training set and helps the model work on an unknown or new dataset. The model will be tested and trained several times on subsets of the training data to increase the accuracy in the test data.

> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

The data is now ready to be trained. We are using a recursive partitioning and regression tree method for the purpose of this blog.

Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. [Source: Wikipedia]

> fit <- train(y ~ ., data = train, 
 method = "rpart", 
 trControl = fitControl)

Let us now look at the results of the trained model.

> fit
CART

2713 samples
 16 predictor
 2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 2442, 2441, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

cp Accuracy Kappa 
 0.02236422 0.8951085 0.3912704
 0.05111821 0.8860340 0.2088233
 0.05271565 0.8850749 0.1745803

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02236422.

The next step is to use the trained machine learning model to predict the test dataset.

> predictions <- predict(fit, test)

The results of our predictions can be viewed using a confusion matrix – which is a table of the actual and predicted values.

> conf.mat <- table(test$y,predictions)
> acc <- sum(diag(conf.mat))/sum(conf.mat)
> acc

[1] 0.8904867

Our model reaches an accuracy of 89%. That said, roughly 88.5% of clients said “no”, so a model that always predicts “no” would score almost as well; accuracy alone is a weak yardstick on this unbalanced data. An ROC plot can illustrate the results better.
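A sketch of that ROC plot with the pROC package (fit and test as defined above; for a binary outcome, caret models can return class probabilities with type = "prob"):

```r
library(pROC)

# Probability of "yes" for each test client, then the ROC curve and AUC.
probs   <- predict(fit, test, type = "prob")
roc.obj <- roc(response = test$y, predictor = probs$yes)
plot(roc.obj)
auc(roc.obj)
```

Unlike accuracy, the AUC summarises performance across every classification threshold, which is exactly what we want on unbalanced data.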

Possible extensions to our project include fitting different models and comparing individual accuracies against each other.


Social Network Sentiment Analysis with twitteR

Public sentiment analysis of a trend or event has proven useful in many ways. When making important business decisions, running a marketing campaign or introducing a new product, it is beneficial to know the public’s emotion on social media. In this blog, I’ve outlined the methodology I used to analyse public sentiment for the general election trend on Twitter, using the “twitteR” package in the R programming language.

The first thing to do is to extract the tweets. You need to create a Twitter Dev App at https://apps.twitter.com/, fill in a few details, and have your mobile phone number registered with your Twitter account. Once the app is created, make a note of the consumer key and consumer secret from its Details page.

Once we have all the necessary packages installed in R, we need to set up OAuth authentication. Jeff Gentry, the author of the twitteR package, has shown how to do the OAuth setup here

Library and OAuth Setting up

#Calling necessary library functions
>library(twitteR)
>library(ggplot2)

#Enter your consumer key and consumer secret values
>setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")

Retrieving the tweets

Twitter collects millions of tweets per day. Handling “big data” is a whole topic of its own, which I will skip in this blog; I therefore limit myself to a reasonable number of tweets for easy processing and analysis. In the code below I extract 5000 tweets and strip out the duplicate retweets.

#Retrieving most recent tweets about the general election this month
>e.tweets<-searchTwitter('#generalelection', since='2017-06-06', until='2017-06-08', n=5000, lang='en', resultType = 'recent')

#Stripping the duplicated retweets from our collection
>e.tweets<-strip_retweets(e.tweets, strip_manual = TRUE, strip_mt = TRUE)
> length(e.tweets)
[1] 1276

Let’s now convert the tweets into a data frame and have a peek at the data.

>tweets.df <- twListToDF(e.tweets)
>head(tweets.df)

text
1 7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz
2 MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy.
3 I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better.
4 No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection
5 Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46
6 Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC
 favorited favoriteCount replyToSN created truncated replyToSID
1 FALSE 0 <NA> 2017-06-07 23:59:31 FALSE <NA>
2 FALSE 13 <NA> 2017-06-07 23:59:12 FALSE <NA>
3 FALSE 0 <NA> 2017-06-07 23:58:45 FALSE <NA>
4 FALSE 0 <NA> 2017-06-07 23:58:23 FALSE <NA>
5 FALSE 1 <NA> 2017-06-07 23:58:03 FALSE <NA>
6 FALSE 0 <NA> 2017-06-07 23:57:35 TRUE <NA>
 id replyToUID
1 872603961286684678 <NA>
2 872603881125154816 <NA>
3 872603769795727360 <NA>
4 872603679182004224 <NA>
5 872603592368295940 <NA>
6 872603474449575936 <NA>
 statusSource
1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
3 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
4 <a href="http://www.echofon.com/" rel="nofollow">Echofon</a>
5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
 screenName retweetCount isRetweet retweeted longitude latitude
1 whatfrandid 0 FALSE FALSE <NA> <NA>
2 TheMurdochTimes 4 FALSE FALSE <NA> <NA>
3 mattrobin140s 0 FALSE FALSE <NA> <NA>
4 juliahobsbawm 0 FALSE FALSE <NA> <NA>
5 helbigfordeyes 0 FALSE FALSE <NA> <NA>
6 theokoulouris 0 FALSE FALSE <NA> <NA>

Now let’s get rid of the unwanted metadata and extract only the text.

#Extracting the text from the tweets
> tweets.text<-tweets.df$text
> head(tweets.text)

[1] "7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz" 
[2] "MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy." 
[3] "I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better." 
[4] "No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection" 
[5] "Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46"
[6] "Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC"

This looks better to read.

Collecting positive and negative words

Next, I am taking a list of positive and negative words from the website http://ptrckprry.com/. The negative list was missing a few words like “wtf”, so I added them manually!

#After changing the required directory, read the positive and negative files
>pos<-scan('positive.txt', what='character', comment.char = ";")
>neg<-scan('negative.txt', what='character', comment.char = ";")
> head(pos)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"
> head(neg)
[1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably"
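Since the downloaded list won’t contain newer slang, extra terms can simply be appended with c(). This is a minimal sketch; the short neg vector here stands in for the full list read with scan():

```r
# a stand-in for the full negative lexicon loaded with scan()
neg <- c("2-faced", "2-faces", "abnormal", "abolish")

# append the missing slang term and drop any duplicates
neg <- unique(c(neg, "wtf"))
```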

Sentiment scores

I am using a function by Jeffrey Breen with a slight modification to remove certain unwanted characters / unicode blocks.

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)

  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array ("a") of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

    # remove non-ASCII characters (e.g. the emoji bytes such as "\xed...")
    sentence = iconv(sentence, "UTF-8", "ASCII", sub = "")

    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)
  }, pos.words, neg.words, .progress=.progress)

  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

Here’s an example of how the function works. I have written 3 sentences with different sentiments, and you can see the function has scored them accordingly.

> sample <- c("All flights delayed, such a dreadful experience", 
 "Enjoyed & looking forward to another informative conference",
 "Today you are you! That is truer than true!")
> result <- score.sentiment(sample, pos, neg)
> result$score
[1] -2 1 0

Now let’s score our tweets and plot them out.

>sentence<-tweets.text
>e.result <- score.sentiment(tweets.text,pos,neg)
>sentiment<-e.result$score

>qplot(sentiment,
 geom="histogram",
 binwidth = 0.5,
 color= sentiment >= 0,
 xlab = "Sentiment",
 ylab = "Number of tweets",
 main = "Social Network Sentiment Analysis of General Election")

Election

The negative tweets are shown in red and the neutral/positive tweets in green. Ignoring the neutral tweets, the positives slightly outnumber the negatives.
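The visual impression can be checked with a quick tally of the scores. This is a sketch: the sentiment vector below is a small made-up example rather than the real tweet scores.

```r
# hypothetical scores standing in for e.result$score
sentiment <- c(-2, -1, 0, 0, 0, 1, 1, 2, 3, -1)

# bucket each score into negative / neutral / positive and count
breakdown <- table(cut(sentiment,
                       breaks = c(-Inf, -0.5, 0.5, Inf),
                       labels = c("negative", "neutral", "positive")))
print(breakdown)
```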

Now let’s see how the function works if I extract Labour and Conservative tweets separately. After changing the search term to #labour and #conservatives in the code above, I get the following graphs. Both look very similar.

Labour

 

Conservatives

Possible extensions to the project

  • Invoke big data technologies to analyse a much larger number of tweets and find hourly sentiments over a longer continuous time period.
  • More detailed visualisations of the tweets and sentiments.
  • Use machine learning to predict sentiments based on past trends, and reduce the neutrality of tweets with a more sophisticated model.

Linear Regression in R

Linear regression is a fundamental method in regression analysis. It is a quick and easy way to understand how predictive algorithms work, both in general and in the field of machine learning.

To give a simple overview of how the algorithm works, a linear regression fits a linear relationship for a given target variable (Y) using one or more explanatory variables (X).

For example, consider the age, height and weight of children. Intuitively these variables are correlated: as height and age increase, so does weight. A linear regression fits a model, or an equation, for the target variable, assuming the relationship between X and Y is linear. In many practical applications a linear regression model works effectively.
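The intuition can be demonstrated with a quick simulation (made-up data, not a real growth dataset): generate heights, derive weights from a linear rule plus noise, and let lm() recover the relationship.

```r
set.seed(42)

# simulate 100 children: heights in cm, weights from a linear rule plus noise
height <- runif(100, min = 100, max = 160)
weight <- -30 + 0.5 * height + rnorm(100, sd = 3)

# fit weight as a linear function of height: Y = a + b*X
fit <- lm(weight ~ height)
coef(fit)  # the estimates should land close to a = -30 and b = 0.5
```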

Linear_regression.svg

The figure above shows an example of a linear regression model: the red line is the fitted model of a target variable (Y axis) against one explanatory variable (X axis).

Let’s perform a linear regression analysis using the trees dataset in R.

First let’s attach the dataset in the R workspace and have a sneak peek at the data.

> data(trees)
> head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
> str(trees)
'data.frame': 31 obs. of 3 variables:
 $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00

The head() function gives us the top rows of the dataset. We can see the dataset has 3 variables, namely Girth, Height and Volume. From the str() function we get a quick idea of what the data in each variable looks like. For a more detailed view, use summary() to see the mean, median and quartiles of each variable.

Let us now do some exploratory analysis of these variables by plotting them out.

> plot(trees)

Rplot

We can see from the above plot that volume and girth have a linear relationship. Let’s build a linear model using the knowledge we have gained.  Let’s assume we want to predict the volume of a tree using the girth.

> lmod<-lm(Volume~Girth, data = trees)
> summary(lmod)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

The lm() function fits a linear model, and you can see the target and explanatory variables the model uses. Calling summary() gives us the details of the fit.

The model has an intercept of -36.9435 and a slope of 5.0659. The intercept is the point where the regression line crosses the Y axis and the slope is the steepness of the line. So for our linear equation Y = a + bX, the model estimates a as -36.9435 and b as 5.0659, where Y is Volume and X is Girth.

From the significance codes we can see that Girth is statistically significant for our model. We can plot our model for a better understanding.

Rplot01.png

That’s it! We have built our first linear model. The red line shows that we can draw a line matching most of our data points. Some data points lie further from the line and some are quite close; the distance between the predicted and actual values is the residual. In the following blogs we will see how to improve our model fit further.
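The fitted model can also be used on new trees with predict(); a girth of 15 is used here purely as an example value.

```r
data(trees)
lmod <- lm(Volume ~ Girth, data = trees)

# predicted volume for a tree with a girth of 15
predict(lmod, newdata = data.frame(Girth = 15))

# the same number by hand, from Y = a + bX
coef(lmod)[["(Intercept)"]] + coef(lmod)[["Girth"]] * 15
```

Both lines give the same value (roughly 39), confirming that predict() simply applies the fitted equation.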

The types of machine learning problems

I have written this 2 part blog to articulate the technical aspects of machine learning in layman’s terms. For part 1 of this series click here


In the last blog we had a short introduction to machine learning and why it is important. I suggest reading it first to get the most out of this one.

This time we will dive into a more technical introduction to the types of machine learning problems. Let’s look closely into the situations you might come across when you are trying to build your own predictive model.

220px-Chocolate_Cupcakes_with_Raspberry_Buttercream.jpg

Situation 1

Imagine you own a bakery. Your business seems to be quite popular among all types of customers – kids, teens, adults. But you want to know whether people truly like your bakery or not. The answer can depend on many things (e.g. the order they place, their age, their favourite flavour, suggestions from their family and friends). These are the predictor variables that influence our answer. But the answer you are looking for is a simple Yes or No: do people like your bakery or not? This type of machine learning problem is known as classification. Sometimes there are more than 2 categories, for example how much people like your bakery (Very much, Quite a bit, Not at all). These are ordinal classes. Ordinal classes can also be labelled 1, 2 or 3, but remember this is not the same as regression (see below).

Situation 2

You are the owner of the same bakery. But you want more than a classification answer. You want to go straight to the target and find out how much a customer might spend, based on their historic data. You are now looking for an answer on a numerical scale. It can range anywhere from £5 to £15 per visit. Imagine that every time a new customer walks into your bakery, you see the amount they are most likely to spend floating above their head. This is a regression situation.

Situation 3

You don’t know what you want to know. You just want to know if there are groups of customers who are likely to act in a particular way. Do little kids always go for the cupcakes with cartoon characters? Do young teens with their girlfriend or boyfriend go for the heart-shaped ones? You want the data to frame the question and answer it. We are looking for patterns, groups or clusters in the data. This is the clustering problem.

In situations 1 and 2 we have a question framed, we have a set of predictors that we think might influence the answer to our question. This type of machine learning is known as supervised learning. In situation 3, we did not have any question in our mind but we are looking to find patterns or groups from the data. This is known as unsupervised learning.

Summary:

  • Classification: supervised machine learning method where the output variable takes the form of class labels.
  • Regression: supervised machine learning method where the output variable takes the form of continuous values.
  • Clustering: unsupervised machine learning method where we group a set of objects and discover whether there are relationships between them.
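The three problem types can be sketched in a few lines of R on the built-in iris data (a toy illustration rather than a full modelling workflow):

```r
data(iris)

# classification: the target is a class label (Species)
table(iris$Species)

# regression: the target is a continuous value (Petal.Length)
reg <- lm(Petal.Length ~ Petal.Width, data = iris)

# clustering: no target at all -- k-means looks for 3 groups on its own
set.seed(1)
cl <- kmeans(iris[, 1:4], centers = 3)
table(cl$cluster)
```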

For part 1 of this series click here

Or read my blog on big data