Data Science in a nutshell

I intend to keep this blog post as a glossary of some of the most important concepts in data science. All of these concepts are immense topics in their own right. I won’t be diving deep into each one; instead, I will try to give a basic summary of what each is. All explanations are my own, with some inspiration from Wikipedia and An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani – one of the best data science books of all time!

What is Data Science, Machine Learning, AI, Deep learning?

All these terms seem to mean the same thing and it is quite easy to mix up their meanings, but they are all different from each other.

Data Science is a field in which knowledge is extracted from data using statistics, mathematics and advanced analysis techniques. The fundamental difference between a data scientist and a data analyst is how the analysis is done and what it is used for. A data scientist uses statistical analysis to test a hypothesis or predict future trends. A data analyst slices and dices data to look for KPIs and trends without using any machine learning techniques.

Machine Learning is the process of developing algorithms to predict future data trends or find patterns in the data.

Artificial Intelligence is widely used to denote machines acting on their own. In the simplest terms, once we train an algorithm to learn something and test it to see whether it has learnt enough, we can then use it to predict things we as humans cannot. This is AI.

Deep Learning is a specific type of machine learning where the algorithm learns through networks of many interconnected layers, sometimes compared to the way human thought works. (Yes, we humans are the most complex machines on earth!)

Can machine learning steal our jobs?

There is a huge claim that machine learning will replace humans soon and we may all become obsolete – or worse, we might be terminated by robots. This is not the case (again, in my opinion), but I do love a good science fiction story! When computers were invented, we were worried they would replace humans, but we learnt to work with them. I think we as humans can adapt to change in order to survive.

Here are some key machine learning concepts.

Supervised Learning 

This is a type of problem where we are given a data set to learn from and try to predict something (some variable) from it. For example, we might learn from millions of patients’ blood test results and predict whether a new patient is likely to have a disease. There are two types of supervised learning methods:

  1. Regression: Here we try to predict a continuous variable (most likely a number) from a data set. An example is where we learn about the different characteristics of a house such as the house type, the location, whether it has a swimming pool, the number of rooms, etc. Once we learn all this we try to predict the house price which is a numerical variable.
  2. Classification: This is used to predict a class variable. In the same house example, say we want to predict whether the house will sell in the next 3 months or not. The variable we want to predict is a Yes/No answer, so this is a classification problem. In many cases a classification problem may have more than two possible answers. For example, we may want to know whether a house will sell in the next 6 months, 1 year or 2 years. This is known as a multi-class problem.

Unsupervised learning

This is the type of problem where we have nothing specific to predict and instead look for patterns in the data. A popular method of this type is clustering. Say we have a data set of many customers and their purchase patterns, and we want to put them into different buckets based on their behaviour. This is a clustering problem.

Steps to create a machine learning model

  1. Brainstorming This is the first phase, in which we determine what we want the model to do. Do we want a predictive model (supervised techniques)? Are we looking to find clusters in the data (unsupervised techniques)? Are we looking to forecast time series data? These are the questions we need to answer.
  2. Data Cleansing The unavoidable phase! No data set is perfect. We need to look out for outliers and missing data, convert categories into dummy variables, etc.
  3. Data Visualisation This is where we build an understanding of the data through graphs and tables. Do we see a relationship between the age of a customer and the way they purchase? Which has the highest influence on house price, the size of the house or its location? Do diabetic patients tend to have someone in the family who is diabetic as well? These are the types of questions we need to answer with the visuals.
  4. Data Splitting Divide your data into training and test sets.
  5. Modelling The actual fitting of the model: the algorithm learns from the training data how to classify or predict a numeric output. Then get an idea of how much the model has learnt by testing it on new, unseen test data and measuring the accuracy of your predictions.
  6. Post modelling Not all models work the first time. We need to fine-tune the parameters of the model, tweak the data set, extract features that work and eliminate features that do not contribute. Trial and test until perfection.
  7. Predictions Once you have accomplished all of the above, the model is ready to predict. It’s not easy getting here! (A minimal code sketch of steps 4 to 7 follows this list.)
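
To make steps 4 to 7 concrete, here is a minimal R sketch. It is my own illustrative example, using the built-in iris data and the rpart package rather than anything from a real project:

library(rpart)

# Step 4: split the data into training and test sets (80/20)
set.seed(1)
idx <- sample(nrow(iris), size = 0.8 * nrow(iris))
train <- iris[idx, ]
test <- iris[-idx, ]

# Step 5: fit a model on the training data
model <- rpart(Species ~ ., data = train)

# Steps 6 and 7: predict on unseen data and check the accuracy
preds <- predict(model, test, type = "class")
mean(preds == test$Species)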

Bias and variance are both sources of error when fitting a model. High bias leads to underfitting, while high variance causes overfitting. Bias is the error made by failing to capture an underlying relationship in the data. Variance is the error of a model that is too sensitive to small fluctuations in the training data. A model with high variance can mistake random noise in the dataset for a true relationship, whereas a model with high bias can miss important relationships altogether. The two trade off against each other: reducing bias tends to increase variance and vice versa.
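
For readers who like a formula, this trade-off is usually made precise through the standard bias-variance decomposition of the expected squared prediction error at a point x (stated here for reference, not derived):

E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2

where \sigma^2 is the irreducible error that no model can remove.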

Of course, all of the above merely scratches the surface of data science; there is a lot more(!) but hey, this blog post is data science in a nutshell.


Scraping data from a website using rvest

There are lots of web scraping tools available online, but sometimes I prefer to skip them and write the code in R to keep everything in one place. In this blog post, I’ll be using the rvest package to show how simple it is to scrape the web and gather a neat data set for analysis.

We’ll be scraping data from IMDb, specifically data about Black Mirror.

First I want to scrape data on the cast, so this is the link that I will be using:

http://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast

Now I go into RStudio, load the necessary packages and read this HTML.

library(rvest)
bm <- read_html("https://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast")

Next, I will extract data from the HTML we just read. For this I use SelectorGadget, a great Chrome extension for finding CSS selectors. Using this tool, I find that the CSS selector I need is .itemprop. If you want more information on how it works, go to their website. Using the following, I get the entire cast list.

cast <- bm %>% html_nodes(".itemprop") %>% html_text()
cast

1 Daniel Lapaine 
2 Daniel Lapaine
3 Hannah John-Kamen 
4 Hannah John-Kamen
5 Michaela Coel 
6 Michaela Coel
7 Beatrice Robertson-Jones 
8 Beatrice Robertson-Jones
9 Daniel Kaluuya 
10 Daniel Kaluuya
11 Toby Kebbell 
12 Toby Kebbell
13 Rory Kinnear 
14 Rory Kinnear
15 Hayley Atwell 
16 Hayley Atwell
17 Lenora Crichlow 
18 Lenora Crichlow
19 Daniel Rigby 
20 Daniel Rigby
.
.
.

A little cleanup with gsub() and trimws() formats the output, and unique() removes the duplicates.

as.data.frame(trimws(gsub("\n ", "", cast))) %>% unique()

We can also get other information from the website by picking new CSS selectors with SelectorGadget and scraping them with rvest in exactly the same way, as in the sketch below.
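
For example, to grab the episode titles you could do something along the following lines. Note that the episodes URL and the .info strong a selector here are my assumptions for illustration; you would confirm the right selector with SelectorGadget first.

# Hypothetical example: the URL and CSS selector below are assumptions, verify with SelectorGadget
episodes <- read_html("https://www.imdb.com/title/tt2085059/episodes")
episodes %>% html_nodes(".info strong a") %>% html_text()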

Data Manipulation of Star Wars characters using dplyr and tidyr

The advent of several “point and click” or “drag and drop” tools has eased data manipulation for analysts. Data retrieval, wrangling and cleansing is becoming a task anyone can perform. However, such tools still struggle with advanced or complex analyses unless custom code is written. Tools like R and Python remain the most preferred options because of how well they handle statistical analysis and machine learning.

Every data scientist or analyst spends about 80% of their time on data wrangling and only about 20% on the actual analysis. Many packages in the R programming language are used extensively for complex data manipulation, and the dplyr package is particularly neat for data munging tasks. In this blog I aim to cover various functions and capabilities of the dplyr and tidyr packages.

With the recent Star Wars movie released, I want to use the starwars tibble for some analysis here!

Calling all libraries and datasets:

> library(dplyr)
> library(tidyr)
> data(starwars)

A tibble is quite different from a data frame in terms of printing and subsetting, and it is simply a nicer way to create a data frame. It never adjusts the names of variables or changes the input types. Check this article for more details.

Now let’s view the dataset:

> starwars


A tibble displays just the top 10 rows when printed on my RStudio console. A dplyr equivalent that gives a data frame the same friendly printing is the following command:

> dplyr::tbl_df(starwars)

If we want to just glimpse the data we can use the following as well

> dplyr::glimpse(starwars)


One particular construct commonly seen in dplyr code is the pipe operator, %>%. It passes the object on its left-hand side to the function on its right, which makes the code more readable. We will see more of this later.
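
As a quick illustration (not from the original post), the two calls below are equivalent; the piped version reads left to right instead of inside out:

> dplyr::filter(starwars, mass <= 20)
> starwars %>% dplyr::filter(mass <= 20)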

Now let’s look at the different row-wise filtering options.

> dplyr::filter(starwars,mass<=20)


And I get what I wanted to see! Yoda’s mass is 17, so his row is included in the results.

If you just want to view a specific set of rows, say rows 13 to 14, then we do the following:

> dplyr::slice(starwars, 13:14)

And look who we have here! Han Solo and Chewbacca.


Here is a column-wise manipulation:

> dplyr::select(starwars, name, species, birth_year)


We can also do summary and grouping operations. For example if we want to see the average mass by gender, then:

> starwars %>%
 group_by(gender) %>%
 summarise(avg = mean(mass, na.rm=TRUE)) %>%
 arrange(avg)


We can also create new variables. Say I want to create a BMI value. For the sake of simplicity let’s ignore gender and use the formula mass / height², remembering to convert height from centimetres to metres first.

> starwars %>%
mutate(bmi = mass/((height/100)^2)) %>%
arrange(desc(bmi)) %>%
select(name, species, height, mass, bmi)


I am not going to talk about the above results!

We can also reshape data using tidyr. Here is an example that combines name and species into a single column.

> starwars %>%
unite(name_species, name, species) %>% 
select(name_species)

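If you ever need to reverse that step, tidyr’s separate() splits the combined column back out again. A small sketch reusing the column created above (the underscore separator is unite()’s default):

> starwars %>%
 unite(name_species, name, species) %>%
 separate(name_species, into = c("name", "species"), sep = "_") %>%
 select(name, species)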

References: This amazing cheat sheet.

Predictive modelling of Iris Species with Caret

The Iris dataset was used in Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. It includes three iris species with 50 samples each as well as some properties about each flower. In this blog, I will use the caret package from R to predict the species class of various Iris flowers. Let’s jump into the code.

Calling/Invoking all the necessary packages in R

#Calling libraries
> library(AppliedPredictiveModeling)
> library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

> library(pROC)

## Type 'citation("pROC")' for a citation.

## 
 ## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
 ## 
 ##     cov, smooth, var

The iris data frame is already included in R and can be loaded using the data() command. A quick summary() and head() give us a nice introduction to the dataset. We also set a seed value so that the results are reproducible.

#Prepare the dataset
> data(iris)
> summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
 ##        Species  
 ##  setosa    :50  
 ##  versicolor:50  
 ##  virginica :50  
 ##                 
 ##                 
 ##

> head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 ## 1          5.1         3.5          1.4         0.2  setosa
 ## 2          4.9         3.0          1.4         0.2  setosa
 ## 3          4.7         3.2          1.3         0.2  setosa
 ## 4          4.6         3.1          1.5         0.2  setosa
 ## 5          5.0         3.6          1.4         0.2  setosa
 ## 6          5.4         3.9          1.7         0.4  setosa

> nrow(iris)

## [1] 150

> set.seed(9999)

The following featurePlot command from the caret package shows how the Species values are distributed in a pairwise plot. One observation from this plot is that versicolor and virginica have similar patterns while setosa is quite distinct. The intuition is that we can predict setosa easily and might have some challenges distinguishing versicolor from virginica.

#Visualising the dataset
> transparentTheme(trans = .4)
> featurePlot(x = iris[, 1:4], 
             y = iris$Species, 
             plot = "ellipse",
             auto.key = list(columns = 3))

[Figure: pairwise ellipse plot of the four iris features, coloured by species]

Now we split the dataset into train and test partitions: 80% of the data goes into the training set and the remaining 20% into the test set. We also define a 10-fold cross-validation scheme, repeated 5 times. This reduces over-fitting to the training set and helps the model generalise to unknown or new data, since the model is trained and evaluated several times on different subsets of the training data.

#Split into train and test dataset
> trainIndex <- createDataPartition(iris$Species, p = .8,
 list = FALSE,
 times = 1)
> train <- iris[ trainIndex,]
> test  <- iris[-trainIndex,]

> nrow(train)

## [1] 120

> nrow(test)

## [1] 30

#Cross validation
> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

Ours is a classification problem, since the output variable is a class. There are several machine learning algorithms for class prediction; we implement three methods here and compare them to find the best fit.

First we fit a decision tree model (rpart) to the training dataset. A decision tree is a model with an “if this then that” approach, and it is easy and fast to interpret by visualising the tree.

We also preprocess the dataset (centring and scaling) so that variables with very different ranges of values do not dominate the outcome.

> dt.fit <- train(Species ~ ., data = train,
 method = "rpart",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: rpart

> dt.fit

## CART
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   cp    Accuracy   Kappa
 ##   0.00  0.9550000  0.9325
 ##   0.45  0.7683333  0.6525
 ##   0.50  0.3333333  0.0000
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was cp = 0.

Accuracy is the percentage of correctly classified instances out of all instances. From the above results we can see that the model has a good accuracy value. Kappa is similar to accuracy, but it is normalised by the accuracy expected by chance, which makes it more useful for class-imbalanced problems. cp is the complexity parameter, which controls the size of the decision tree. We can see that the model has chosen the cp value with the best accuracy.
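
As a quick sanity check on the Kappa column (using the standard Cohen’s kappa formula, which the caret output does not spell out), the accuracy expected by chance with three balanced classes is about 1/3, so for the cp = 0 row:

\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.955 - 1/3}{1 - 1/3} \approx 0.9325

which matches the reported Kappa value.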

Next, we predict the test dataset using the trained model. A confusion matrix is used to understand the performance of the model; it is a table comparing actual and predicted values. A variable importance plot shows that petal width is the most important variable for predicting the Species class.

> predictions <- predict(dt.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(dt.fit))

[Figure: variable importance plot for the decision tree model]

The decision tree predicts the test set with an accuracy of 93%.

The second algorithm we use is K-Nearest Neighbours (KNN). This is a great method for classifying the iris dataset. In simple words, it classifies a new point based on the classes of its nearest neighbouring data points. K is the number of neighbours considered, and the algorithm has chosen the best K value based on accuracy.

> knn.fit <- train(Species ~ ., data = train,
 method = "knn",
 trControl = fitControl,
 preProcess=c("center", "scale"))

> knn.fit

## k-Nearest Neighbors
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   k  Accuracy  Kappa
 ##   5  0.955     0.9325
 ##   7  0.955     0.9325
 ##   9  0.945     0.9175
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was k = 7.

> predictions <- predict(knn.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(knn.fit))

[Figure: variable importance plot for the KNN model]

The KNN model predicts with an accuracy of 93%.

The final method we use is the random forest. This method aggregates the results of a set of decision trees, which minimises the error that individual trees can make. The mtry value is the number of variables available for splitting at each tree node. Here again the optimal model is selected based on accuracy.

> rf.fit <- train(Species ~ ., data = train,
 method = "rf",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

##
 ## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
 ##
 ##     margin

> rf.fit

## Random Forest
 ##
 ## 120 samples
 ##   4 predictor
 ##   3 classes: 'setosa', 'versicolor', 'virginica'
 ##
 ## Pre-processing: centered (4), scaled (4)
 ## Resampling: Cross-Validated (10 fold, repeated 5 times)
 ## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
 ## Resampling results across tuning parameters:
 ##
 ##   mtry  Accuracy   Kappa
 ##   2     0.9533333  0.9300
 ##   3     0.9616667  0.9425
 ##   4     0.9616667  0.9425
 ##
 ## Accuracy was used to select the optimal model using  the largest value.
 ## The final value used for the model was mtry = 3.

> predictions <- predict(rf.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
 ##
 ##             Reference
 ## Prediction   setosa versicolor virginica
 ##   setosa         10          0         0
 ##   versicolor      0         10         2
 ##   virginica       0          0         8
 ##
 ## Overall Statistics
 ##
 ##                Accuracy : 0.9333
 ##                  95% CI : (0.7793, 0.9918)
 ##     No Information Rate : 0.3333
 ##     P-Value [Acc > NIR] : 8.747e-12
 ##
 ##                   Kappa : 0.9
 ##  Mcnemar's Test P-Value : NA
 ##
 ## Statistics by Class:
 ##
 ##                      Class: setosa Class: versicolor Class: virginica
 ## Sensitivity                 1.0000            1.0000           0.8000
 ## Specificity                 1.0000            0.9000           1.0000
 ## Pos Pred Value              1.0000            0.8333           1.0000
 ## Neg Pred Value              1.0000            1.0000           0.9091
 ## Prevalence                  0.3333            0.3333           0.3333
 ## Detection Rate              0.3333            0.3333           0.2667
 ## Detection Prevalence        0.3333            0.4000           0.2667
 ## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(rf.fit))

[Figure: variable importance plot for the random forest model]

The random forest predicts with an accuracy of 93%.

From the above analysis we can see that all three models have performed very well, so we could use the predictions from any of them. For a more formal comparison than eyeballing the accuracies, we can also put the resampling results side by side, as sketched below.
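
This is a minimal sketch using caret’s resamples() on the three model objects fitted above; summary() and bwplot() then compare their cross-validated accuracy and Kappa.

> results <- resamples(list(CART = dt.fit, KNN = knn.fit, RF = rf.fit))
> summary(results)
> bwplot(results)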

Predict if a client will subscribe to a term deposit using decision trees

A term deposit is a deposit with a specified period of maturity and earns interest. It is a money deposit at a banking institution that cannot be withdrawn for a specific term or period of time (unless a penalty is paid).  [Source: Wikipedia]

Predicting if a client will subscribe to a term deposit can help increase the efficiency of a marketing campaign and help us understand the factors that influence a successful outcome (subscription) from a client.

In this blog post, I will be using the bank marketing data set from the UCI Machine Learning Repository that can be downloaded from here. Let’s get started by invoking the necessary packages in R.

#Call necessary libraries
> library(rpart)
> library(caret)
> library(AppliedPredictiveModeling)

Now, let’s read the input file in CSV format specifying the path to the file. We also specify the header and separators using the read.csv function.

#Read the input file
> getwd()
> setwd("/Users/<path-to-csv>")
> bank<-read.csv("bank.csv",header=TRUE, sep=";")

The first thing to do is to explore the dataset.

#Understand the structure of the dataset
> names(bank)
 [1] "age" "job" "marital" "education" "default" "balance" "housing" 
 [8] "loan" "contact" "day" "month" "duration" "campaign" "pdays" 
[15] "previous" "poutcome" "y" 
> nrow(bank)
[1] 4521
> ncol(bank)
[1] 17
> str(bank)
'data.frame': 4521 obs. of  17 variables: 
$ age      : int  30 33 35 30 59 35 36 39 41 43 ... 
$ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ... 
$ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ... 
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ... 
$ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 
$ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ... 
$ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ... 
$ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ... 
$ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ... 
$ day      : int  19 11 16 3 5 23 14 6 14 17 ... 
$ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ... 
$ duration : int  79 220 185 199 226 141 341 151 57 313 ... 
$ campaign : int  1 1 1 4 1 2 1 2 2 1 ... 
$ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ... 
$ previous : int  0 4 1 0 0 3 2 0 0 2 ... 
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ... 
$ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, the dataset has 17 variables and 4521 records. The target variable y has a binary outcome for the client’s subscription status (“yes” or “no”), so we will look at a classification algorithm. All other details of the input variables can be found on the UCI website.

On initial examination, all the input variables seem important to the client’s decision, but how reliable is human intuition? Let’s test it with some initial exploratory visualisations.

> transparentTheme(trans = .4)
> featurePlot(x = bank[, c(1,6,12,13,14,15)], 
 y = bank$y, 
 plot = "pairs",
 auto.key = list(columns = 2))

[Figure: pairwise feature plot of selected numeric variables, coloured by outcome]

From the above plot we can see how some variables relate to each other. Age and balance appear related: the lower the age, the higher the balance tends to be. The campaign and age variables show a pattern as well, but how do all these variables affect the outcome? Looking at the same graph colour-wise (red being a “no” and blue being a “yes” to subscription), we can also see that younger people have more “no” outcomes than “yes”. We can also plot the target variable by age to understand the distribution.

>boxplot(bank$age~bank$y, ylab="Age",xlab="y")

[Figure: boxplot of age by subscription outcome]

Now let’s focus on the target variable.

> table(bank$y)

no yes 
4000 521

> table(bank$y)/nrow(bank)

no yes 
0.88476 0.11524

From the above we can also see that the class variable is unbalanced. For the sake of simplicity, we do not address this issue in the rest of this blog, although one possible fix is sketched below.
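
If we did want to address the imbalance, one simple option (shown only as a sketch, not something used later in this post) is to down-sample the majority class with caret’s downSample() before training:

> #Down-sample the majority "no" class so both classes have equal counts (illustrative only)
> set.seed(999)
> bank_down <- downSample(x = bank[, -17], y = bank$y, yname = "y")
> table(bank_down$y)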

Now, let us begin the machine learning analysis. First we set the seed value so that we get the same results for the trained model every time we run it. Then, we split the dataset into training and test sets, allocating 60% of the data to the training set and the remaining 40% to the test set.

> set.seed(999)
> trainIndex <- createDataPartition(bank$y, p = .6, 
 list = FALSE, 
 times = 1)
> train <- bank[ trainIndex,]
> test <- bank[-trainIndex,]

We create a 10-fold cross-validation scheme, repeated 5 times, for training. This reduces over-fitting to the training set and helps the model generalise to unknown or new data, since the model is trained and evaluated several times on different subsets of the training data.

> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

The data is now ready for training. We use a recursive partitioning and regression tree (rpart) method for the purpose of this blog.

Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. [Source: Wikipedia]

> fit <- train(y ~ ., data = train, 
 method = "rpart", 
 trControl = fitControl)

Let us now look at the results of the trained model.

> fit
CART

2713 samples
 16 predictor
 2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 2442, 2441, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

cp Accuracy Kappa 
 0.02236422 0.8951085 0.3912704
 0.05111821 0.8860340 0.2088233
 0.05271565 0.8850749 0.1745803

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02236422.

The next step is to use the trained machine learning model to predict the test dataset.

> predictions <- predict(fit, test)

The results of our predictions can be viewed using a confusion matrix – which is a table of the actual and predicted values.

> conf.mat <- table(test$y,predictions)
> acc <- sum(diag(conf.mat))/sum(conf.mat)
> acc

[1] 0.8904867

Our model has reached an accuracy of 89%, which looks like a good score, although given the class imbalance noted earlier, always predicting “no” would already be right about 88% of the time. An ROC plot can also illustrate our results better; a sketch follows.
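
As a sketch of that ROC plot (this assumes the pROC package, which is not loaded elsewhere in this post), we need predicted class probabilities rather than hard class labels:

> library(pROC)
> probs <- predict(fit, test, type = "prob")[, "yes"]
> roc.curve <- roc(response = test$y, predictor = probs)
> plot(roc.curve)
> auc(roc.curve)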

Possible extensions to our project include fitting different models and comparing individual accuracies against each other.

References: 

http://www.columbia.edu/~jc4133/ADA-Project.pdf

https://rpubs.com/nadi27oct/259406

Time Series Analysis of Monthly Milk Production

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. [Source: Wikipedia]

Time series analysis can be useful to identify and separate trends, cycles and seasonality of a dataset. In this blog I will illustrate how I analyse time series datasets in R and also show methods to forecast the data.

Let’s look at the Monthly Milk Production dataset from datamarket.com. The dataset measures monthly milk production in pounds per cow from January 1962 to December 1975.

The dataset can be downloaded from this link. Once we download the CSV file and place it in the working directory, we can read the file using the following code.

> library(forecast)
> milk <- read.csv("monthly-milk-production-pounds-p.csv")

Now we can transform the input to a time series by giving the frequency and start month/year. We can also print out the time series as seen below.

> milk.ts <- ts(milk, frequency=12, start=c(1962,1))
> milk.ts
 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1962 561 640 656 727 697 640 599 568 577 553 582 600
1963 566 653 673 742 716 660 617 583 587 565 598 628
1964 618 688 705 770 736 678 639 604 611 594 634 658
1965 622 709 722 782 756 702 653 615 621 602 635 677
1966 635 736 755 811 798 735 697 661 667 645 688 713
1967 667 762 784 837 817 767 722 681 687 660 698 717
1968 696 775 796 858 826 783 740 701 706 677 711 734
1969 690 785 805 871 845 801 764 725 723 690 734 750
1970 707 807 824 886 859 819 783 740 747 711 751 804
1971 756 860 878 942 913 869 834 790 800 763 800 826
1972 799 890 900 961 935 894 855 809 810 766 805 821
1973 773 883 898 957 924 881 837 784 791 760 802 828
1974 778 889 902 969 947 908 867 815 812 773 813 834
1975 782 892 903 966 937 896 858 817 827 797 843

Let’s plot out the time series.

> plot.ts(milk.ts)

[Figure: plot of the monthly milk production time series]

We can see that there is seasonal variation in the time series within each year. It also looks like an additive model, as the seasonal fluctuations are roughly constant over time. If the seasonal fluctuations grew with the level of the time series, that would suggest a multiplicative model (which is not the case here).

Let’s now decompose the series into its constituents: a trend component, a seasonal component (if present) and an irregular component. The irregular component is the remainder of the time series once the trend and seasonality are removed.

> plot(decompose(milk.ts))

[Figure: decomposition of the milk production time series into trend, seasonal and remainder components]

If we have a time series using an additive model with an increasing or decreasing trend and seasonality, we can use Holt-Winters exponential smoothing to make short-term forecasts. To fit a predictive model for the log of the monthly milk production we write the following code.

> logmilk.ts <- log(milk.ts)
> milk.ts.forecast <- HoltWinters(log(milk.ts))
> milk.ts.forecast

Holt-Winters exponential smoothing with trend and additive seasonal component.

Call:
HoltWinters(x = log(milk.ts))

Smoothing parameters:
 alpha: 0.587315
 beta : 0
 gamma: 1

Coefficients:
 [,1]
a 6.788338238
b 0.002087539
s1 -0.031338300
s2 -0.090489288
s3 0.043463485
s4 0.058024146
s5 0.127891508
s6 0.100852168
s7 0.057336548
s8 0.011032882
s9 -0.047972067
s10 -0.047584275
s11 -0.097105193
s12 -0.051371280

> plot(milk.ts.forecast)

As with simple exponential smoothing and Holt’s exponential smoothing, we can plot the original time series as a black line, with the fitted values as a red line on top of it.

[Figure: Holt-Winters fitted values (red) over the observed log series (black)]

To make forecasts for future times not included in the original time series, we use the “forecast()” function in the “forecast” package.

> milk.ts.forecast2 <- forecast(milk.ts.forecast, h=48)
> plot(milk.ts.forecast2)

[Figure: 48-month forecast of the log series with prediction intervals]

The forecasts are shown as the dark blue line, and the grey areas are the prediction intervals.
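
One caveat worth remembering: because the model was fitted to log(milk.ts), both the fitted values and these forecasts are on the log scale. A small sketch of converting the point forecasts back to pounds per cow:

> #Back-transform the point forecasts from the log scale
> exp(milk.ts.forecast2$mean)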

References: http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html

https://www.statmethods.net/advstats/timeseries.html

Customer Segmentation using K-Means Clustering

Market segmentation is the process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers (known as segments) based on some type of shared characteristics. In dividing or segmenting markets, researchers typically look for common characteristics such as shared needs, common interests, similar lifestyles or even similar demographic profiles. The overall aim of segmentation is to identify high yield segments – that is, those segments that are likely to be the most profitable or that have growth potential – so that these can be selected for special attention (i.e. become target markets). [Source: Wikipedia]

First a little introduction of the K-Means algorithm which I am going to implement here. The following process describes how this method is implemented.

1. k initial “means” are randomly generated within the data domain.

2. k clusters are created by associating every observation with the nearest mean.

3. The centroid of each of the k clusters becomes the new mean.

4. Steps 2 and 3 are repeated until convergence has been reached.

[Source: Wikipedia]

In this blog I am looking to implement a K-Means clustering algorithm to a dataset from the trusted UCI Machine Learning Repository. The Wholesale Customers dataset consists of 8 attributes and 440 Instances. It can be downloaded from here.

Attribute Information:

1) FRESH: annual spending (m.u.) on fresh products (Continuous)
2) MILK: annual spending (m.u.) on milk products (Continuous)
3) GROCERY: annual spending (m.u.) on grocery products (Continuous)
4) FROZEN: annual spending (m.u.) on frozen products (Continuous)
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
7) CHANNEL: customers’ channel – Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal)
8) REGION: customers’ region – Lisbon, Oporto or Other (Nominal)
Descriptive Statistics:

(Minimum, Maximum, Mean, Std. Deviation)
FRESH ( 3, 112151, 12000.30, 12647.329)
MILK (55, 73498, 5796.27, 7380.377)
GROCERY (3, 92780, 7951.28, 9503.163)
FROZEN (25, 60869, 3071.93, 4854.673)
DETERGENTS_PAPER (3, 40827, 2881.49, 4767.854)
DELICATESSEN (3, 47943, 1524.87, 2820.106)

REGION Frequency
Lisbon 77
Oporto 47
Other Region 316
Total 440

CHANNEL Frequency
Horeca 298
Retail 142
Total 440

Now let us jump straight into the code.

#Call necessary libraries
> library(corrplot)
#Read the input file
> getwd()
> setwd("<path to file>")
> wholecust<-read.csv("Wholesalecustomersdata.csv",header=TRUE, sep=",")

Now, let’s understand the dataset by having a peek at the data:

> head(wholecust)
 Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451

We can also get a useful summary of the dataset:

> summary(wholecust)
 Channel Region Fresh Milk Grocery 
 Min. :1.000 Min. :1.000 Min. : 3 Min. : 55 Min. : 3 
 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 
 Median :1.000 Median :3.000 Median : 8504 Median : 3627 Median : 4756 
 Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796 Mean : 7951 
 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 
 Max. :2.000 Max. :3.000 Max. :112151 Max. :73498 Max. :92780 
 Frozen Detergents_Paper Delicassen 
 Min. : 25.0 Min. : 3.0 Min. : 3.0 
 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2 
 Median : 1526.0 Median : 816.5 Median : 965.5 
 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9 
 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2 
 Max. :60869.0 Max. :40827.0 Max. :47943.0

Let us now explore the dataset more and find out strong correlations among variables.

#Identify strong correlations
>w <- cor(wholecust)
>corrplot(w, method="number")

[Figure: correlation plot of the wholesale customer variables]

From the above diagram we can see that there is a strong correlation between Detergents_Paper and Grocery.

Now we need to find the optimal number of clusters, k. Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.  [Source: Wikipedia]

>set.seed(123)
>k.max <- 15
>data <- as.matrix(scale(wholecust[,(3:8)]))

>wss <- sapply(1:k.max, 
 function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})

>plot(1:k.max, wss,
 type="b", pch = 19, frame = FALSE, 
 xlab="Number of clusters",
 ylab="Sum of squares")

[Figure: elbow plot of total within-cluster sum of squares against the number of clusters]

From the above plot we can see that 3 or 5 looks like a sensible number of clusters, as beyond these values the curve flattens out and adding clusters gains little.

Let us run K-means for both scenarios.

>kmm <- kmeans(wholecust[,(3:8)], 3)
>kmm


K-means clustering with 3 clusters of sizes 60, 330, 50

Cluster means:
 Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
2 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
3 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020

Clustering vector:
 [1] 2 2 2 2 1 2 2 2 2 3 2 2 1 2 1 2 2 2 2 2 2 2 1 3 1 2 2 2 3 1 2 2 2 1 2 2 1 2 3 1 1 2
 [43] 2 3 2 3 3 3 2 3 2 2 1 2 1 2 3 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
 [85] 2 3 3 1 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 1 1
[127] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 3 2 2 2 1 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 3 2 2
[169] 2 2 2 3 2 3 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 3 1 2 2 3 2 2 2 3
[211] 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 3
[253] 2 1 2 1 2 2 1 1 2 2 1 2 2 3 3 2 3 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2
[295] 2 2 2 2 2 2 2 3 2 2 3 2 3 2 2 3 2 1 3 2 2 2 2 2 2 3 2 2 2 2 1 1 2 2 2 2 2 3 2 3 2 1
[337] 2 2 2 2 2 2 2 3 2 2 2 1 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1
[379] 2 2 1 2 1 2 3 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 1 3 2 2 2 2 2 2 2 2 2 2 3 2
[421] 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 1 3 2 2

Within cluster sum of squares by cluster:
[1] 25765310312 28184318853 26382784678
 (between_SS / total_SS = 49.0 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

Now let’s implement K means for k=5

> kmm <- kmeans(wholecust[,(3:8)], 5)
> kmm
K-means clustering with 5 clusters of sizes 223, 23, 104, 80, 10

Cluster means:
 Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 6052.812 3266.314 4092.054 2459.682 1214.1300 990.6099
2 49296.087 4983.783 5590.304 8285.783 962.2609 2543.6957
3 21200.058 3886.423 5138.933 4119.856 1131.5192 1690.3365
4 4738.762 11609.013 18198.775 1515.388 8003.7750 1603.8000
5 21263.700 37443.300 46710.600 6287.200 21699.4000 8743.3000

Clustering vector:
 [1] 1 1 1 1 3 1 1 1 1 4 4 1 3 3 3 1 4 1 3 1 3 1 3 5 3 3 1 3 4 2 3 1 3 3 1 1 3 4 4 2 3 3
 [43] 4 4 1 4 4 5 1 4 1 1 2 4 3 1 4 4 1 1 1 5 1 4 1 5 1 3 1 1 3 3 1 3 1 3 1 4 1 1 1 4 4 3
 [85] 1 5 5 2 1 3 1 1 5 3 4 1 1 1 1 1 4 4 1 2 3 3 4 4 1 4 1 4 3 3 3 1 1 1 3 1 3 1 1 1 2 2
[127] 3 3 1 2 1 1 3 1 1 1 1 1 1 1 3 3 2 1 3 4 1 1 1 3 3 1 3 1 1 4 4 3 1 4 1 1 3 4 1 4 1 1
[169] 1 1 4 4 1 4 1 4 2 1 1 1 1 2 4 5 1 1 1 1 4 4 3 1 1 4 1 3 3 1 1 1 4 4 3 1 1 4 1 1 1 4
[211] 3 5 1 1 4 4 4 3 4 1 3 1 1 1 1 1 3 1 1 1 1 1 3 1 3 1 1 3 1 2 3 3 3 1 1 4 1 1 3 1 1 4
[253] 1 3 1 3 1 1 2 2 1 1 3 1 4 4 4 3 4 3 1 1 1 2 1 1 3 1 1 3 1 1 2 3 2 2 1 3 3 2 1 1 1 4
[295] 3 1 3 1 1 1 3 4 1 4 4 4 4 3 1 4 1 3 4 1 1 4 1 1 1 4 1 1 3 1 3 2 1 1 3 1 1 4 3 5 3 3
[337] 1 1 1 1 1 1 1 4 1 1 4 3 1 4 1 4 1 4 3 1 3 4 1 1 3 1 1 1 1 1 1 1 3 1 2 3 1 3 1 1 4 2
[379] 1 1 3 3 3 1 4 1 1 3 1 1 1 1 1 3 1 1 4 1 1 1 1 3 3 3 3 1 3 4 1 1 1 1 1 1 1 1 4 1 4 1
[421] 4 3 3 3 3 1 4 3 1 1 4 1 3 1 3 3 2 4 1 1

Within cluster sum of squares by cluster:
[1] 10060038988 11679101316 8521349738 8835879467 14108802241
 (between_SS / total_SS = 66.2 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

Let us closely look at the cluster means of both scenarios:

Scenario 1: k = 3

[Table: cluster means for k = 3, as printed above]

  • Cluster 1 – highest fresh-products.
  • Cluster 2 – low spenders.
  • Cluster 3 – highest milk, grocery, detergents_papers spenders.

Scenario 2: k = 5

[Table: cluster means for k = 5, as printed above]

  • Cluster 1 – low spenders
  • Cluster 2 – highest Fresh spenders
  • Cluster 3 – mediocre spenders
  • Cluster 4 – low spenders
  • Cluster 5 – mediocre Fresh, highest milk, Grocery, detergents_papers

From the above two analyses we can see that 3 clusters is a reasonable choice for quickly understanding the customer segmentation.

A possible extension to the project is to use Principal Component Analysis for dimensionality reduction; a small sketch of that idea follows.
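
As a starting point for that extension, here is a minimal sketch (my own illustration) using base R’s prcomp() on the six spending variables and colouring the first two principal components by a k = 3 cluster assignment:

> #PCA on the six spending variables (sketch of the suggested extension)
> pca <- prcomp(wholecust[, 3:8], scale. = TRUE)
> summary(pca)
> #Plot the first two principal components, coloured by a fresh k = 3 fit
> kmm3 <- kmeans(wholecust[, 3:8], 3)
> plot(pca$x[, 1:2], col = kmm3$cluster, xlab = "PC1", ylab = "PC2")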