Predictive modelling with caret

The Iris dataset was used in Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. It includes three iris species, with 50 samples each, along with four measurements for each flower. In this blog, I will use the caret package in R to predict the species of iris flowers. Let’s jump into the code.

Loading all the necessary packages in R

#Calling libraries
> library(AppliedPredictiveModeling)
> library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

> library(pROC)

## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

The iris data frame is included with R, and we can attach it using the data() command. A quick summary() and head() also give us a nice introduction to the dataset. We also set a seed value so that the results are reproducible.

#Prepare the dataset
> data(iris)
> summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  

> head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

> nrow(iris)

## [1] 150

> set.seed(9999)

The following featurePlot command from the caret package shows how the Species values are distributed in a pairwise plot. One observation is that versicolor and virginica have similar patterns, while setosa is quite distinct. The intuition is that we can predict setosa easily and might have some challenges separating versicolor and virginica.

#Visualising the dataset
> transparentTheme(trans = .4)
> featurePlot(x = iris[, 1:4], 
             y = iris$Species, 
             plot = "ellipse",
             auto.key = list(columns = 3))

[Figure: pairwise ellipse plot of the four iris features, coloured by species]

Now we split the dataset into train and test partitions, using 80% of the data for training and the remaining 20% for testing. We also define a 10-fold cross-validation method, repeated 5 times. Cross-validation reduces over-fitting to the training set and helps the model generalise to unknown or new data: the model is trained and evaluated several times on subsets of the training data.

#Split into train and test dataset
> trainIndex <- createDataPartition(iris$Species, p = .8,
 list = FALSE,
 times = 1)
> train <- iris[ trainIndex,]
> test  <- iris[-trainIndex,]

> nrow(train)

## [1] 120

> nrow(test)

## [1] 30

#Cross validation
> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

Ours is a classification problem, since the output variable is a class. Several machine learning algorithms are available for class prediction; we implement three of them here and compare them to find the best fit.

First, we fit a decision tree model (rpart) to the train dataset. A decision tree is a model with an “if this, then that” approach, and it is easy and fast to interpret by visualising the tree.

We also preprocess the dataset (centering and scaling) so that variables with very different ranges of values do not dominate the outcome.

> dt.fit <- train(Species ~ ., data = train,
 method = "rpart",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: rpart

> dt.fit

## CART
##
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
##   cp    Accuracy   Kappa
##   0.00  0.9550000  0.9325
##   0.45  0.7683333  0.6525
##   0.50  0.3333333  0.0000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

Accuracy is the percentage of correctly classified instances out of all instances. From the results above we can see that the model has good accuracy. Kappa is similar to accuracy, but it is normalised against the agreement expected by chance, which makes it more useful for class-imbalanced problems. cp is the complexity parameter, which controls the size of the decision tree; caret has chosen the tree size with the best accuracy.
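To make the two metrics concrete, here is a small hand computation of accuracy and kappa from a 3x3 confusion matrix (a sketch with illustrative numbers, not output from the model above):

```r
# Hand-compute accuracy and Cohen's kappa from a 3x3 confusion matrix.
# Rows are predictions, columns are reference classes (illustrative values).
cm <- matrix(c(10, 0, 0,
                0, 10, 2,
                0, 0, 8), nrow = 3, byrow = TRUE)
observed <- sum(diag(cm)) / sum(cm)                     # observed agreement
chance   <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # agreement expected by chance
kappa    <- (observed - chance) / (1 - chance)
round(observed, 4)   # 0.9333
round(kappa, 4)      # 0.9
```

Kappa rescales the observed agreement so that 0 means no better than chance and 1 means perfect agreement.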

Next, we predict the test dataset using the trained model. A confusion matrix, a table comparing actual and predicted values, is used to understand the performance of the model. A variable importance plot shows that petal width is the most important variable for predicting the species class.

> predictions <- predict(dt.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
##
## Overall Statistics
##
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
##
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(dt.fit))

[Figure: variable importance plot for the decision tree model]

The decision tree predicts the test set with an accuracy of 93%.

The second algorithm we use is k-nearest neighbours (KNN), which works well for the iris dataset. In simple terms, it classifies each test point based on the classes of its nearest neighbouring data points. k is the number of neighbours considered, and caret has chosen the best k value based on accuracy.

> knn.fit <- train(Species ~ ., data = train,
 method = "knn",
 trControl = fitControl,
 preProcess=c("center", "scale"))

> knn.fit

## k-Nearest Neighbors
##
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
##   k  Accuracy  Kappa
##   5  0.955     0.9325
##   7  0.955     0.9325
##   9  0.945     0.9175
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

> predictions <- predict(knn.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
##
## Overall Statistics
##
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
##
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(knn.fit))

[Figure: variable importance plot for the KNN model]

KNN predicts the test set with an accuracy of 93%.

The final method we use is a random forest. It aggregates the results of a set of decision trees, which minimises the error of any individual tree. The mtry value is the number of variables available for splitting at each tree node; here again, the optimal value is selected based on accuracy.

> rf.fit <- train(Species ~ ., data = train,
 method = "rf",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
##
##     margin

> rf.fit

## Random Forest
##
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##   2     0.9533333  0.9300
##   3     0.9616667  0.9425
##   4     0.9616667  0.9425
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.

> predictions <- predict(rf.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
##
## Overall Statistics
##
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
##
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(rf.fit))

[Figure: variable importance plot for the random forest model]

The random forest predicts the test set with an accuracy of 93%.

From the above analysis we can see that all three models have performed very well on the test set, so we could use the predictions from any of them.
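Since all three fits used the same trainControl, caret can also compare them directly on their cross-validation resamples. A sketch, assuming dt.fit, knn.fit and rf.fit from above are still in the workspace:

```r
# Collect the resampling results of the three caret models for comparison.
results <- resamples(list(CART = dt.fit, KNN = knn.fit, RF = rf.fit))
summary(results)   # accuracy and kappa distributions per model
bwplot(results)    # side-by-side box plots of resampled accuracy
```

This compares the models on their resampled accuracy distributions rather than on a single test-set number.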


Time Series Analysis of Monthly Milk Production

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. [Source: Wikipedia]

Time series analysis can be useful to identify and separate trends, cycles and seasonality of a dataset. In this blog I will illustrate how I analyse time series datasets in R and also show methods to forecast the data.

Let’s look at the Monthly Milk Production dataset from datamarket.com, which records monthly milk production, in pounds per cow, from January 1962 to December 1975.

The dataset can be downloaded from this link. Once we download the CSV file and place it in the working directory, we can read the file using the following code.

> library(forecast)
> milk <- read.csv("monthly-milk-production-pounds-p.csv")

Now we can transform the input to a time series by giving the frequency and start month/year. We can also print out the time series as seen below.

> milk.ts <- ts(milk, frequency=12, start=c(1962,1))
> milk.ts
 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1962 561 640 656 727 697 640 599 568 577 553 582 600
1963 566 653 673 742 716 660 617 583 587 565 598 628
1964 618 688 705 770 736 678 639 604 611 594 634 658
1965 622 709 722 782 756 702 653 615 621 602 635 677
1966 635 736 755 811 798 735 697 661 667 645 688 713
1967 667 762 784 837 817 767 722 681 687 660 698 717
1968 696 775 796 858 826 783 740 701 706 677 711 734
1969 690 785 805 871 845 801 764 725 723 690 734 750
1970 707 807 824 886 859 819 783 740 747 711 751 804
1971 756 860 878 942 913 869 834 790 800 763 800 826
1972 799 890 900 961 935 894 855 809 810 766 805 821
1973 773 883 898 957 924 881 837 784 791 760 802 828
1974 778 889 902 969 947 908 867 815 812 773 813 834
1975 782 892 903 966 937 896 858 817 827 797 843

Let’s plot out the time series.

> plot.ts(milk.ts)

[Figure: plot of the monthly milk production time series]

We can see that there is a seasonal variation in the time series within each year. The series also appears to follow an additive model, since the seasonal fluctuations are roughly constant over time. If the seasonal fluctuations grew with the level of the time series, that would suggest a multiplicative model (which is not the case here).

Let’s now decompose the series into its constituent components: a trend component, a seasonal component (if present) and an irregular component, which is the remainder of the time series once the others are removed.

> plot(decompose(milk.ts))

[Figure: decomposition of the milk time series into trend, seasonal and random components]

If we have a time series using an additive model with an increasing or decreasing trend and seasonality, we can use Holt-Winters exponential smoothing to make short-term forecasts. To fit a predictive model for the log of the monthly milk production we write the following code.

> logmilk.ts <- log(milk.ts)
> milk.ts.forecast <- HoltWinters(log(milk.ts))
> milk.ts.forecast

Holt-Winters exponential smoothing with trend and additive seasonal component.

Call:
HoltWinters(x = log(milk.ts))

Smoothing parameters:
 alpha: 0.587315
 beta : 0
 gamma: 1

Coefficients:
             [,1]
a     6.788338238
b     0.002087539
s1   -0.031338300
s2   -0.090489288
s3    0.043463485
s4    0.058024146
s5    0.127891508
s6    0.100852168
s7    0.057336548
s8    0.011032882
s9   -0.047972067
s10  -0.047584275
s11  -0.097105193
s12  -0.051371280

> plot(milk.ts.forecast)

As with simple exponential smoothing and Holt’s exponential smoothing, we can plot the original time series as a black line, with the fitted values as a red line on top of it.

[Figure: observed (black) and fitted (red) values from the Holt-Winters model]

To make forecasts for future times not included in the original time series, we use the “forecast()” function in the “forecast” package.

> milk.ts.forecast2 <- forecast(milk.ts.forecast, h=48)
> plot(milk.ts.forecast2)

[Figure: Holt-Winters forecast of log milk production for the next 48 months]

The forecasts are shown as a dark blue line, and the grey areas are the confidence intervals.

References: http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html

https://www.statmethods.net/advstats/timeseries.html

Predict if a client will subscribe to a term deposit using decision trees

A term deposit is a deposit with a specified period of maturity and earns interest. It is a money deposit at a banking institution that cannot be withdrawn for a specific term or period of time (unless a penalty is paid).  [Source: Wikipedia]

Predicting if a client will subscribe to a term deposit can help increase the efficiency of a marketing campaign and help us understand the factors that influence a successful outcome (subscription) from a client.

In this blog post, I will be using the bank marketing data set from the UCI Machine Learning Repository that can be downloaded from here. Let’s get started by invoking the necessary packages in R.

#Call necessary libraries
> library(rpart)
> library(caret)
> library(AppliedPredictiveModeling)

Now, let’s read the input file in CSV format specifying the path to the file. We also specify the header and separators using the read.csv function.

#Read the input file
> getwd()
> setwd("/Users/<path-to-csv>")
> bank<-read.csv("bank.csv",header=TRUE, sep=";")

The first thing to do is to explore the dataset.

#Understand the structure of the dataset
> names(bank)
 [1] "age" "job" "marital" "education" "default" "balance" "housing" 
 [8] "loan" "contact" "day" "month" "duration" "campaign" "pdays" 
[15] "previous" "poutcome" "y" 
> nrow(bank)
[1] 4521
> ncol(bank)
[1] 17
> str(bank)
'data.frame': 4521 obs. of  17 variables:
$ age      : int  30 33 35 30 59 35 36 39 41 43 ... 
$ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ... 
$ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ... 
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ... 
$ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 
$ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ... 
$ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ... 
$ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ... 
$ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ... 
$ day      : int  19 11 16 3 5 23 14 6 14 17 ... 
$ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ... 
$ duration : int  79 220 185 199 226 141 341 151 57 313 ... 
$ campaign : int  1 1 1 4 1 2 1 2 2 1 ... 
$ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ... 
$ previous : int  0 4 1 0 0 3 2 0 0 2 ... 
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ... 
$ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, the dataset has 17 variables and 4521 records. The target variable y has a binary outcome for the subscription status of the client (“yes” or “no”), so we will look at a classification type algorithm. Details of all the input variables can be found on the UCI website.

On initial examination, all the input variables seem important to the client’s decision, but how far can human intuition take us? Let’s test it with some initial exploratory visualisations.

> transparentTheme(trans = .4)
> featurePlot(x = bank[, c(1,6,12,13,14,15)], 
 y = bank$y, 
 plot = "pairs",
 auto.key = list(columns = 2))

[Figure: pairwise plot of age, balance, duration, campaign, pdays and previous, coloured by outcome]

From the above plot we can see how some variables relate to each other. Age and balance appear related: the lower the age, the higher the balance tends to be. Campaign and age show a relational pattern as well, but how do all these variables impact the outcome? Looking at the same graph colour-wise (red being a “no” and blue being a “yes” to subscription), we can also see that younger people have more “no” outcomes than “yes”. We can also plot the target variable by age to understand the distribution.

> boxplot(bank$age~bank$y, ylab="Age", xlab="y")

[Figure: box plot of client age by subscription outcome]

Now let’s focus on the target variable.

> table(bank$y)

no yes 
4000 521

> table(bank$y)/nrow(bank)

no yes 
0.88476 0.11524

From the above we can see that the class variable is unbalanced: about 88% of clients said no. For the sake of simplicity, we do not address this issue in this blog.
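Had we wanted to address the imbalance, caret offers a simple remedy. A sketch (not applied in the rest of this post) using downSample() to equalise the classes:

```r
# Down-sample the majority "no" class so both classes have equal counts.
# Column 17 of bank is the target y; the remaining columns are predictors.
balanced <- downSample(x = bank[, -17], y = bank$y, yname = "y")
table(balanced$y)   # 521 "no" and 521 "yes"
```

Up-sampling the minority class (caret's upSample()) is the mirror-image alternative when discarding data is too costly.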

Now, let us begin the machine learning analysis. First we set the seed value so that we get the same results for the trained model every time we run it. Then, we split the dataset into training and test sets. We allocate 60% of the data size into the training set and the remaining 40% to the test set.

> set.seed(999)
> trainIndex <- createDataPartition(bank$y, p = .6, 
 list = FALSE, 
 times = 1)
> train <- bank[ trainIndex,]
> test <- bank[-trainIndex,]

We create a 10 fold cross-validation method to help train the dataset. It will be repeated 5 times. This process decreases over-fitting in the training set and helps the model work on an unknown or new dataset. The model will be tested and trained several times on subsets of the training data to increase the accuracy in the test data.

> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

The data is now ready to be trained. We are using a recursive partitioning and regression tree method for the purpose of this blog.

Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. [Source: Wikipedia]

> fit <- train(y ~ ., data = train, 
 method = "rpart", 
 trControl = fitControl)

Let us now look at the results of the trained model.

> fit
CART

2713 samples
 16 predictor
 2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 2442, 2441, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.02236422  0.8951085  0.3912704
  0.05111821  0.8860340  0.2088233
  0.05271565  0.8850749  0.1745803

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02236422.

The next step is to use the trained machine learning model to predict the test dataset.

> predictions <- predict(fit, test)

The results of our predictions can be viewed using a confusion matrix – which is a table of the actual and predicted values.

> conf.mat <- table(test$y,predictions)
> acc <- sum(diag(conf.mat))/sum(conf.mat)
> acc

[1] 0.8904867

Our model has reached an accuracy of 89%. Note, however, that always predicting “no” would already score about 88% on this unbalanced dataset, so an ROC plot can illustrate our results better.
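A sketch of such an ROC plot with the pROC package, assuming the fitted model fit and the test set from above; predict() with type = "prob" returns class probabilities from a caret model, and the curve is drawn for the probability of the “yes” class:

```r
# ROC curve for the "yes" class using predicted class probabilities.
library(pROC)
probs   <- predict(fit, test, type = "prob")
roc.obj <- roc(response = test$y, predictor = probs$yes)
plot(roc.obj)   # sensitivity against specificity across thresholds
auc(roc.obj)    # area under the curve summarises the plot in one number
```

Unlike raw accuracy, the AUC is insensitive to the 88/12 class split, which makes it a fairer summary here.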

Possible extensions to our project include fitting different models and comparing individual accuracies against each other.

References: 

http://www.columbia.edu/~jc4133/ADA-Project.pdf

https://rpubs.com/nadi27oct/259406

Linear Regression in R

Linear regression is a fundamental method of regression analysis. It is a quick and easy way to understand how predictive algorithms work, both in general and in the field of machine learning.

To give a simple overview of how the algorithm works: linear regression fits a linear relationship for a given target variable (Y) using one or more explanatory variables (X).

For example, consider the age, height and weight of children. Intuitively, these variables are correlated: as height and age increase, so does weight. A linear regression fits a model, an equation for the target variable, assuming the relation between X and Y is linear. In many practical applications a linear regression model works effectively.

[Figure: example of a linear regression fit]

The figure above shows an example of a linear regression model: the red line is a linear fit of the target variable (Y axis) against one explanatory variable (X axis).

Let’s perform a linear regression analysis using the trees dataset in R.

First, let’s attach the dataset in the R workspace and have a sneak peek at how the data looks.

> data(trees)
> head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
> str(trees)
'data.frame': 31 obs. of 3 variables:
 $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  

The head() function gives us the first few rows of the dataset. We can see it has 3 variables: Girth, Height and Volume. The str() function gives a quick sense of what the data in each variable looks like, and for a more detailed view, summary() shows the mean, median and quartiles of each variable.

Let us now do some exploratory analysis of these variables by plotting them out.

> plot(trees)

[Figure: pairwise scatter plots of the trees variables]

We can see from the above plot that Volume and Girth have a linear relationship. Let’s build a linear model using this knowledge: we will predict the volume of a tree from its girth.

> lmod<-lm(Volume~Girth, data = trees)
> summary(lmod)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331 
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16

The lm() function calls a linear model and you can see the target and dependent variables the model uses. By using the summary we can get the details of the model.

The model has an intercept of -36.9435 and a slope of 5.0659. The intercept is the point where the fitted line crosses the Y axis, and the slope is the steepness of the line. So for our linear equation Y = a + bX, a (the intercept) is -36.9435 and b (the slope) is 5.0659, where Y is Volume and X is Girth.

From the significance codes we can see that Girth is statistically significant in our model. We can plot the model for a better understanding.
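The command that produced the plot below was omitted from the original post; a minimal, self-contained reconstruction (refitting the same model) would be:

```r
# Scatter plot of Volume against Girth with the fitted line drawn in red.
data(trees)                                # built-in dataset
lmod <- lm(Volume ~ Girth, data = trees)   # same model as fitted above
plot(trees$Girth, trees$Volume, xlab = "Girth", ylab = "Volume")
abline(lmod, col = "red")                  # line: Volume = -36.94 + 5.07 * Girth
```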

[Figure: Volume against Girth with the fitted regression line in red]

That’s it! We have built our first linear model. The red line shows that we are able to draw a line matching most of our data points; some points are farther from the line and some are quite close. The distance between the predicted and actual values is the residual. In following blogs we will see how to improve the model fit further.
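To inspect the residuals directly, and to use the model for a new prediction, we can run the following (a self-contained sketch that refits the same model; the Girth value of 15 is an arbitrary example):

```r
# Residuals are the observed Volume values minus the fitted values.
data(trees)
lmod <- lm(Volume ~ Girth, data = trees)
res <- resid(lmod)
round(summary(res), 3)   # matches the Residuals block of summary(lmod)

# Predict the Volume of a new tree with Girth 15:
predict(lmod, newdata = data.frame(Girth = 15))
```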

The types of machine learning problems

I have written this two-part blog to articulate the technical aspects of machine learning in layman’s terms. For part 1 of this series click here


In the last blog we looked at a small introduction to machine learning and why it is important. I suggest you read that blog first to get the most out of this one.

This time we will dive into a more technical introduction to the types of machine learning problems. Let’s look closely into the situations you might come across when you are trying to build your own predictive model.


Situation 1

Imagine you own a bakery. Your business seems to be quite popular among all types of customers: kids, teens and adults. But you want to know whether people truly like your bakery or not. The answer can depend on many things (e.g., the order they place, their age, their favourite flavour, suggestions from family and friends); these are the predictor variables that influence the answer. But the answer you are looking for is a simple yes or no: do people like your bakery? This type of machine learning problem is known as classification. Sometimes there are more than 2 categories, for example how much people like your bakery (very much, quite a bit, not at all); these are ordinal classes. Ordinal classes can also be labelled 1, 2 or 3, but remember this is not the same as regression (see below).

Situation 2

You are the owner of the same bakery. But you want more than a classification answer. You want to go straight to the target and find out how much a customer might spend based on their historic data. You are now looking at a numerical scale measurement for an answer. It can range anywhere from £5 to £15 per visit. Imagine every time you see a new customer walk into your bakery you see the amount they are most likely to spend floating above their head. This is a regression situation.

Situation 3

You don’t know what you want to know. You just want to find out whether there are groups of customers who act in a particular way. Do little kids always go for the cupcakes with cartoon characters? Do young teens with their girlfriend or boyfriend go for the heart-shaped ones? You want the data to frame the question and answer it. We are looking for patterns, groups or clusters in the data. This is the clustering problem.

In situations 1 and 2 we have a question framed, we have a set of predictors that we think might influence the answer to our question. This type of machine learning is known as supervised learning. In situation 3, we did not have any question in our mind but we are looking to find patterns or groups from the data. This is known as unsupervised learning.

Summary:

  • Classification: supervised machine learning method where the output variable takes the form of class labels.
  • Regression: supervised machine learning method where the output variable takes the form of continuous values.
  • Clustering: unsupervised machine learning method where we group a set of objects and find whether there is some relationship between them.

For part 1 of this series click here

Or read my blog on big data

Why do we need Machine Learning?

I have written this two-part blog to articulate the technical aspects of machine learning in layman’s terms. For part 2 of this series click here


It is now the age of data. For several years we humans have been collecting and collating data for various purposes. When I was a kid, I loved borrowing books from the local lending library in my neighbourhood in Chennai, India. It was a tiny place stacked with aisles of books; I got access to my first Harry Potter book there. Every time I borrowed a book, the librarian would pull out a heavy bundle of papers, go through them to locate my library number, and once he found my sheet he would jot down the title of the book and the date. I used to wonder back then (keep in mind this was the 90s) how the poor librarian could go through all those papers and find which customers needed to pay late fees. It must have been a nightmare to go through those papers one by one and see whether the due date of a book had passed.


Gone are the 90s; we now have computers in the 00s. I visit the same library and see the stacks of paper replaced by a bulky white computer. I borrow a book, the librarian enters the book ID and customer ID on the computer, and then it is magic: all late payments are tracked and everything is perfect. The librarian doesn’t have to go through stacks and stacks of data at the weekend. The librarian is happy.

Say hello to the 2010s. What now? Other libraries are popping up close to his. Imagine these are not council libraries lending books for free, but privately owned libraries making a good profit by lending books. New libraries popping up means competition for your business. With the advent of the Kindle, fewer people prefer to own and read physical books, which means fewer will visit his library. What is the use of the computer now that things are getting more digital? What does the librarian have that will draw the right customers to his library?

The answer lies in data. Imagine he has access to all his previous and current customers’ data. He has collected information on his customers’ profiles (age, gender, address, education, qualifications) and on each book’s profile. Has a particular book been more popular than others? Does age affect the genre of book a customer chooses?

He can answer several other questions like these using his data. What makes a customer visit his library regularly? Is it the customer’s location, age or gender? Is it the books the library stocks? If only he could build a machine that would take into account every single factor that impacts a customer and somehow learn what makes him or her stay. It would be useful if the machine could simply predict whether a new customer will stay or leave.

The answer lies in machine learning. What is machine learning anyway? It is the process of analysing vast amounts of information (or data), looking at several variables (such as customer information) and predicting an outcome, for example whether a future customer will stay or leave.

The librarian could simply use some form of marketing to gain more publicity. But imagine a new situation: replace the library with a bigger organisation. This time there are several more problems to address: many more customers to target, scattered around the country, and many more variables. We are faced with a situation where we need something more than just advertising. An important point to note is that machine learning is not a replacement for your current operations; it is a complement, an add-on bonus.

See how Amazon has taken the librarian’s problem to the next level, with several years of online and offline experience in book sales.

For part 2 of this series click here