Predictive Modelling of Iris Species with caret

The Iris dataset was used in Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. It includes three iris species with 50 samples each, as well as several measurements for each flower. In this post, I will use the caret package in R to predict the species of various iris flowers. Let's jump into the code.

Loading the necessary packages in R

#Calling libraries
> library(AppliedPredictiveModeling)
> library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

> library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

The iris data frame is already included in R and can be loaded with the data() command. A quick summary() and head() give us a nice introduction to the dataset, and nrow() confirms the number of observations. We also set a seed value so that the results are reproducible.

#Prepare the dataset
> data(iris)
> summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

> head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

> nrow(iris)

## [1] 150

> set.seed(9999)

The featurePlot() command from the caret package shows how the species are distributed in a pairwise plot. One observation from this plot is that versicolor and virginica overlap considerably, while setosa is quite distinct. The intuition is that setosa will be easy to predict, whereas separating versicolor from virginica may be more challenging.

#Visualising the dataset
> transparentTheme(trans = .4)
> featurePlot(x = iris[, 1:4], 
             y = iris$Species, 
             plot = "ellipse",
             auto.key = list(columns = 3))

[Figure: pairwise ellipse plot of the four iris measurements, grouped by species]

Now we split the dataset into train and test partitions: 80% of the data goes into the training set and the remaining 20% into the test set. We also define a 10-fold cross-validation scheme, repeated 5 times. Cross-validation reduces over-fitting on the training set and helps the model generalise to unseen data, because the model is trained and evaluated several times on different subsets of the training data.

#Split into train and test dataset
> trainIndex <- createDataPartition(iris$Species, p = .8,
 list = FALSE,
 times = 1)
> train <- iris[ trainIndex,]
> test  <- iris[-trainIndex,]

> nrow(train)

## [1] 120

> nrow(test)

## [1] 30

#Cross validation
> fitControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)

This is a classification problem, since the outcome variable is a class label. Many machine learning algorithms are available for class prediction; we implement three of them here and compare them to find the best fit.

First, we fit a decision tree model (rpart) to the training set. A decision tree takes an "if this, then that" approach, and it is fast to fit and easy to interpret by visualising the tree.

We also preprocess the dataset by centring and scaling the predictors, so that variables with very different ranges do not dominate the model.

> dt.fit <- train(Species ~ ., data = train,
 method = "rpart",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: rpart

> dt.fit

## CART 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa 
##   0.00  0.9550000  0.9325
##   0.45  0.7683333  0.6525
##   0.50  0.3333333  0.0000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

Accuracy is the percentage of correctly classified instances out of all instances. From the results above we can see that the model has a good accuracy value. Kappa is similar to accuracy, but it is normalised against the accuracy expected by chance, which makes it more useful for class-imbalanced problems. cp is the complexity parameter, which controls the size of the decision tree; caret has chosen the cp value, and hence the tree size, with the best cross-validated accuracy.
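
If you want to look at the tree that caret actually picked, the fitted rpart object is stored in dt.fit$finalModel. Here is a quick sketch using the base rpart printing and plotting functions (note that because of the preProcess step, the split points are expressed on the centred and scaled predictors, not the raw measurements):

#Inspect the final CART tree selected by caret
> print(dt.fit$finalModel)
> plot(dt.fit$finalModel, margin = 0.1)
> text(dt.fit$finalModel, use.n = TRUE)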

Next, we predict on the test dataset using the trained model. A confusion matrix is used to understand the performance of the model: it is a table-wise comparison of actual and predicted values. A variable importance plot shows that petal width is the most important variable for predicting the species.

> predictions <- predict(dt.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
## 
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
## 
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(dt.fit))

[Figure: variable importance plot for the decision tree model]

The decision tree predicts the test set with an accuracy of 93%.

The second algorithm we use is k-nearest neighbours (KNN), which works well for the iris dataset. In simple terms, it classifies each observation based on the classes of the k training points closest to it. Here k is the number of neighbours considered, and caret has chosen the best k value based on cross-validated accuracy.

> knn.fit <- train(Species ~ ., data = train,
 method = "knn",
 trControl = fitControl,
 preProcess=c("center", "scale"))

> knn.fit

## k-Nearest Neighbors 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy  Kappa 
##   5  0.955     0.9325
##   7  0.955     0.9325
##   9  0.945     0.9175
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
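
caret has only tried its small default grid of k values here (5, 7 and 9). If we wanted to search a wider range ourselves, we could pass an explicit tuneGrid to train(); a minimal sketch (the grid of odd values below is just an illustrative choice, not part of the original fit):

#Optional: tune k over an explicit grid of odd values
> knn.grid <- expand.grid(k = seq(3, 21, by = 2))
> knn.fit2 <- train(Species ~ ., data = train,
 method = "knn",
 trControl = fitControl,
 tuneGrid = knn.grid,
 preProcess = c("center", "scale"))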

> predictions <- predict(knn.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
## 
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
## 
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(knn.fit))

[Figure: variable importance plot for the KNN model]

KNN also predicts the test set with an accuracy of 93%.
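
We loaded pROC at the start but have not used it yet. Since caret can return class probabilities, we can also look at a one-vs-rest ROC curve for a single class; a minimal sketch, taking virginica as the positive class:

#One-vs-rest ROC curve for the virginica class
> probs <- predict(knn.fit, test, type = "prob")
> roc.virginica <- roc(response = as.numeric(test$Species == "virginica"),
 predictor = probs$virginica)
> auc(roc.virginica)
> plot(roc.virginica)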

The final method we use is the random forest. A random forest aggregates the results of a whole set of decision trees, which minimises the error that any individual tree might make. The mtry value is the number of variables randomly sampled as split candidates at each tree node; here again, the optimal value is selected based on cross-validated accuracy.

> rf.fit <- train(Species ~ ., data = train,
 method = "rf",
 trControl = fitControl,
 preProcess=c("center", "scale"))

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

> rf.fit

## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa 
##   2     0.9533333  0.9300
##   3     0.9616667  0.9425
##   4     0.9616667  0.9425
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.

> predictions <- predict(rf.fit, test)

> confusionMatrix(predictions, test$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
## 
##                Accuracy : 0.9333
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 8.747e-12
## 
##                   Kappa : 0.9
##  Mcnemar's Test P-Value : NA
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

> plot(varImp(rf.fit))

[Figure: variable importance plot for the random forest model]

The random forest also predicts the test set with an accuracy of 93%.
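
The underlying randomForest object is available as rf.fit$finalModel, so we can also check the out-of-bag (OOB) error estimate and the raw importance scores that randomForest reports on the training data; a quick sketch:

#Out-of-bag error estimate and raw importance scores
> print(rf.fit$finalModel)
> importance(rf.fit$finalModel)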

From the above analysis we can see that all three models perform very well on this dataset, so we could use the predictions from any of them.
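
Because all three models were trained with the same trainControl settings, caret can also summarise their cross-validation results side by side with resamples(). A minimal sketch (for a strictly paired comparison you would set the same seed before each train() call, but the summary is still a useful overview; dotplot() comes from lattice, which caret already loads):

#Compare the cross-validated accuracy of the three models
> results <- resamples(list(CART = dt.fit, KNN = knn.fit, RF = rf.fit))
> summary(results)
> dotplot(results)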
