The Iris dataset was used in Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. It contains three iris species with 50 samples each, along with four measurements per flower: sepal length, sepal width, petal length, and petal width. In this post, I will use the caret package in R to predict the species of Iris flowers. Let’s jump into the code.

Loading the necessary packages in R

#Loading libraries
> library(AppliedPredictiveModeling)
> library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
> library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var

The iris data frame ships with R, and we can load it with the data() command. A quick summary() and head() give a nice introduction to the dataset. We also set a seed value so that the results are reproducible.

#Prepare the dataset
> data(iris)
> summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
> head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
> nrow(iris)
## [1] 150
> set.seed(9999)

The featurePlot() command from the caret package shows how the Species values are distributed in a pairwise plot. The plot shows that Versicolor and Virginica overlap and have similar patterns, while Setosa is quite distinct. The intuition is that Setosa will be easy to predict, whereas separating Versicolor from Virginica may be more challenging.

#Visualising the dataset
> transparentTheme(trans = .4)
> featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "ellipse",
auto.key = list(columns = 3))
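
The same observation can be checked numerically without a plot. The following is a minimal base R sketch that compares the range of petal length per species; the name petal_range is only illustrative.

```r
data(iris)

# Range of petal length within each species (one column per species)
petal_range <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(petal_range) <- c("min", "max")
petal_range

# Setosa's longest petal is shorter than Versicolor's shortest ...
petal_range["max", "setosa"] < petal_range["min", "versicolor"]
# ... while the Versicolor and Virginica ranges overlap
petal_range["max", "versicolor"] > petal_range["min", "virginica"]
```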

Now we split the dataset into a train and a test partition, using 80% of the rows for training and the remaining 20% for testing. We also define 10-fold cross-validation, repeated 5 times. The model is trained and evaluated several times on different subsets of the training data, which gives a more reliable estimate of how it will perform on new data and lets caret choose tuning parameters without touching the test set. This guards against selecting a model that over-fits the training data.

#Split into train and test dataset
> trainIndex <- createDataPartition(iris$Species, p = .8,
list = FALSE,
times = 1)
> train <- iris[ trainIndex,]
> test <- iris[-trainIndex,]
> nrow(train)
## [1] 120
> nrow(test)
## [1] 30
#Cross validation
> fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5)

Our problem is a classification problem, since the outcome variable is a class. There are several machine learning algorithms for class prediction. We implement three methods here and compare them to find the best fit.

First, we fit a decision tree model (rpart) to the train dataset. A decision tree is a model with an “if this, then that” approach, and it is easy and fast to interpret by visualising the tree.

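We also center and scale the predictors so that variables with very different ranges of values do not dominate the outcome. This matters most for distance-based methods such as KNN; tree-based splits are not affected by it.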
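
As a small illustration of what centering and scaling do, the sketch below applies them by hand with base R. It uses the full iris data for simplicity, whereas caret estimates the means and standard deviations from the training rows of each resample; the variable names are illustrative.

```r
data(iris)
x <- iris[, 1:4]

# Column means and standard deviations of the four predictors
centers <- colMeans(x)
scales  <- apply(x, 2, sd)

# Subtract the mean and divide by the standard deviation, column by column
x_scaled <- scale(x, center = centers, scale = scales)

round(colMeans(x_scaled), 10)   # every column now has mean 0
apply(x_scaled, 2, sd)          # and standard deviation 1
```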

> dt.fit <- train(Species ~ ., data = train,
method = "rpart",
trControl = fitControl,
preProcess=c("center", "scale"))
## Loading required package: rpart
> dt.fit
## CART
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00 0.9550000 0.9325
## 0.45 0.7683333 0.6525
## 0.50 0.3333333 0.0000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

Accuracy is the percentage of correctly classified instances out of all instances. From the above results we can see that the model has a good accuracy value. Kappa is similar to accuracy, but it is normalised against the agreement expected from random chance, which makes it more useful for class-imbalanced problems. cp is the complexity parameter, which controls the size of the decision tree: larger values prune the tree more aggressively. We can see that caret has chosen the cp value with the best cross-validated accuracy.
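
The relationship between the two columns can be worked out directly. The sketch below assumes that every held-out fold contains the three species in equal numbers, so the agreement expected by chance is 1/3; under that assumption the Kappa column follows from the Accuracy column.

```r
# Accuracy values copied from the cross-validation table above
accuracy <- c(0.9550000, 0.7683333, 0.3333333)

# Agreement expected by chance with three equally frequent classes
chance <- 1 / 3

# Kappa rescales accuracy so that 0 means "no better than chance"
# and 1 means perfect agreement
kappa <- (accuracy - chance) / (1 - chance)
round(kappa, 4)
```

The last row shows why an accuracy of 0.33 is worth nothing here: it is exactly what always predicting a single species would achieve.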

Next, we predict the test dataset using the trained model. A confusion matrix is used to understand the performance of the model; it is a table comparing actual and predicted values. A variable importance plot shows that petal width is the most important variable in predicting the Species class.

> predictions <- predict(dt.fit, test)
> confusionMatrix(predictions, test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 2
## virginica 0 0 8
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.7793, 0.9918)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.747e-12
##
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8000
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 1.0000 1.0000 0.9091
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2667
## Detection Prevalence 0.3333 0.4000 0.2667
## Balanced Accuracy 1.0000 0.9500 0.9000
> plot(varImp(dt.fit))

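The decision tree predicts the test set with an accuracy of 93%: 28 of the 30 flowers are classified correctly, and the two errors are Virginica flowers predicted as Versicolor.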
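
The accuracy and the per-class statistics can be reproduced by hand from the confusion matrix above. This is a minimal base R sketch with the counts typed in from that output; the variable names are illustrative.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 2,
                0,  0, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n            # correct predictions / all predictions

true_pos  <- diag(cm)
false_neg <- colSums(cm) - true_pos      # flowers of this class that were missed
false_pos <- rowSums(cm) - true_pos      # other classes predicted as this class
true_neg  <- n - true_pos - false_neg - false_pos

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)

round(accuracy, 4)
round(rbind(sensitivity, specificity), 4)
```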

The second algorithm we use is the k-nearest neighbors (KNN) algorithm, which works well for classifying the iris dataset. In simple words, it looks at the training points closest to a new data point and assigns the class that is most common among them. K is the number of neighbors that get a vote, and caret has chosen the K value based on cross-validated accuracy.
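
To make the idea concrete, here is a small base R sketch of a nearest-neighbor vote for a single flower. The helper knn_predict() is hypothetical and not part of caret; it uses unscaled measurements and the whole iris data as the training set, so it only illustrates the principle.

```r
data(iris)

# Hypothetical helper: classify one flower by majority vote among its
# k nearest training rows, using Euclidean distance
knn_predict <- function(train_x, train_y, query, k) {
  diffs   <- sweep(as.matrix(train_x), 2, as.numeric(unlist(query)))
  dists   <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# A made-up flower with small petals
new_flower <- c(Sepal.Length = 5.0, Sepal.Width = 3.4,
                Petal.Length = 1.5, Petal.Width = 0.2)
knn_predict(iris[, 1:4], iris$Species, new_flower, k = 7)   # "setosa"
```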

> knn.fit <- train(Species ~ ., data = train,
method = "knn",
trControl = fitControl,
preProcess=c("center", "scale"))
> knn.fit
## k-Nearest Neighbors
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.955 0.9325
## 7 0.955 0.9325
## 9 0.945 0.9175
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
> predictions <- predict(knn.fit, test)
> confusionMatrix(predictions, test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 2
## virginica 0 0 8
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.7793, 0.9918)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.747e-12
##
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8000
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 1.0000 1.0000 0.9091
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2667
## Detection Prevalence 0.3333 0.4000 0.2667
## Balanced Accuracy 1.0000 0.9500 0.9000
> plot(varImp(knn.fit))

KNN also predicts the test set with an accuracy of 93%.

The final method we use is the random forest. This method builds a set of decision trees and aggregates their predictions into the final result, which reduces the error caused by any individual tree. The mtry value is the number of variables randomly sampled as candidates at each split. Here again, the optimal model is selected based on accuracy.
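
The two ingredients described here, random candidate variables and aggregation by vote, can be sketched in a few lines of base R. This is not a random forest implementation; the votes vector is made up to stand in for the output of five trees.

```r
set.seed(1)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
mtry <- 3

# At each split, only a random subset of mtry predictors is considered
candidates <- sample(predictors, mtry)
candidates

# Made-up votes from five trees for a single flower
votes <- c("versicolor", "virginica", "versicolor", "versicolor", "virginica")

# The forest predicts the class that receives the most votes
forest_prediction <- names(which.max(table(votes)))
forest_prediction   # "versicolor"
```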

> rf.fit <- train(Species ~ ., data = train,
method = "rf",
trControl = fitControl,
preProcess=c("center", "scale"))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
> rf.fit
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9533333 0.9300
## 3 0.9616667 0.9425
## 4 0.9616667 0.9425
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
> predictions <- predict(rf.fit, test)
> confusionMatrix(predictions, test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 2
## virginica 0 0 8
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.7793, 0.9918)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.747e-12
##
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8000
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 1.0000 1.0000 0.9091
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2667
## Detection Prevalence 0.3333 0.4000 0.2667
## Balanced Accuracy 1.0000 0.9500 0.9000
> plot(varImp(rf.fit))

The random forest also predicts the test set with an accuracy of 93%.

From the above analysis we can see that all three models have performed very well, with identical confusion matrices on the test set, so we can use the predictions from any of them.
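
With only 30 test flowers, the accuracy estimate is fairly uncertain, which is what the 95% CI line in the output reflects. The sketch below assumes that this interval is the exact binomial interval and recomputes it with base R, along with the one-sided test against always guessing a single species.

```r
# 28 of 30 test flowers were classified correctly
correct <- 28
total   <- 30

# Exact (Clopper-Pearson) binomial confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int
round(ci, 4)

# One-sided test: is the accuracy better than the no-information rate of 1/3?
p_value <- binom.test(correct, total, p = 1 / 3,
                      alternative = "greater")$p.value
p_value
```

The interval spans roughly 20 percentage points, so a single test split of this size cannot separate three models whose accuracies are this close.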