Linear regression is a fundamental method used as part of Regression analysis. It is a quick and an easy way to understand how predictive algorithms work in general and in the field of machine learning.
To give a simple overview of how the algorithm works, a linear regression fits a linear relationship for a given target variable(Y) using one or many scalar dependent variables(X).
For example consider the age, height and weight of children. By intuition we can say that these variables are correlated. With increase in height and age there is also an increase in weight. A linear regression would fit a model or an equation for the target variable assuming the relation between X and Y is linear. In many practical applications a linear regression model works effectively.
See figure above of an example of a linear regression model. The red line indicates a linear regression model of a target variable (Y axis) with one dependent variable (X axis)
Let’s perform a linear regression analysis using the trees dataset in R
First let’s attach the dataset in the R workspace and have a sneak peak of how the data looks.
> data(trees) > head(trees) Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2 4 10.5 72 16.4 5 10.7 81 18.8 6 10.8 83 19.7 > str(trees) 'data.frame': 31 obs. of 3 variables: $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ... $ Height: num 70 65 63 72 81 83 66 75 80 75 ... $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ... > summary(trees) Girth Height Volume Min. : 8.30 Min. :63 Min. :10.20 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40 Median :12.90 Median :76 Median :24.20 Mean :13.25 Mean :76 Mean :30.17 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30 Max. :20.60 Max. :87 Max. :77.00
The head() function gives us the a top sample of dataset. We can see the dataset has 3 variables namely the Girth, Height and Volume. From the str() function we can get a quick understanding of how the data of each variable looks like. For a more detailed understanding use summary() to see the mean, median and quartiles of each variable.
Let us now do some exploratory analysis of these variables by plotting them out.
We can see from the above plot that volume and girth have a linear relationship. Let’s build a linear model using the knowledge we have gained. Let’s assume we want to predict the volume of a tree using the girth.
> lmod<-lm(Volume~Girth, data = trees) > summary(lmod) Call: lm(formula = Volume ~ Girth, data = trees) Residuals: Min 1Q Median 3Q Max -8.065 -3.107 0.152 3.495 9.587 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -36.9435 3.3651 -10.98 7.62e-12 *** Girth 5.0659 0.2474 20.48 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.252 on 29 degrees of freedom Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331 F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
The lm() function calls a linear model and you can see the target and dependent variables the model uses. By using the summary we can get the details of the model.
The model has an intercept of -36.9435 and a slope of 5.0659. The intercept indicates the point where the equation passes through the Y axis and the slope is the steepness of the linear equation. So for our linear equation Y = a + bX, the model denotes that b is -36.9435 and a is 5.0659 where Y is Volume and X is Girth.
From the Signif. code we can see that Girth is statistically significant for our model. We can plot our model for better understanding.
That’s it! We have build our first linear model. The red line indicates that we are able to draw a line to match most of our data points. Some data points are further away to the line and some are quite close. The distance between the predicted and actual values is the residual. In the following blogs let’s see how we can improve our model fit further.