When we go about understanding Machine Learning models, one of the first things we generally come across is Simple Linear Regression. It’s the first step into Machine Learning and this post will help you understand all you need to know about it. Let’s start with understanding what Regression is.
The term regression was first coined in the 19th century to describe the phenomenon that the heights of descendants of tall ancestors tend to regress (or move back) towards the average height. In other words, regression is the tendency to return to the mean. Interesting, right?
In statistics, the term is defined as a measure of the relation between an output variable and the input variable(s). Hence, Linear Regression assumes a linear relationship between the former and the latter.
Depending upon the number of input variables, Linear Regression can be classified into two categories:
1. Simple Linear Regression: a single input variable.
2. Multiple Linear Regression: more than one input variable.
This post is dedicated to explaining the concepts of Simple Linear Regression, which would also lay the foundation for you to understand Multiple Linear Regression. Besides that, we’ll implement Linear Regression in Python to understand its application in Machine Learning. And while doing so, we’ll also learn some important facts down the line. Having said that, let’s dive right into it.
We will express the input variable as X and the output variable as Y, as is generally done. We can then write the relationship between X and Y as:
Y = β0 + β1 * X
Here, the two constant terms (β) are Intercept and Slope. You might recognize this expression from your school algebra, where the general expression for a straight line was,
y = c + mx
where c is the intercept and m is the slope. That’s what we try to do in Linear Regression. We try to fit a straight line to observe a relationship between the input and output variables and then further use it to predict the output of unseen inputs.
Let’s bring in data to understand how this works.
We are going to use Advertising data which is available on the site of USC Marshall School of Business. You can download it here.
This data set is used in the popular book “An Introduction to Statistical Learning”, which, by the way, is a must-read if you want to understand the basic statistics behind Machine Learning.
The advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for three different media: TV, radio, and newspaper. Here is what it looks like:
Sales (*1000 units) vs Advertising budget (*1000 USD)
The first row of the data says that the advertising budgets for TV, radio, and newspaper were $230.1k, $37.8k, and $69.2k respectively, and the corresponding number of units that were sold was 22.1k (or 22,100).
We will now try to understand how each of these 3 media is associated with sales using Simple Linear Regression. Hence, our input variable (X) will be one of the advertising agents, and our output variable (Y) will be sales.
It’s important to understand your data before drawing conclusions. One must take care of the measuring units and the scale of the variables to tell the right story.
Let’s now observe how our sales look when plotted against each of the advertising agents.
import pandas as pd
# Give the path to your csv file as the parameter to read_csv
data = pd.read_csv('.../advertising.csv')
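Before plotting, it’s worth taking a quick look at the raw numbers. Here’s a minimal sanity check (assuming the file loaded into data as above), which also helps with the units-and-scale point made earlier:
# Inspect the first few rows and the scale of each column
print(data.head())
print(data.shape)       # one row per market (200 in total)
print(data.describe())  # min/max/mean reveal the units and spread of each budget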
After loading the CSV file, run the following code to plot the variables.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(14,3))
plt.subplot(1,3,1)
plt.scatter(data['TV'], data['sales'], color = 'blue')
plt.xlabel('TV')
plt.ylabel('sales')
plt.subplot(1,3,2)
plt.scatter(data['radio'], data['sales'], color = 'red')
plt.xlabel('radio')
plt.ylabel('sales')
plt.subplot(1,3,3)
plt.scatter(data['newspaper'], data['sales'], color = 'green')
plt.xlabel('newspaper')
plt.ylabel('sales')
plt.show()
The above code will produce the following scatter plots.
By looking at the first plot (left), one can deduce that there’s a sharp upward trend in sales as TV advertising is increased. A similar trend can also be observed in the second graph (middle), depicting radio advertising. However, in the last graph (right), the trend is far less defined. What does this mean?
When an increase in the input variable X is observed with a simultaneous increase or decrease in output variable Y, there’s said to be a correlation between the two. This is a measure of how strongly X and Y relate to each other. By visualizing the data, we can intuitively see that TV and sales are strongly related (highly correlated) to each other. On the other hand, there seems to be a weaker correlation between newspaper budget and sales.
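To put a number on this intuition, we can compute the Pearson correlation of each budget column with sales (column names as used in the plotting code above):
# Pearson correlation of each advertising budget with sales
print(data[['TV', 'radio', 'newspaper']].corrwith(data['sales']))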
Correlation can be misleading, as it often looks like causation. Just because two variables are correlated does not imply that one causes the change in the other. It is for us to dig deeper and decide whether it’s just a case of correlation or whether it’s causation too.
Linear Regression will help us determine the strength of this relationship i.e how accurately we can predict sales, given a certain advertising medium.
Now that we have understood the data, let’s build a simple model to understand the trend between sales and the advertising agent. For this post, I’ll be using TV as an agent to build the following regression model. I encourage you to do it for the other two agents (radio & newspaper).
Sales = β0 + β1 * TV
This regression model will find the best line that can represent the data, by adjusting the two constants, 𝛽0 & 𝛽1. The best fit is the one that shows the least amount of error in predicting the output for a given value of the input.
We will learn more about the error function and model evaluation in just a minute. Till then, let’s code and build the above model.
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Defining X and Y
X = data['TV'].values.reshape(-1,1)
Y = data['sales'].values.reshape(-1,1)
# Splitting the data into Train & Test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# Fitting the model on Train dataset
Model = LinearRegression(fit_intercept=True)
Model.fit(X_train, y_train)
# Predicting and storing results for Test dataset
train_fit = Model.predict(X_train)
test_pred = Model.predict(X_test)
Notice that we have split our data into two subsets, train dataset, and test dataset. It’s a common practice in Machine Learning. This allows us to check the performance of the model on both seen and unseen data.
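As a quick sanity check on the split: with test_size=0.2 on 200 rows, we expect 160 training samples and 40 test samples.
# Verify the sizes of the train and test subsets
print(X_train.shape, X_test.shape)  # expected: (160, 1) (40, 1)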
You can find the full code here. Let’s now plot and visualize our model on top of the train and test data sets.
plt.figure(figsize=(12,4))
# Plotting Regression line on Train Dataset
plt.subplot(1,2,1)
plt.scatter(X_train, y_train, color='gray')
plt.plot(X_train, train_fit, color='blue', linewidth=2)
plt.xlabel('TV')
plt.ylabel('sales')
plt.title("Train Dataset")
# Plotting Regression line on Test Dataset
plt.subplot(1,2,2)
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, test_pred, color='blue', linewidth=2)
plt.xlabel('TV')
plt.ylabel('sales')
plt.title("Test Dataset")
plt.show()
Great! Our regressor has fit the best model by adjusting the constants. As we can see, it has done a pretty good job of placing a straight line through both the train and test data. Let’s check the values of the intercept and the x-coefficient (slope).
print("Intercept is ", Model_1.intercept_[0])
print("Coefficient is ", Model_1.coef_[0][0])
You should get these outputs, with perhaps very slight variations:
Intercept is 7.032593549127693
Coefficient is 0.047536640433019764
The intercept is the value of the output when the input is 0. In this case, it is the estimated value of sales in the absence of a TV advertising budget. The value of the intercept here is 7.033, which means that without TV advertisements, about 7,033 units would be sold (7.033 * 1000, since sales are in thousands of units).
It’s important to note that the intercept is not always relevant to the problem and may act just as a constant that is needed to adjust the regression line.
The coefficient, or the slope, is the measure of the change in the output variable per unit change in the input variable. Here, the coefficient of our model is 0.048, which means that if we increase the TV advertising budget by 1 unit ($1000), the sales of the product will increase by approximately 48 units (0.048 * 1000).
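To make this concrete, here’s a small worked example with a hypothetical TV budget of $100k (100 in the data’s units); the manual calculation and the model’s predict method should agree:
# Manual prediction from the fitted constants: intercept + slope * input
budget = 100  # hypothetical TV budget of $100k
manual = Model.intercept_[0] + Model.coef_[0][0] * budget
print(manual)  # roughly 7.033 + 0.048 * 100 = ~11.79, i.e. ~11,790 units
# The same prediction through the model
print(Model.predict([[budget]]))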
Try to find these values for radio and newspaper. Which advertising medium do you think affects sales the most?
Note that the coefficient values here are positive. A negative coefficient would imply a negative correlation: the output decreases as the input increases. In our data, we observe a positive correlation in all cases, i.e. sales increase whenever the advertising budget is increased in any medium (which makes sense for any business).
Awesome! We just built our Simple Linear Regression model. But how do we know if it’s good enough to predict the value of sales given a budget for TV advertisement? Should we rely on the model to make the right business decisions? If yes, then what are the losses involved? It’s important to answer these questions in real business problems. To do so, we need to evaluate our model and measure how much error it is making while predicting the output.
The error function can be considered as the distance between the current state and the ideal state.
For example, if you have to descend from a mountain peak, then the error function is the height of the mountain, and you keep descending in small steps while reducing the error (height) until you reach the bottom i.e. state of zero error.
Similarly, a model starts from an initial state where it assumes some values for the parameters involved, and then adjusts those parameters to reduce the error function.
In this case, the intercept and the advertising coefficient are the parameters to be adjusted, whereas the error function is the overall difference between the actual sales and the predicted sales.
The vertical lines denote the individual errors in model prediction
The points that lie on the line or are very close to it are the ones that the model was able to predict correctly. However, there are many points that lie away from the regression line. The distance of every such point from the straight line accounts for the error.
Hence, the error function e for the i-th value can be defined as follows:
e_i = y_i − ŷ_i
where y_i is the actual output and ŷ_i is the output predicted by the model.
This error term is also called residual and can be negative or positive depending on whether the model overpredicted or underpredicted the outcome. Hence, to calculate the net error, adding all the residuals directly can lead to the cancellations of terms and reduction of the net effect. To avoid this, we take the sum of squares of these error terms, which is called the Residual Sum of Squares (RSS).
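Here’s how the residuals and RSS could be computed by hand for our model, reusing y_train and the fitted values train_fit from earlier:
import numpy as np
# Residuals: actual minus predicted, for every training point
residuals = y_train - train_fit
# Residual Sum of Squares
print('RSS:', np.sum(residuals ** 2))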
Intercept and slope are calculated in Linear Regression by minimizing RSS (Residual Sum of Squares) using calculus. Thankfully, the algorithm takes care of this part, and we don’t have to worry about the maths behind it.
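For the curious, the minimization has a well-known closed-form solution in the single-input case; here’s a sketch that recovers the same constants as the fitted model:
import numpy as np
x = X_train.ravel()
y = y_train.ravel()
# Closed-form OLS: slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),
# intercept = y_mean - slope * x_mean
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(intercept, slope)  # should match Model.intercept_ and Model.coef_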
RSS grows with the number of observations, so its raw value is hard to interpret on its own. Hence, we consider the following measures to evaluate the error in Linear Regression.
1. Mean Squared Error (MSE): It is the mean of squared residuals (e²) and is calculated by dividing RSS by the number of data values. MSE = RSS/n.
2. Root Mean Squared Error (RMSE): As the name suggests, it is the square root of Mean Squared Error and is more suitable when large errors are particularly undesirable.
3. Mean Absolute Error (MAE): Instead of taking squares, we take absolute values of residuals and calculate their mean.
We don’t have to worry about calculating these values as it can be done using pre-defined functions in Python. Let’s check these values for our model with the testing dataset.
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
You should see the following output, with slight variations if any:
Mean Squared Error: 10.186181934530222
Root Mean Squared Error: 3.191579849311344
Mean Absolute Error: 2.505418178966003
You can compare these results with your training set error. Did you get similar error values, or did they differ from the test errors?
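One way to answer that, reusing the metrics module imported above and the training predictions train_fit:
# Same metrics, computed on the training set for comparison
print('Train MSE:', metrics.mean_squared_error(y_train, train_fit))
print('Train RMSE:', np.sqrt(metrics.mean_squared_error(y_train, train_fit)))
print('Train MAE:', metrics.mean_absolute_error(y_train, train_fit))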
Now that you have a way to measure the error, you can decide how much error you can allow to accept the model. This depends on the problem you are solving and the penalty of loss that you would face due to wrong predictions.
As you have seen, Linear Regression is a very straightforward approach to modeling, and it can yield high errors if the data is widely spread out. It’s an inflexible model that assumes only a linear (straight-line) relationship between the variables. Hence, it does not pass through most of the data points, which makes it susceptible to high bias. This causes over-generalization and underfitting, which happen when the model fails to capture the essence of the training data, mainly due to its inflexibility.
Even though Linear Regression is a very simple model, it certainly helps in understanding the workings of other, higher-level models. We saw how Simple Linear Regression works when there is a single predictor. You can find the full code of this post here. I’ll soon write about Multiple Linear Regression, which is an extension of Simple Linear Regression and is used when there is more than one input variable. Stay tuned.
Previously published at https://towardsdatascience.com/simple-linear-regression-35b3d940950e