Introduction to Linear Regression

Linear regression is one of the most widely known and well-understood algorithms in the Machine Learning landscape, and it’s one of the most common topics in interviews for data scientists.

In this tutorial, you will learn the basics of the linear regression algorithm: how it works, how to use it, and finally how you can evaluate its performance.

Previously, I gave you highlights of the different learning problems under Supervised Learning, which can be either Regression or Classification. Let’s see an example of a regression problem.

House Price Linear Regression

Figure 1: A predictor line that estimates the housing price.

Let’s say we are using the housing prices dataset from the City of Belgrade, Serbia. In this dataset, we have a number of houses of different sizes sold for different prices.

Figure 2: Estimating the price of a house given its size in square feet.

Given this dataset, suppose your landlord is looking to put one of his houses up for sale, and the home is about 1750 square feet. With the house’s size provided, you want to suggest how much he could sell it for. If we fit a straight line through these data points, we may suggest that he could sell it for around $2 million based on the graph.

Linear Regression One Variable

A simple way to illustrate how a Supervised Learning algorithm works is this: we feed the collected data, or the “Training set”, to our learning algorithm. It’s now the algorithm’s job to output a function that estimates a given house’s price.

Figure 3: The process of a Supervised Learning algorithm: collecting data, passing the training data into an algorithm, and making predictions.

You might be wondering how our estimated function is defined. As a reminder, our prediction function for one variable is the equation of a straight line, \(y = \theta_{0} + \theta_{1} * x\), commonly written as \(y = mx + b\) (in that form \(m\) is the slope, not the number of training examples used in the notation below).

Note: \(x\) is the house’s size, and \(y\) is the price placed on these houses. \(\theta_{1}\) is the slope or gradient of a line, and \(\theta_{0}\) is the y-intercept or simply where the line crosses the \(y\) axis. 

Here are some essential notations I will be using consistently.

  • m = number of training examples
  • n = number of features
  • X = feature matrix
  • x = feature vector
  • y = target vector
  • \(\theta\) = parameters

Now let’s conduct a little experiment. Set up your environment, open up a new file, title it, save it, and insert the following code. Let’s roll.

Let’s begin by importing libraries and modules from Matplotlib, NumPy, and scikit-learn. If you aren’t yet familiar with these libraries:

  • NumPy: used for numerical processing with Python.
  • Matplotlib: used for data visualization in Python.
  • Scikit-learn: provides the machine learning algorithms we’ll cover today.
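
The original import listing is not reproduced on this page, so here is a minimal sketch of what the imports for this tutorial could look like, assuming NumPy, Matplotlib, and scikit-learn are already installed:

```python
# Numerical arrays and math
import numpy as np

# pyplot is Matplotlib's plotting interface
import matplotlib.pyplot as plt

# The linear regression model from scikit-learn
from sklearn.linear_model import LinearRegression
```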

The snippet above takes care of importing the Python packages we need. We’ll be using the scikit-learn package, so if you do not already have it installed, make sure you follow these guidelines to get it set up on your computer.

We will also be using the NumPy library, whose setup process you can follow right here, and likewise for Matplotlib.

For training, we will be using the feature “size of the house” as an input feature to predict the price of each house.

Figure 4: Training dataset with one feature: the size of each house and the price at which it was sold.

Here \(x\) is the house’s size, while \(y\) is the corresponding price.

Next, let’s instantiate the LinearRegression class and assign it to a variable, classifier. From there, we call its fit method to train the algorithm on the given arguments \(x\) and \(y\).
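
As a rough sketch of this step, the snippet below fits the model on a handful of made-up values. The numbers are invented for illustration and are not taken from the Belgrade dataset. The variable name classifier simply follows the text, even though the model is a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration (not the real dataset):
# house sizes in square feet and sale prices in millions of dollars.
x = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)  # shape (m, 1)
y = np.array([1.2, 1.8, 2.2, 2.8, 3.2])                      # shape (m,)

# Instantiate the model; the variable name follows the text above.
classifier = LinearRegression()

# Train: fit estimates the intercept (theta_0) and the slope (theta_1).
classifier.fit(x, y)

theta_0 = classifier.intercept_  # y-intercept
theta_1 = classifier.coef_[0]    # slope for the single feature
print(f"theta_0 = {theta_0:.4f}, theta_1 = {theta_1:.6f}")
```

Note that scikit-learn expects the features as a 2-D array with one row per sample, which is why the sizes are reshaped into a single column.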

To visualize the line that best fits our data points, we use \(\theta_{0}\) and \(\theta_{1}\) along with \(x\).

Here’s a quick question for you:

To test the linear relationship of y (dependent) and x (independent) continuous variables, which plot is best suited?

Yep, you guessed it right. It’s a scatter plot.

Here we plot our training data and the best-fitted line used to determine the house’s price.

Figure 5: The predictor line that best fits our training dataset.
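
A plot along these lines could be produced with the sketch below. It again uses invented sample values, and the output file name best_fit_line.png is just a placeholder chosen for this example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration (sizes in sq ft, prices in $ millions).
x = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)
y = np.array([1.2, 1.8, 2.2, 2.8, 3.2])

classifier = LinearRegression().fit(x, y)
theta_0, theta_1 = classifier.intercept_, classifier.coef_[0]

# Points on the fitted line: y_hat = theta_0 + theta_1 * x
y_hat = theta_0 + theta_1 * x.ravel()

fig, ax = plt.subplots()
ax.scatter(x.ravel(), y, color="red", label="training data")   # the samples
ax.plot(x.ravel(), y_hat, color="blue", label="best-fit line")  # the predictor
ax.set_xlabel("Size (square feet)")
ax.set_ylabel("Price (millions of dollars)")
ax.legend()
fig.savefig("best_fit_line.png")
```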

To begin making housing price predictions, we need to supply the model with a new sample it has not seen before and then execute our script.

Figure 6: Estimating the price of a new house given its size in square feet.

Once we execute the saved script, we will get the resulting price.
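
A minimal sketch of the prediction step might look like this, assuming the same invented training values as a stand-in for the real data and the 1750-square-foot house from the example earlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration (sizes in sq ft, prices in $ millions).
x = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)
y = np.array([1.2, 1.8, 2.2, 2.8, 3.2])

classifier = LinearRegression().fit(x, y)

# A new, unseen sample: a 1750-square-foot house.
# predict expects a 2-D array of shape (n_samples, n_features).
new_house = np.array([[1750]])
predicted_price = classifier.predict(new_house)[0]

print(f"Estimated price: ${predicted_price:.2f} million")
```

With this toy data the estimate comes out at roughly $2 million, which is in line with what we read off the graph earlier.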

Now that we’ve seen what working with a single feature is like, let’s move on to working with multiple features.

Linear Regression Multiple Variables

Let’s look into Linear Regression with Multiple Variables, also known as Multiple Linear Regression.

In the previous example, we had the house size as a feature to predict the price of the house with the assumption of \(\hat{y}= \theta_{0} + \theta_{1} * x\).

Figure 7: Training dataset with multiple features.

Now we are given more information to predict the price, such as the number of bedrooms and the age of each house (in years). The form of our new assumption will be \(\hat{y}(x) = \theta_{0} + \theta_{1} * x_{1} + \theta_{2} * x_{2} + \theta_{3} * x_{3}\).

Instead of writing out every term as above, we need a general form that also works with more than three attributes. We denote the number of attributes by \(n\).

To generalize, we define an extra feature \(x_{0} = 1\) (a constant) so that the intercept \(\theta_{0}\) becomes part of the same product and the fit is not forced through the origin. Then \(\hat{y} = \theta \cdot x^T\), where our prediction is simply the vector product between our parameter vector \(\theta = [\theta_{0}, \dots, \theta_{n}]\) and the feature vector \(x\), which consists of the constant \(x_{0}\) concatenated with the \(n\) features (house size, number of bedrooms, home’s age): \(x = [1, x_{1}, \dots, x_{n}]\).
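
To make the vector form concrete, here is a small NumPy sketch. The parameter values in theta are hypothetical and chosen only for illustration, with prices expressed in millions of dollars.

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2, theta_3].
theta = np.array([0.5, 0.001, 0.1, -0.01])

# One house: size (sq ft), number of bedrooms, age (years).
features = np.array([1750, 3, 10])

# Prepend the constant x_0 = 1 so that theta_0 acts as the intercept.
x = np.concatenate(([1.0], features))

# y_hat = theta . x (dot product of the two vectors)
y_hat = np.dot(theta, x)

# The same prediction written out term by term.
y_hat_expanded = theta[0] + theta[1] * 1750 + theta[2] * 3 + theta[3] * 10

print(y_hat, y_hat_expanded)
```

Both expressions give the same number. The dot product is just a compact way of writing the sum, and it keeps working no matter how many features we add.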

Let’s conduct another little experiment. Open up another new file, title it, save it, and insert the following code.

We will use the same function, “model”, as above. So please copy and paste it into your new script.

The only difference between this snippet and the previous one is in Lines 3 to 7, where we include more attributes, such as the number of bedrooms and the home’s age, to predict the price of a house.
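
Since the original listing is not shown here, the following is only a sketch of what fitting on several features could look like. The values are made up, and the line numbering of this sketch does not correspond to the lines mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration.
# Columns: size (sq ft), number of bedrooms, age (years).
X = np.array([
    [1000, 2, 30],
    [1500, 3, 20],
    [2000, 3, 15],
    [2500, 4, 10],
    [3000, 4, 5],
    [1800, 2, 25],
])
# Sale prices in millions of dollars (also made up).
y = np.array([1.0, 1.65, 2.2, 2.8, 3.35, 1.85])

classifier = LinearRegression()
classifier.fit(X, y)

print("intercept (theta_0):", classifier.intercept_)
print("weights (theta_1..theta_3):", classifier.coef_)

# Predict the price of an unseen house: 1750 sq ft, 3 bedrooms, 10 years old.
new_house = np.array([[1750, 3, 10]])
predicted_price = classifier.predict(new_house)[0]
print(f"Estimated price: ${predicted_price:.2f} million")
```

scikit-learn adds the intercept for us, so there is no need to append the constant \(x_{0} = 1\) column by hand here.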

Once we execute the Python script, we get the output above.

Next, we need to specify a metric we can use to evaluate our linear regression model.

How can we evaluate the error?

To determine if our prediction is good, we define the error, or residual, as the difference between the target \(y\) and our predicted value \(\hat{y}\).

Figure 8: Mean squared error: a visualization of how we can evaluate if our prediction is good or not.

If our model predicted well, the difference between our prediction \(\hat{y}\) and \(y\) would be very small on average.

\( J(\theta) = \frac{1}{2m} \sum_{j} (y^{(j)} - \hat{y}^{(j)})^2\)

We can rewrite the equation as below.

\( J(\theta) = \frac{1}{2m} \sum_{j} (y^{(j)} - \theta \cdot x^{(j)T})^2\)

Using the mean squared error above (scaled by \(\frac{1}{2}\)) as our cost function, we average the squared differences between our observed values and our predictions across all samples, where \(m\) is the total number of samples (the red dots above).
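
As a rough illustration, the cost function can be written in a few lines of NumPy. The function name cost and the toy data are assumptions made for this sketch. The first column of X is the constant \(x_{0} = 1\).

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over samples of (y - theta . x)^2."""
    m = len(y)                  # number of training examples
    residuals = y - X @ theta   # one residual per sample
    return np.sum(residuals ** 2) / (2 * m)

# Toy data: the first column is the constant x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# These parameters reproduce y exactly, so the cost is zero.
perfect = cost(np.array([0.0, 2.0]), X, y)

# These parameters predict 1, 2, 3, so the residuals are 1, 2, 3.
worse = cost(np.array([0.0, 1.0]), X, y)

print(perfect, worse)
```

The better the parameters fit the data, the closer the cost gets to zero.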

You might ask three questions.

  • The first being, why are we squaring the difference?
  • Secondly, why are we dividing by 2?
  • Thirdly, how does \(\theta\) affect the features we are multiplying by?

Let’s answer the first question. We square the difference \((y^{(j)} - \hat{y}^{(j)})^2\) because squaring makes bigger mistakes count far more than smaller ones, meaning our model is punished more heavily for making larger errors. Squaring also makes every term non-negative, so positive and negative errors cannot cancel each other out.
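
A quick numeric check, using a few arbitrary error values, shows the effect of squaring:

```python
import numpy as np

# Three arbitrary residuals: two small ones with opposite signs, one large.
errors = np.array([1.0, -1.0, 10.0])

squared = errors ** 2           # each residual squared
raw_sum_small = errors[:2].sum()        # +1 and -1 cancel out
squared_sum_small = squared[:2].sum()   # after squaring they no longer cancel

print(squared)
print(raw_sum_small, squared_sum_small)
```

An error that is 10 times larger contributes 100 times more to the cost, and the two small errors of opposite sign add up instead of cancelling.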

The answer to the second question is mathematical convenience. When you differentiate \(J(\theta)\), the square produces an extra factor of 2. The 2 in the denominator is there to cancel it. So as not to overload you, I will go into more detail on the derivation in the next tutorial.
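
As a brief preview only, and as a sketch for a single parameter \(\theta_{k}\) using the cost function above and the chain rule, the 2 from the square cancels the 2 in the denominator:

\( \frac{\partial J}{\partial \theta_{k}} = \frac{1}{2m} \sum_{j} 2 \, (y^{(j)} - \theta \cdot x^{(j)T}) \cdot (-x_{k}^{(j)}) = -\frac{1}{m} \sum_{j} (y^{(j)} - \theta \cdot x^{(j)T}) \, x_{k}^{(j)} \)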

Note: There are many other metrics available. However, the mean squared error is a common choice since it’s computationally convenient.

Thirdly, I like to think of \(\theta\) as a set of weights or parameters that control the system’s behavior. In other words, it determines how each of the features affects the price of the house. So,

  • Positive weight = If we receive a positive weight, increasing the value of that feature increases the value of our prediction \(\hat{y}\)
  • Zero weight = If our feature’s weight is zero, then it does not affect our prediction \(\hat{y}\)
  • Negative weight = If we receive a negative weight, increasing the value of that feature decreases the value of our prediction \(\hat{y}\)
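
The three cases above can be checked with a small sketch. The weights are hypothetical: a positive weight on size, a zero weight on bedrooms, and a negative weight on age. The helper name predict is chosen for this example.

```python
import numpy as np

# Hypothetical weights for [x_0 = 1, size, bedrooms, age].
theta = np.array([0.5, 0.001, 0.0, -0.01])

def predict(x):
    """Return y_hat = theta . x for one feature vector."""
    return float(np.dot(theta, x))

base = predict(np.array([1, 1750, 3, 10]))

# Positive weight: a bigger house raises the prediction.
bigger = predict(np.array([1, 1850, 3, 10]))

# Zero weight: an extra bedroom changes nothing.
more_rooms = predict(np.array([1, 1750, 4, 10]))

# Negative weight: an older house lowers the prediction.
older = predict(np.array([1, 1750, 3, 20]))

print(base, bigger, more_rooms, older)
```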

Finding good parameters

Given a representation of the cost function \(J(\theta)\), the problem of learning becomes a basic optimization problem, which we will cover in more detail in the next tutorial. A simple question I want you to think about is:

How can we find the values of Θ that minimize our cost function J(Θ)?


To conclude this tutorial, you discovered how to implement linear regression step-by-step with a sample dataset. You learned:

  • How to call the Linear Regression model from the sklearn module.
  • How to make predictions using your learned model.
  • A metric you can use when evaluating your regression model.
  • Three commonly asked questions about linear regression.

In the next tutorial, we will learn how to find both the intercept and the regression coefficients (also known as weights in deep learning) of a linear regression model from your training data.

Do you have any questions about this post or linear regression? Leave a comment and ask questions, I’ll do my best to answer.

  1. Great post, David!

    I think you should mention the shortcomings of linear regression as well. For instance, referring to the Anscombe’s Quartet will be really helpful for learners to understand that linear regression can fail if you have non-linear relationships in your data.
