Linear Regression using Gradient Descent in Python

gradient descent plot with python

In the last two tutorials,  I have specified the equation of the mean squared error $J(\theta)$, which measures how wrong the model is in modeling the relationship between $ X $ and $y$.

In continuation of the previous tutorial behind the gradient descent algorithm, you will undoubtedly learn how to perform linear regression using gradient descent in Python on a new cost function $J(\theta)$ the “mean square error”.

Not only will you get a greater understanding of the algorithm, which I had guaranteed, yet additionally execute a Python script yourself.

First, let’s review some ideas concerning the topic.

Gradient Descent Review

What we want to achieve is to update our $\theta$ value in a way that the cost function gradually decreases over time. 

To accomplish this, what should we do actually? This is a specific questions we have to decide on. Should we reduce or increase the values of $\theta$ ?

Considering the graph, we must take a step in the negative direction (going downwards).

Every step we take always provides us a new value of $\theta$. When we obtain a new value, we need to constantly keep track of the gradient, and evaluate it gradient with newfound value,  then repeat taking the same step until we are at the global minimum (the blue dot).

Gradient Descent Illustration

A straightforward illustration of the gradient descent algorithm is as follow:

Initialize theta

# keep updating our theta values while we haven't converged
do {
     theta = theta - alpha * J(theta)
} while(Not converged)
  1.  Initialization: We initialize our parameters $\theta$ arbitrarily.
  2.  Iteration: Then iterate finding the gradient of our function $J(\theta)$ and updating it by a small learning rate, which may be constant or may change after a certain number of iterations.
  3.  Stopping: Stopping the procedure either when $J(\theta)$ is not changing adequately or when our gradient is sufficiently small. 

Gradient For The Mean Squared Error (MSE)

$J(\theta) = \frac{1}{2m} \sum_{j} (\theta x^{(j)} – y^{(j)})^ 2$      [1.0]

For linear regression, we can compute the Mean Squared Error cost function and then compute its gradient.

$J(\theta) = \frac{1}{2m} \sum_{j} (\theta_{0}x_{0}^{(j)}\hspace{2mm}  + \hspace{2mm}\theta_{1}x_{1}^{(j)}\hspace{2mm} +\hspace{2mm} … \hspace{2mm}\theta_{n}x_{n}^{(j)} – y^{(j)})^ 2$      [1.1]

Let’s expand our prediction right into each term and currently signify the error residual $e_{j}(\theta)$ to simplify the way we will write it.

$e_{j}(\theta)$ as  $(\theta_{0}x_{0}^{(j)}\hspace{2mm}  + \hspace{2mm}\theta_{1}x_{1}^{(j)}\hspace{2mm} +\hspace{2mm} … \hspace{2mm}\theta_{n}x_{n}^{(j)}\hspace{2mm} – \hspace{2mm}y^{(j)})$

Taking the derivative of this equation sounds a little more tricky if you’ve been out of calculus class for a while. However, don’t stress too much. I’ve got you covered. Let’s begin by showing you the derivation procedure.

$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \frac{1}{2m} \sum_{j} e_{j}(\theta)^2$      [1.2]


$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} \frac{\partial}{\partial \theta} e_{j}(\theta)^2$      [1.3]

To move from formula [1.2] to [1.3], we need to use two necessary derivative regulations. The scalar multiple and sum rule:

Scalar multiple rule: $\hspace{10mm} \frac{d}{dx} (\alpha\hspace{0.3mm} u) = \alpha \frac{du}{dx}$


Sum rule: $ \hspace{10mm} \frac{d}{dx} (\sum \hspace{0.3mm} u) = \sum \frac{du}{dx}$


$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} \hspace{1mm}2 e_{j}\hspace{1mm}(\theta)\hspace{1mm} \frac{\partial}{\partial \theta}\hspace{1mm} e_{j}(\theta)$      [1.4]

Next what we did to get from [1.3] to [1.4], is we applied both the power rule and the chain rule. If you’re not accustomed to the guidelines, you may discover them below:

Power rule: $\hspace{10mm} \frac{d}{dx} u ^n = nu^{n-1} \frac{du}{dx}$


Chain rule: $\hspace{10mm} \frac{d}{dx} f(g(x)) = {f}'(g(x))\hspace{1mm}{g}'(x)$

$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{j} e_{j}(\theta)$      [1.5]

Lastly, to go from [1.4] to [1.5], equation [1.5] gives us the partial derivative of the MSE $J(\theta)$ with respect to our $\theta$ variables.

We can assess by taking the partial derivative of the MSE with respect to $\theta_{0}$.

$\frac{\partial}{\partial \theta_{0}} e_{j}(\theta) = \frac{\partial}{\partial \theta_{0}}\theta_{0}x_{0}^{(j)} + \frac{\partial}{\partial \theta_{0}}\theta_{1}x_{1}^{(j)} + …. + \frac{\partial}{\partial \theta_{0}}\theta_{n}x_{n}^{(j)} \hspace{1mm}-\hspace{1mm} y^{(j)}= x_{0}^{(j)} $ 


To find the given partial derivative of the function, we should deal with the other variables as constants. Such as when taking this partial derivative of $\theta_{0}$, all other variables are dealt with as constants $( y, \theta_{1}, x_{1}, …, x_{n} )$.

We can duplicate this for various other parameters like $\theta_{1}, \theta_{2} …. \theta_{n} $ and instead of just attribute $x_{0}^{j}$ we will  obtain features $x_{1}^{j}, x_{2}^{j}, … ,x_{n}^{j}$.

Applying Gradient Descent in Python

Now we know the basic concept behind gradient descent and the mean squared error,  let’s implement what we have learned in Python.

Open up a new file, name it, and insert the following code:

# import the necessary packages
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns


Let’s start by importing our required Python libraries from Matplotlib, NumPy, and Seaborn.

def gradient_descent(X, y, alpha=0.01, epochs=30):
    :param x: feature matrix
    :param y: target vector
    :param alpha: learning rate (default:0.01)
    :param epochs: maximum number of iterations of the
           linear regression algorithm for a single run (default=30)
    :return: weights, list of the cost function changing overtime

    m = np.shape(X)[0]  # total number of samples
    n = np.shape(X)[1]  # total number of features

    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    W = np.random.randn(n + 1, )

    # stores the updates on the cost function (loss function)
    cost_history_list = []

Next, let’s arbitrarily initialize and concatenate our intercept coefficient and regression coefficient represented by $W$ “weights.”

One thing that might puzzle you by now is, why initialize our weights arbitrarily? Why not establish it as all 0’s or 1’s? 

The reason behind this logic is because:

  • If the weights in our network start too small, then the output shrinks until it's too small to be useful.
  • If the weights in our network start too large, then the output swells until it's too large to be useful.

So we intend to set it at about a suitable value because we want to calculate interesting functions.

Once those values have been initialized arbitrarily, let’s define some constants based on the size of our Dataset and an empty list to keep track of the cost function as it changes each iteration.

  • n : The overall number of samples
  • cost_history_list : Save the history of our cost function over after every iteration (at some point described as epochs)
    # iterate until the maximum number of epochs
    for current_iteration in np.arange(epochs):  # begin the process

        # compute the dot product between our feature 'X' and weight 'W'
        y_estimated =

        # calculate the difference between the actual and predicted value
        error = y_estimated - y

        # calculate the cost (Mean squared error - MSE)
        cost = (1 / 2 * m) * np.sum(error ** 2)

While iterating, until we reach the maximum number of epochs, we calculate the estimated value y_estimated which is the dot product of our feature matrix $X$ as well as weights $W$.

With our given predictions, the following step is to determine the “error” or merely the difference between our actual value “y” and our estimated value “y_estimated.” Then evaluate our prediction with the overall performance of the machine learning model for the given training data with a set of weights initialized to get the cost

        # Update our gradient by the dot product between
        # the transpose of 'X' and our error divided by the
        # total number of samples
        gradient = (1 / m) *

        # Now we have to update our weights
        W = W - alpha * gradient

The next action will be to calculate the partial derivative with respect to the weights $W$. When we have the gradient, we need to readjust the previous values for $W$. That is, the gradient multiplied by the learning rate alpha deducted from the previous weight matrix.

        # Let's print out the cost to see how these values
        # changes after every 10th iteration
        if current_iteration % 10 == 0:
            print(f"cost:{cost} \t iteration: {current_iteration}")

        # keep track the cost as it changes in each iteration

    return W, cost_history_list

A condition statement is being utilized to print out how the cost function changes every 10 epochs on the console. 

To keep track of exactly how the cost function changes over time, we include the values into a list cost_history_list on each iteration. Then return our parameters $W$ and cost_history_list populated with the changes in our cost function over time. 

def main():
    rng = np.random.RandomState(10)
    X = 10 * rng.rand(100, 5)  # feature matrix
    y = 0.9 +, [2.2, 4., -4, 1, 2])  # target vector

    # calls the gradient descent method
    weight, cost_history_list = gradient_descent(X, y, alpha=0.01,

    # visualize how our cost decreases over time
    plt.plot(np.arange(len(cost_history_list)), cost_history_list)
    plt.xlabel("Number of iterations (Epochs)")
    plt.ylabel("Cost function  J(Θ)")
    plt.title("Gradient Descent")

if __name__ == '__main__':

For the final step, to walk you through what goes on within the main function, we generated a regression problem on lines 2 – 4. We have a total of 100 data points, each of which are 5D. 

Our goal is to correctly map these randomized feature training samples $(100×5)$ (a.k.a. dependent variable) to our independent variable $(100×1)$.

Within line 7, we pass the feature matrix, target vector, learning rate, and the number of epochs to the gradient_descent function, which returns the best parameters and the saved history of the lost/cost after each iteration.

The last block of code from lines 10 – 14 aids in envisioning how the cost adjusts on each iteration.

To check the cost modifications from your command line, you can execute the following command:


cost:2870624.491941226 	 iteration: 0
cost:145607.02085484721 	 iteration: 10
cost:17394.518134046495 	 iteration: 20
cost:2149.654826207572 	 iteration: 30
cost:326.1941755250162 	 iteration: 40
cost:105.19630973147717 	 iteration: 50
cost:77.23993590601086 	 iteration: 60
cost:72.96453135073557 	 iteration: 70
cost:71.71480871397247 	 iteration: 80
cost:70.89080664015975 	 iteration: 90
gradient descent plot with python
Figure 1: Visualization of the cost function changing overtime

Observations on Gradient Descent

In my view, gradient descent is a practical algorithm; however, there is some information you should know.

Gradient Descent Linear Regression Global minimum and Local minimum
Figure 2: Visualization of the global minimum and local minimum
  1. Local Minima: It can be sensitive to initialization and can get stuck in local minima. Meaning, depending on the value given to $\theta$ it will give us a different gradient direction leading either towards a local or global minimum. 
  2. Step Size: The selection of step size can influence both the convergence rate and our procedure’s behavior. If the chosen step size is too large, we can jump over the minimum, and if it’s too small, we make very little progress increasing the amount of time it takes to converge.


To conclude this tutorial, you discovered how to implement linear regression using Gradient Descent in Python. You learned:

  • An easy brush up with the concept behind Gradient Descent.
  • How to calculate the derivative of the mean squared error.
  • How to implement Linear Regression using Gradient Descent in Python.
  • Observations behind the gradient descent algorithm. 

Do you have any questions concerning this post or gradient descent? Leave a comment and ask your question. I’ll do my best to answer.

To get all access to the source code used in all tutorials, leave your email address in any of the page’s subscription forms.

You should click on the “Click to Tweet Button” below to share on twitter.

Check out the post on Linear Regression using Gradient Descent (GD) with Python.


Further Reading




We have listed some useful resources below if you thirst for more reading.

To be notified when this next blog post goes live, be sure to enter your email address in the form below!

Upon entering your email for new visitors, you will receive memorable infographics just for you by your email and all access to the source code used in all tutorials. Be sure to confirm to respect the privacy of everyone. 


Share on facebook
Share on twitter
Share on linkedin
Share on reddit

Speak Your Mind

Leave a Reply

Your email address will not be published. Required fields are marked *

About Me

David Praise Chukwuma Kalu

David Praise Chukwuma Kalu

He's an entrepreneur who loves Computer Vision and Machine Learning.

Read more about him

Get the cheatsheet I wish I had before starting my career as a

About Me

David Praise Chukwuma Kalu

David Praise Chukwuma Kalu

He's a Data Scientist and as well an entrepreneur who loves Computer Vision and Machine Learning.

Get weekly data science tips from David Praise that keeps you more informed. It's data science school in bite-sized chunks!