
Linear Regression using Gradient Descent in Python


In the last two tutorials, I introduced the mean squared error cost function \(J(\theta)\), which measures how far the model's predictions are from the true relationship between \( X \) and \( y \).

Continuing from the previous tutorial on the gradient descent algorithm, you will learn how to perform linear regression using gradient descent in Python, with the mean squared error as the cost function \( J(\theta) \).

As promised, you will not only gain a deeper understanding of the algorithm but also run a Python script yourself.

First, let’s review some ideas concerning the topic.

Gradient Descent Review

What we want to achieve is to update our \( \theta \) value in a way that the cost function gradually decreases over time.

To accomplish this, what should we actually do? We have to decide on one specific question: should we reduce or increase the value of \( \theta \)?

Looking at the graph, we must take a step in the direction of the negative gradient (going downhill).

Every step gives us a new value of \( \theta \). With each new value, we re-evaluate the gradient at that point and take another step, repeating until we reach the global minimum (the blue dot).

Gradient Descent Illustration

A straightforward illustration of the gradient descent algorithm is as follows:

  1.  Initialization: We initialize our parameters \( \theta \) arbitrarily.
  2.  Iteration: We repeatedly compute the gradient of our function \( J(\theta) \) and update \( \theta \) by a small step against the gradient, scaled by a learning rate that may be constant or may change after a certain number of iterations.
  3.  Stopping: We stop the procedure either when \( J(\theta) \) is no longer changing appreciably or when the gradient is sufficiently small.
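
The three steps above can be sketched on a toy problem. This is a minimal illustration, not the tutorial's script: the helper name `gradient_descent_1d`, the cost \( J(\theta) = (\theta - 3)^2 \), and all parameter values are assumptions chosen only to make the loop easy to follow.

```python
def gradient_descent_1d(grad, theta0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Minimize a 1-D function given its gradient (hypothetical helper)."""
    theta = theta0                      # 1. Initialization
    for _ in range(max_iters):          # 2. Iteration
        g = grad(theta)
        if abs(g) < tol:                # 3. Stopping: the gradient is small enough
            break
        theta = theta - alpha * g       # step against the gradient
    return theta


# Assumed toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent_1d(lambda t: 2.0 * (t - 3.0), theta0=-5.0)
print(theta_min)  # very close to 3, the minimizer of the toy cost
```

Here the stopping test uses the size of the gradient; checking the change in \( J(\theta) \) between iterations would work just as well.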

Gradient For The Mean Squared Error (MSE)

\( J(\theta) = \frac{1}{2m} \sum_{j} (\theta x^{(j)} - y^{(j)})^{2} \)      [1.0]

For linear regression, we can compute the Mean Squared Error cost function and then compute its gradient.

\( J(\theta) = \frac{1}{2m} \sum_{j} (\theta_{0}x_{0}^{(j)} \hspace{2mm} + \hspace{2mm} \theta_{1}x_{1}^{(j)} \hspace{2mm} + \hspace{2mm} \dots \hspace{2mm} + \hspace{2mm} \theta_{n}x_{n}^{(j)} \hspace{2mm} - \hspace{2mm} y^{(j)})^{2} \)      [1.1]
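
As a rough sketch of how equation [1.1] translates to NumPy, the snippet below evaluates the cost on a tiny hand-made dataset. The function name `mse_cost`, the toy data, and the convention that the first column of `X` is a constant 1 (so that \( \theta_{0} \) acts as the intercept) are assumptions made for illustration.

```python
import numpy as np


def mse_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum_j (x^(j) . theta - y^(j))^2 (hypothetical helper)."""
    m = X.shape[0]                      # number of samples
    residuals = X @ theta - y           # one residual per sample
    return float(residuals @ residuals) / (2 * m)


# Toy data: the first column is a constant 1, the second is the feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(mse_cost(X, y, np.array([0.0, 2.0])))  # 0.0, because y = 2 * x fits exactly
print(mse_cost(X, y, np.array([0.0, 0.0])))  # (4 + 16 + 36) / 6
```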

Let’s expand our prediction into its individual terms and denote the error residual by \( e_{j}(\theta) \) to simplify the notation.

We define \( e_{j}(\theta) \) as \( (\theta_{0}x_{0}^{(j)} \hspace{2mm} + \hspace{2mm} \theta_{1}x_{1}^{(j)} \hspace{2mm} + \hspace{2mm} \dots \hspace{2mm} + \hspace{2mm} \theta_{n}x_{n}^{(j)} \hspace{2mm} - \hspace{2mm} y^{(j)}) \)

Taking the derivative of this equation may seem a little tricky if you’ve been out of calculus class for a while. However, don’t stress too much; I’ve got you covered. Let’s walk through the derivation.

\(\frac{\partial J(\theta)}{\partial \theta} = \frac{ \partial}{\partial \theta} \frac{1}{2m} \sum_{j} e_{j}(\theta)^2\)      [1.2]

\(\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} \frac{\partial}{\partial \theta} e_{j}(\theta)^2\)      [1.3]

To move from formula [1.2] to [1.3], we use two derivative rules: the scalar multiple rule and the sum rule.

Scalar multiple rule: \(\hspace{10mm} \frac{d}{dx} (\alpha \hspace{0.3mm} u) = \alpha \frac{du}{dx}\)

Sum rule: \( \hspace{10mm} \frac{d}{dx} (\sum \hspace{0.3mm} u) = \sum \frac{du}{dx}\)

\(\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} \hspace{1mm} 2 \hspace{1mm} e_{j}(\theta) \hspace{1mm} \frac{\partial}{\partial \theta} \hspace{1mm} e_{j}(\theta)\)      [1.4]

Next, to get from [1.3] to [1.4], we applied both the power rule and the chain rule. If you’re not familiar with these rules, you can find them below:

Power rule: \(\hspace{10mm} \frac{d}{dx} u ^n = nu^{n-1} \frac{du}{dx}\)

Chain rule: \(\hspace{10mm} \frac{d}{dx} f(g(x)) = {f}'(g(x)) \hspace{1mm}{g}'(x)\)

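\(\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{j} e_{j}(\theta) \hspace{1mm} \frac{\partial}{\partial \theta} \hspace{1mm} e_{j}(\theta)\)      [1.5]

Lastly, to go from [1.4] to [1.5], we bring the constant 2 outside the summation, where it cancels against the \(\frac{1}{2m}\) factor and leaves \(\frac{1}{m}\). Equation [1.5] gives us the partial derivative of the MSE \(J(\theta)\) with respect to our \(\theta\) variables, up to the factor \(\frac{\partial}{\partial \theta} e_{j}(\theta)\), which still has to be evaluated.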

We can evaluate the factor \(\frac{\partial}{\partial \theta} e_{j}(\theta)\) one parameter at a time, starting with the partial derivative of the error residual \(e_{j}(\theta)\) with respect to \(\theta_{0}\).

\( \frac{\partial}{\partial \theta_{0}} e_{j}(\theta) = \frac{\partial}{\partial \theta_{0}} \theta_{0}x_{0}^{(j)} + \frac{\partial}{\partial \theta_{0}} \theta_{1}x_{1}^{(j)} + \dots + \frac{\partial}{\partial \theta_{0}} \theta_{n}x_{n}^{(j)} \hspace{1mm} - \hspace{1mm} \frac{\partial}{\partial \theta_{0}} y^{(j)} = x_{0}^{(j)} \)

To find a partial derivative of a function, we treat the other variables as constants. For example, when taking the partial derivative with respect to \(\theta_{0}\), all other variables \(( y, \theta_{1}, x_{1}, \dots, x_{n} )\) are treated as constants.

We can repeat this for the other parameters \(\theta_{1}, \theta_{2}, \dots, \theta_{n} \), and instead of the feature \(x_{0}^{(j)}\) we obtain the features \(x_{1}^{(j)}, x_{2}^{(j)}, \dots ,x_{n}^{(j)}\). In general, \(\frac{\partial J(\theta)}{\partial \theta_{k}} = \frac{1}{m} \sum_{j} e_{j}(\theta) \hspace{1mm} x_{k}^{(j)}\).
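
One way to convince yourself that the derivation is right is to compare the analytic gradient, whose component for \( \theta_{k} \) is \( \frac{1}{m} \sum_{j} e_{j}(\theta) x_{k}^{(j)} \), against a numerical finite-difference estimate. The sketch below does that on random data; the function names and the data are assumptions made purely for this check.

```python
import numpy as np


def cost(X, y, theta):
    """MSE cost J(theta) with the 1/(2m) scaling."""
    m = X.shape[0]
    e = X @ theta - y
    return (e @ e) / (2 * m)


def analytic_gradient(X, y, theta):
    """Gradient from the derivation: component k is (1/m) * sum_j e_j * x_k^(j)."""
    m = X.shape[0]
    e = X @ theta - y
    return (X.T @ e) / m


def numeric_gradient(X, y, theta, h=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = h
        grad[k] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * h)
    return grad


rng = np.random.default_rng(0)          # seeded only for reproducibility
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

print(np.allclose(analytic_gradient(X, y, theta),
                  numeric_gradient(X, y, theta), atol=1e-6))
```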

Applying Gradient Descent in Python

Now that we know the basic concept behind gradient descent and the mean squared error, let’s implement what we have learned in Python.

Open up a new file, name it, and insert the following code:

Let’s start by importing our required Python libraries from Matplotlib, NumPy, and Seaborn.

Next, let’s randomly initialize and concatenate our intercept coefficient and regression coefficients, represented by the weights \(W\).
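
Since the original listing is not reproduced here, the following is only a sketch of what such an initialization might look like. The variable names, the feature count, the use of a standard normal distribution, and the fixed seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # seeded only so the sketch is reproducible
n_features = 5                          # hypothetical number of features

intercept = rng.normal(size=(1, 1))                 # intercept coefficient
coefficients = rng.normal(size=(n_features, 1))     # one coefficient per feature
W = np.concatenate([intercept, coefficients], axis=0)

print(W.shape)  # (6, 1)
```

With the intercept stored inside `W` like this, the feature matrix needs a matching column of ones so that the dot product of `X` and `W` is well defined.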

One thing that might puzzle you by now is: why initialize our weights randomly? Why not set them all to 0 or 1?

The reasoning comes from neural networks:

  • If the weights in a network start too small, then the output shrinks until it’s too small to be useful.
  • If the weights in a network start too large, then the output swells until it’s too large to be useful.

So we want to start at a moderate value. For linear regression with the MSE cost, which is convex, the starting point matters much less: gradient descent reaches the same minimum from any initialization, given a suitable learning rate.

Once those values have been initialized, let’s define some constants based on the size of our dataset, and an empty list to keep track of the cost function as it changes each iteration.

  • n : The total number of samples
  • cost_history_list : Stores the value of our cost function after every iteration (also referred to as epochs)

While iterating, until we reach the maximum number of epochs, we calculate the estimated value y_estimated, which is the dot product of our feature matrix \(X\) and the weights \(W\).

With our predictions in hand, the next step is to determine the “error”, which is simply the difference between our estimated value “y_estimated” and our actual value “y”. We then compute the cost, which measures the overall performance of the model on the training data for the current set of weights.
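
To make the prediction, error, and cost steps concrete, here is a small sketch on three hand-made samples. The data and the current weights are made up for illustration, and the error is taken as prediction minus target to match the residual \( e_{j}(\theta) \) defined earlier.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])              # 3 samples; the first column is a constant 1
y = np.array([[1.0], [3.0], [5.0]])     # targets, shape (3, 1)
W = np.array([[0.0], [1.0]])            # current weights, shape (2, 1)

n = X.shape[0]                          # total number of samples
y_estimated = X.dot(W)                  # predictions, shape (3, 1)
error = y_estimated - y                 # residuals: [-1, -2, -3]
cost = (1 / (2 * n)) * np.sum(error ** 2)

print(y_estimated.ravel(), error.ravel(), cost)  # cost is (1 + 4 + 9) / 6
```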

The next step is to calculate the partial derivative of the cost with respect to the weights \(W\). Once we have the gradient, we update the previous values of \(W\): the gradient, multiplied by the learning rate alpha, is subtracted from the previous weight matrix.
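
A single update step might look like the sketch below. The toy data, the zero starting weights, and the learning rate of 0.1 are assumptions picked so the arithmetic is easy to verify by hand.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
W = np.zeros((2, 1))                    # starting weights for this one step
alpha = 0.1                             # assumed learning rate
n = X.shape[0]

error = X.dot(W) - y                    # residuals at the current weights
cost_before = np.sum(error ** 2) / (2 * n)

gradient = (1 / n) * X.T.dot(error)     # shape (2, 1), same as W
W_new = W - alpha * gradient            # move against the gradient

cost_after = np.sum((X.dot(W_new) - y) ** 2) / (2 * n)
print(W_new.ravel(), cost_before, cost_after)
```

The gradient has one entry per weight, so the subtraction is element-wise, and the cost after the step is lower than before.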

A conditional statement prints how the cost function changes every 10 epochs to the console.

To keep track of how the cost function changes over time, we append its value to the list cost_history_list on each iteration. We then return our parameters \(W\) and cost_history_list, populated with the values of our cost function over time.
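
Putting the steps described so far together gives something like the function below. It is a reconstruction under assumptions, not the original listing: the parameter names, default values, the `seed` argument, and the print format are all hypothetical, and the intercept is handled by a column of ones that the caller adds to `X`.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.01, num_epochs=100, seed=0):
    """Batch gradient descent for linear regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, n_features = X.shape
    W = rng.normal(size=(n_features, 1))        # random initialization
    cost_history_list = []

    for epoch in range(num_epochs):
        y_estimated = X.dot(W)                  # predictions
        error = y_estimated - y                 # residuals
        cost = (1 / (2 * n)) * np.sum(error ** 2)
        gradient = (1 / n) * X.T.dot(error)     # partial derivatives w.r.t. W
        W = W - alpha * gradient                # update step

        if epoch % 10 == 0:                     # report progress every 10 epochs
            print(f"epoch {epoch:4d}  cost {cost:.6f}")
        cost_history_list.append(cost)

    return W, cost_history_list


# Toy usage: y = 1 + 2 * x exactly, with a column of ones for the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = (1.0 + 2.0 * x).reshape(-1, 1)

W, cost_history_list = gradient_descent(X, y, alpha=0.1, num_epochs=1000)
print(W.ravel())  # approaches [1, 2]
```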

For the final step, to walk you through what goes on within the main function: we generate a regression problem on lines 60 – 62. We have a total of 100 data points, each of which is 5-dimensional.

Our goal is to correctly map these randomized training samples \((100×5)\) (the features, a.k.a. independent variables) to our target \((100×1)\) (the dependent variable).

On line 65, we pass the feature matrix, target vector, learning rate, and the number of epochs to the gradient_descent function, which returns the best parameters and the saved history of the loss/cost after each iteration.
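
The main routine could be sketched as follows. How the original script generated its data is not shown here, so this version simply draws a synthetic problem with NumPy; the ground-truth weights `true_W`, the noise level, the learning rate, and the epoch count are assumptions, and the `gradient_descent` helper is a compact stand-in defined inline so the snippet runs on its own.

```python
import numpy as np


def gradient_descent(X, y, alpha, num_epochs, rng):
    """Compact batch gradient descent for linear regression (illustrative)."""
    n = X.shape[0]
    W = rng.normal(size=(X.shape[1], 1))
    cost_history_list = []
    for _ in range(num_epochs):
        error = X.dot(W) - y
        cost_history_list.append(float(np.sum(error ** 2) / (2 * n)))
        W = W - alpha * X.T.dot(error) / n
    return W, cost_history_list


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                         # 100 samples, 5 features
true_W = rng.normal(size=(5, 1))                      # hypothetical ground truth
y = X.dot(true_W) + 0.1 * rng.normal(size=(100, 1))   # noisy targets, shape (100, 1)

W, cost_history_list = gradient_descent(X, y, alpha=0.1, num_epochs=200, rng=rng)
print(W.shape, cost_history_list[0], cost_history_list[-1])
```

Because the targets were built from known weights, comparing `W` with `true_W` gives a quick sanity check that the procedure recovered something sensible.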

The last block of code, on lines 68 – 72, helps visualize how the cost changes on each iteration.
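
A plot of this kind can be produced with a few lines of Matplotlib. The sketch below is an assumption about the general shape of such code, not the original listing: it builds a short cost history from a toy 1-D descent so that it runs on its own, selects the non-interactive `Agg` backend, and saves to a hypothetical file name instead of opening a window.

```python
import matplotlib
matplotlib.use("Agg")                   # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Build a small cost history from a toy descent on J(theta) = (theta - 3)^2.
theta, alpha = -5.0, 0.1
cost_history_list = []
for _ in range(50):
    cost_history_list.append((theta - 3.0) ** 2)
    theta -= alpha * 2.0 * (theta - 3.0)

fig, ax = plt.subplots()
ax.plot(range(len(cost_history_list)), cost_history_list)
ax.set_xlabel("Epoch")
ax.set_ylabel("Cost")
ax.set_title("Cost function over time")
fig.savefig("cost_history.png")         # hypothetical output file name
```

In an interactive session you would typically call `plt.show()` instead of saving the figure.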

To see how the cost changes from your command line, you can execute the following command:

Figure 1: Visualization of the cost function changing over time

Observations on Gradient Descent

In my view, gradient descent is a practical algorithm; however, there are a few things you should know.

Figure 2: Visualization of the global minimum and local minimum
  1. Local Minima: It can be sensitive to initialization and can get stuck in local minima. Depending on the initial value given to \(\theta\), the gradient points in a different direction, leading towards either a local or the global minimum. (The MSE cost of linear regression is convex, so it has no local minima other than the global one; the issue arises with non-convex cost functions.)
  2. Step Size: The choice of step size influences both the convergence rate and the behavior of the procedure. If the chosen step size is too large, we can jump over the minimum; if it’s too small, we make very little progress, which increases the time it takes to converge.
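
The effect of the step size is easy to see on a toy problem. The sketch below runs 25 steps of gradient descent on the assumed cost \( J(\theta) = (\theta - 3)^2 \) with three learning rates; the function name `run` and the specific values are illustrative choices only.

```python
def run(alpha, theta0=-5.0, steps=25):
    """Run a fixed number of gradient descent steps on J(theta) = (theta - 3)^2."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * 2.0 * (theta - 3.0)    # gradient of J is 2 * (theta - 3)
    return theta


too_small = run(0.001)    # barely moves away from the starting point
reasonable = run(0.1)     # ends close to the minimum at 3
too_large = run(1.1)      # overshoots more on every step and diverges

print(too_small, reasonable, too_large)
```

On this cost each step multiplies the distance to the minimum by \( 1 - 2\alpha \), so any learning rate above 1 makes that factor larger than 1 in magnitude and the iterates move away from the minimum.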


To conclude, in this tutorial you discovered how to implement linear regression using gradient descent in Python. You learned:

  • A quick refresher on the concept behind gradient descent.
  • How to calculate the derivative of the mean squared error.
  • How to implement linear regression using gradient descent in Python.
  • Observations on the gradient descent algorithm.

Do you have any questions concerning this post or gradient descent? Leave a comment and ask your question. I’ll do my best to answer.



  1. Logan Khan

    Please let me know the link to the source code of this article. I will be thankful.

    • David Praise Chukwuma Kalu

      To get all access to the source code used in all tutorials, leave your email address in any of the page’s subscription forms.

  2. Heya i’m for the first time here. I found your blog and I find It really useful & it helped me out a lot.
    I hope to give something back and help others like you helped me.

    • David Praise Chukwuma Kalu

      It’s honestly my pleasure Arnulfo.

  3. Incredible points. Sound arguments. Keep up the amazing effort.

    • David Praise Chukwuma Kalu

      Thank you very much Mohammed for your kind feedback.

  4. Hello
    Thank you very much for your super useful tutorial. I just don’t get how you move from equation 1.4 to 1.5, could you explain it to me please ?
    Thanks again


    • David Praise Chukwuma Kalu

      Since the value 2 is a constant, it is brought outside of the summation sign. Then, we will have (2*1) / (2*m), which can be simplified into 1/m.

  5. Hi there! Someone in my Facebook group shared this website with us so I came to look it over.
    I’m definitely loving the information. I’m bookmarking and will be tweeting this
    to my followers! Terrific blog and amazing style and design.

    • David Praise Chukwuma Kalu

      Thank you very much, Benny, for your kind words, and I appreciate your honest feedback. Stay tuned for more tutorials on the blog, and don’t forget to subscribe to my Youtube channel, as I recently started releasing a bunch of video tutorials.


  6. Hi, I do not understand how this calculates the gradient:

    gradient = (1 / m) *

    Is the gradient just as simple as the dot product of the features and the error?

    How is the gradient the same size as W. Let’s say you have 2 features and 4 samples, X is 4×2 matrix, and the error is 1×4. The gradient would be a 1×2. What if our W is of any size other than that?

    • David Praise Chukwuma Kalu

      Hello Saleh, and thanks for asking your question.

      Yes, the gradient is simply the dot product between the features matrix (transposed) and the error (the difference between the predicted value and the actual label), scaled by 1/m.

      Regarding your question about whether the calculation is possible if W (the weight matrix) is of any size: it is, as long as the following rule of matrix multiplication holds:

      – Two matrices can be multiplied only when the number of columns in the first equals the number of rows in the second.

      If the statement doesn’t hold, then it’s not possible.

  7. Hi, I have some questions

    1. How do you make sure the gradient is the same size as W. Shouldn’t every gradient[i] correspond to the weight W[i]?
    gradient = (1 / m) *

    Lets say there are 4 samples and 2 features, so X is a 4×2 matrix. The error is 1×4. The gradient would be 1×2. But what if W is another size?

    • David Praise Chukwuma Kalu

      Thanks again, Saleh, for asking your question.

      We used vectorization to speed up the Python code instead of using an explicit for loop. This helps minimize the runtime of our code, making it much more efficient.

      I strongly recommend you implement the same procedure yourself using a Python for loop and time how long each version takes to run, using a larger matrix.

      A good way to check that “X” and “W” have compatible sizes is to use the assert keyword available in Python.
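
As a sketch of the shape check discussed in this thread, the snippet below uses the 4 samples and 2 features from the question. The arrays are placeholders, and the column-vector layout for `W`, `y`, and the error is an assumption of this example.

```python
import numpy as np

X = np.ones((4, 2))                     # 4 samples, 2 features
y = np.zeros((4, 1))                    # targets as a column vector
W = np.zeros((2, 1))                    # one weight per feature

# The product X.dot(W) is only defined when these two sizes match.
assert X.shape[1] == W.shape[0], "columns of X must equal rows of W"

error = X.dot(W) - y                            # (4, 2) @ (2, 1) -> (4, 1)
gradient = (1 / X.shape[0]) * X.T.dot(error)    # (2, 4) @ (4, 1) -> (2, 1)

# Each entry gradient[i] lines up with the weight W[i].
assert gradient.shape == W.shape
print(error.shape, gradient.shape)
```

If `W` had any other number of rows, the first assertion would fail before the matrix product is attempted.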
