In the last two tutorials, I specified the equation of the mean squared error $J(\theta)$, which measures how wrong the model is in modeling the relationship between $X$ and $y$.
Continuing from the previous tutorial on the gradient descent algorithm, you will learn how to perform linear regression with gradient descent in Python, using the mean squared error $J(\theta)$ as the cost function.
Not only will you gain a deeper understanding of the algorithm, as promised, but you will also implement a Python script yourself.
First, let’s review some ideas concerning the topic.
What we want to achieve is to update our $\theta$ value in a way that the cost function gradually decreases over time.
To accomplish this, what should we actually do? This is a specific question we have to decide on: should we reduce or increase the values of $\theta$?
Considering the graph, we must take a step in the negative direction (going downwards).
Every step we take gives us a new value of $\theta$. When we obtain a new value, we evaluate the gradient at that value, then repeat taking the same kind of step until we are at the global minimum (the blue dot).
A straightforward illustration of the gradient descent algorithm is as follows:
initialize theta
# keep updating our theta values while we haven't converged
do {
    theta = theta - alpha * (gradient of J(theta))
} while (not converged)
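To make the loop concrete, here is a minimal one-dimensional sketch (a toy example of my own, not the code we will build below) that minimizes $J(\theta) = \theta^2$, whose gradient is $2\theta$:

```python
# minimal 1-D gradient descent on J(theta) = theta**2 (assumed toy example)
theta = 5.0   # arbitrary starting value
alpha = 0.1   # learning rate
for _ in range(100):
    gradient = 2 * theta              # dJ/dtheta
    theta = theta - alpha * gradient  # the update rule from the pseudocode
print(theta)  # shrinks toward the minimum at theta = 0
```

Each iteration multiplies theta by $(1 - 2\alpha)$, so with $\alpha = 0.1$ the value shrinks by 20% per step and rapidly approaches zero.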
$J(\theta) = \frac{1}{2m} \sum_{j} (\theta x^{(j)} - y^{(j)})^2$ [1.0]
For linear regression, we can compute the Mean Squared Error cost function and then compute its gradient.
$J(\theta) = \frac{1}{2m} \sum_{j} (\theta_{0}x_{0}^{(j)} + \theta_{1}x_{1}^{(j)} + \cdots + \theta_{n}x_{n}^{(j)} - y^{(j)})^2$ [1.1]
Let’s expand our prediction into each term and denote the error residual $e_{j}(\theta) = \theta_{0}x_{0}^{(j)} + \theta_{1}x_{1}^{(j)} + \cdots + \theta_{n}x_{n}^{(j)} - y^{(j)}$ to simplify the way we write it.
Taking the derivative of this equation sounds a little more tricky if you’ve been out of calculus class for a while. However, don’t stress too much. I’ve got you covered. Let’s begin by showing you the derivation procedure.
$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \frac{1}{2m} \sum_{j} e_{j}(\theta)^2$ [1.2]
$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} \frac{\partial}{\partial \theta} e_{j}(\theta)^2$ [1.3]
To move from formula [1.2] to [1.3], we need two basic derivative rules: the scalar multiple rule and the sum rule.
Scalar multiple rule: $\hspace{10mm} \frac{d}{dx} (\alpha\hspace{0.3mm} u) = \alpha \frac{du}{dx}$
Sum rule: $ \hspace{10mm} \frac{d}{dx} (\sum \hspace{0.3mm} u) = \sum \frac{du}{dx}$
$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \sum_{j} 2\, e_{j}(\theta)\, \frac{\partial}{\partial \theta} e_{j}(\theta)$ [1.4]
Next, to get from [1.3] to [1.4], we applied both the power rule and the chain rule. If you’re not familiar with these rules, you can find them below:
Power rule: $\hspace{10mm} \frac{d}{dx} u ^n = nu^{n-1} \frac{du}{dx}$
Chain rule: $\hspace{10mm} \frac{d}{dx} f(g(x)) = {f}'(g(x))\hspace{1mm}{g}'(x)$
$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{j} e_{j}(\theta)\, \frac{\partial}{\partial \theta} e_{j}(\theta)$ [1.5]
Lastly, to go from [1.4] to [1.5], the factor of 2 from the power rule cancels with the $\frac{1}{2}$ in the $\frac{1}{2m}$ coefficient. Equation [1.5] gives us the partial derivative of the MSE $J(\theta)$ with respect to our $\theta$ variables.
We can make this concrete by taking the partial derivative of the error residual with respect to $\theta_{0}$.
$\frac{\partial}{\partial \theta_{0}} e_{j}(\theta) = \frac{\partial}{\partial \theta_{0}}\theta_{0}x_{0}^{(j)} + \frac{\partial}{\partial \theta_{0}}\theta_{1}x_{1}^{(j)} + \cdots + \frac{\partial}{\partial \theta_{0}}\theta_{n}x_{n}^{(j)} - \frac{\partial}{\partial \theta_{0}}y^{(j)} = x_{0}^{(j)}$
To find a given partial derivative of the function, we treat the other variables as constants. For example, when taking the partial derivative with respect to $\theta_{0}$, all other variables are treated as constants $(y, \theta_{1}, x_{1}, \ldots, x_{n})$.
We can repeat this for the other parameters $\theta_{1}, \theta_{2}, \ldots, \theta_{n}$, and instead of the feature $x_{0}^{(j)}$ we will obtain the features $x_{1}^{(j)}, x_{2}^{(j)}, \ldots, x_{n}^{(j)}$.
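If you want to convince yourself that the derivation is right, a quick numerical check (my addition, not part of the tutorial) is to compare the analytic gradient $\frac{1}{m}\sum_{j} e_{j}(\theta)\, x^{(j)}$ against centered finite differences of $J(\theta)$; the data and coefficients below are made up for the example:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(20, 3)                    # 20 samples, 3 features (assumed sizes)
y = X.dot(np.array([1.0, -2.0, 0.5]))  # targets from known coefficients
theta = rng.randn(3)                   # arbitrary parameter vector
m = len(y)

def J(t):
    # the MSE cost from equation [1.1]
    e = X.dot(t) - y
    return (1 / (2 * m)) * np.sum(e ** 2)

# the analytic gradient from equation [1.5]
analytic = (1 / m) * X.T.dot(X.dot(theta) - y)

# centered finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (J(theta + eps * np.eye(3)[i]) - J(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True if the derivation holds
```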
Now that we know the basic concept behind gradient descent and the mean squared error, let’s implement what we have learned in Python.
Open up a new file, name it linear_regression_gradient_descent.py, and insert the following code:
# import the necessary packages
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
sns.set(style='darkgrid')
Let’s start by importing our required Python libraries: Matplotlib, NumPy, and Seaborn.
def gradient_descent(X, y, alpha=0.01, epochs=30):
    """
    :param X: feature matrix
    :param y: target vector
    :param alpha: learning rate (default: 0.01)
    :param epochs: maximum number of iterations of the
        linear regression algorithm for a single run (default: 30)
    :return: weights, list of the cost function changing over time
    """
    m = np.shape(X)[0]  # total number of samples
    n = np.shape(X)[1]  # total number of features
    # add a column of ones to X for the intercept term
    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    # randomly initialize the weights (intercept + one per feature)
    W = np.random.randn(n + 1, )
    # stores the updates on the cost function (loss function)
    cost_history_list = []
Next, let’s randomly initialize our weights $W$, which hold both the intercept coefficient and the regression coefficients, and concatenate a column of ones to $X$ so the intercept is handled by the same dot product.
One thing that might puzzle you by now is: why initialize our weights randomly? Why not set them all to 0’s or 1’s?
For linear regression the cost function is convex, so the starting point mainly affects how many iterations convergence takes; random initialization is the general habit carried over from models such as neural networks, where starting all weights at the same value would keep the parameters from learning different functions.
Once those values have been randomly initialized, let’s define some constants based on the size of our dataset and an empty list to keep track of the cost function as it changes on each iteration.
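As a small illustration of this setup (the sizes and numbers here are made up for the example), concatenating the column of ones gives $X$ one extra column, and the weight vector gets a matching extra entry for the intercept:

```python
import numpy as np

m, n = 4, 2  # 4 samples, 2 features (toy sizes, assumed for the example)
X = np.arange(m * n, dtype=float).reshape(m, n)
X = np.concatenate((np.ones((m, 1)), X), axis=1)  # prepend intercept column
W = np.random.randn(n + 1, )  # one weight per column, including the intercept
print(X.shape, W.shape)  # (4, 3) (3,)
```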
    # iterate until the maximum number of epochs
    for current_iteration in np.arange(epochs):  # begin the process
        # compute the dot product between our feature matrix 'X' and weights 'W'
        y_estimated = X.dot(W)
        # calculate the difference between the actual and predicted value
        error = y_estimated - y
        # calculate the cost (mean squared error - MSE)
        cost = (1 / (2 * m)) * np.sum(error ** 2)
While iterating, until we reach the maximum number of epochs, we calculate the estimated value y_estimated, the dot product of our feature matrix $X$ and the weights $W$.
With these predictions, the next step is to determine the error, simply the difference between our estimated value y_estimated and our actual value y, and then use it to compute the cost, which measures how well the model with the current set of weights performs on the training data.
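For intuition, here is that cost computed by hand on three made-up predictions (the values are assumed for the example, not taken from our dataset):

```python
import numpy as np

y = np.array([1.0, 4.0, 8.0])            # actual values (assumed)
y_estimated = np.array([2.0, 4.0, 6.0])  # predictions (assumed)
error = y_estimated - y                  # [1, 0, -2]
m = len(y)
cost = (1 / (2 * m)) * np.sum(error ** 2)  # (1 + 0 + 4) / 6
print(cost)  # 0.8333...
```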
        # update our gradient: the dot product between the transpose
        # of 'X' and our error, divided by the total number of samples
        gradient = (1 / m) * X.T.dot(error)
        # now we have to update our weights
        W = W - alpha * gradient
The next step is to compute the gradient, the partial derivative of the cost with respect to the weights $W$. Once we have the gradient, we adjust the previous values of $W$: the gradient multiplied by the learning rate alpha is subtracted from the previous weight vector.
        # let's print out the cost to see how these values
        # change after every 10th iteration
        if current_iteration % 10 == 0:
            print(f"cost:{cost} \t iteration: {current_iteration}")

        # keep track of the cost as it changes in each iteration
        cost_history_list.append(cost)

    return W, cost_history_list
A conditional statement is used to print how the cost function changes every 10 epochs on the console.
To keep track of exactly how the cost function changes over time, we append its value to the list cost_history_list on each iteration. Finally, we return our parameters $W$ and cost_history_list, populated with the changes in our cost function over time.
def main():
    rng = np.random.RandomState(10)
    X = 10 * rng.rand(100, 5)  # feature matrix
    y = 0.9 + np.dot(X, [2.2, 4., -4, 1, 2])  # target vector

    # calls the gradient descent method
    weight, cost_history_list = gradient_descent(X, y, alpha=0.01,
                                                 epochs=100)

    # visualize how our cost decreases over time
    plt.plot(np.arange(len(cost_history_list)), cost_history_list)
    plt.xlabel("Number of iterations (Epochs)")
    plt.ylabel("Cost function J(Θ)")
    plt.title("Gradient Descent")
    plt.show()


if __name__ == '__main__':
    main()
For the final step, to walk you through what goes on within the main function, we generate a regression problem on lines 2 – 4. We have a total of 100 data points, each of which is 5-D.
Our goal is to correctly map these randomized feature training samples $(100\times5)$ (the independent variables) to our target vector $(100\times1)$ (the dependent variable).
Within line 7, we pass the feature matrix, target vector, learning rate, and the number of epochs to the gradient_descent function, which returns the best parameters and the saved history of the loss/cost after each iteration.
The last block of code from lines 10 – 14 helps visualize how the cost changes on each iteration.
To check the cost modifications from your command line, you can execute the following command:
python linear_regression_gradient_descent.py
cost:287.0624491941226 iteration: 0
cost:14.560702085484721 iteration: 10
cost:1.7394518134046495 iteration: 20
cost:0.2149654826207572 iteration: 30
cost:0.03261941755250162 iteration: 40
cost:0.010519630973147717 iteration: 50
cost:0.007723993590601086 iteration: 60
cost:0.007296453135073557 iteration: 70
cost:0.007171480871397247 iteration: 80
cost:0.007089080664015975 iteration: 90
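Since the synthetic targets in main are an exact linear function of the features, a useful sanity check (my addition, using NumPy’s least-squares solver rather than anything from the tutorial) is to compute the closed-form solution the weights should converge toward:

```python
import numpy as np

# same synthetic regression problem as in main
rng = np.random.RandomState(10)
X = 10 * rng.rand(100, 5)
y = 0.9 + np.dot(X, [2.2, 4., -4, 1, 2])

# same intercept column as in gradient_descent
Xb = np.concatenate((np.ones((100, 1)), X), axis=1)
w_exact, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(np.round(w_exact, 2))  # recovers [0.9, 2.2, 4., -4., 1., 2.]
```

With enough epochs, the weights returned by gradient_descent should approach these values.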
In my view, gradient descent is a practical algorithm; however, there are some caveats you should know, such as the need to choose a suitable learning rate: too large and the cost may diverge, too small and convergence becomes painfully slow.
To conclude, in this tutorial you discovered how to implement linear regression using gradient descent in Python. You learned how to derive the gradient of the mean squared error cost function and how to turn that derivation into a vectorized NumPy implementation.
Do you have any questions concerning this post or gradient descent? Leave a comment and ask your question. I’ll do my best to answer.