One of the most common regularization techniques shown to work well is L2 regularization.
In today’s tutorial, we will cover the fundamentals of this technique, which helps prevent our model from overfitting. Once you complete reading the blog, you will know that the:
To get a better idea of what this means, continue reading.
In previous blog posts, we discussed how gradient descent works, linear regression using gradient descent, and stochastic gradient descent. We have seen first hand how these algorithms learn the relationships within our data by iteratively updating their weight parameters.
While the weight parameters are updated after each iteration, they need to be appropriately tuned so that our trained model generalizes, that is, models the correct relationship and makes reliable predictions on unseen data.
Most importantly, besides modeling the correct relationship, we also need to prevent the model from memorizing the training set. One critical technique that has been shown to keep a model from overfitting is regularization. Another setting that matters is the learning rate; however, this tutorial focuses mainly on regularization.
Note: If you don’t understand the logic behind overfitting, refer to this tutorial.
We also have to be careful about how we use regularization. If too much regularization is applied, we can fall into the trap of underfitting.
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^{2}$
Here’s the equation of our cost function with the regularization term added. By taking the derivative of the regularized cost function with respect to each weight $\theta_{j}$ we get:
$\frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_{j}^{(i)} + \frac{\lambda}{m} \theta_{j}$
It’s essential to know that Ridge Regression is defined by a cost function made of the two terms in the first equation above: the familiar squared-error term and a penalty term.
The second term looks new; it is our regularization penalty, which includes λ and the squared weights (for a single feature, the slope squared).
The square inside the second term is what adds a penalty for large weights to our cost/loss function, and λ determines how strong that penalty will be.
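As a sanity check on the regularized cost and its derivative, the analytic gradient can be compared with a finite-difference estimate. The sketch below is one way to do that; the helper names `ridge_cost`, `ridge_gradient` and `numerical_gradient`, as well as the small random test data, are assumptions made for illustration and are not part of the tutorial’s script. Every entry of `W` is penalized here.

```python
import numpy as np


def ridge_cost(X, y, W, lambda_value):
    """Regularized cost: squared-error term plus L2 penalty, scaled by 1 / (2m)."""
    m = X.shape[0]
    error = X.dot(W) - y
    return (np.sum(error ** 2) + lambda_value * np.sum(W ** 2)) / (2 * m)


def ridge_gradient(X, y, W, lambda_value):
    """Analytic gradient of ridge_cost with respect to W."""
    m = X.shape[0]
    error = X.dot(W) - y
    return (X.T.dot(error) + lambda_value * W) / m


def numerical_gradient(X, y, W, lambda_value, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(W)
    for j in range(W.size):
        step = np.zeros_like(W)
        step[j] = eps
        grad[j] = (ridge_cost(X, y, W + step, lambda_value)
                   - ridge_cost(X, y, W - step, lambda_value)) / (2 * eps)
    return grad


# hypothetical test data, used only for this check
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 3)
y_demo = rng.rand(20)
W_demo = rng.randn(3)

analytic = ridge_gradient(X_demo, y_demo, W_demo, lambda_value=2.0)
numeric = numerical_gradient(X_demo, y_demo, W_demo, lambda_value=2.0)
max_gap = np.max(np.abs(analytic - numeric))
print(f"largest difference between the two gradients: {max_gap:.2e}")
```

If the two gradients agree up to floating-point noise, the derivative matches the cost function it was taken from.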
For the lambda value, it’s important to have this concept in mind:
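If λ is too large, the penalty dominates the cost, the weights are pushed toward zero, and the line becomes less sensitive to the data, so the model under-fits.
If λ is too small, the penalty is negligible, and the line fits the training data almost as closely as it would without regularization, so it can still overfit.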
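The effect of λ on the weights can be seen numerically. The sketch below does not use the gradient descent routine from this tutorial; it swaps in the closed-form ridge solution, which minimizes the same regularized cost, so that the result does not depend on the learning rate. The function name `ridge_closed_form`, the λ values and the bias-free data are assumptions chosen for illustration.

```python
import numpy as np


def ridge_closed_form(X, y, lambda_value):
    """Solve (X^T X + lambda * I) W = X^T y for W."""
    n_features = X.shape[1]
    A = X.T.dot(X) + lambda_value * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))


# hypothetical data: one feature, no bias column, true slope of 2
rng = np.random.RandomState(1)
x = 15 * rng.rand(50)
X_demo = x.reshape(-1, 1)
y_demo = 2 * x + rng.randn(50)

slopes = []
for lambda_value in [0.0, 10.0, 1000.0, 100000.0]:
    W = ridge_closed_form(X_demo, y_demo, lambda_value)
    slopes.append(W[0])
    print(f"lambda = {lambda_value:>8}: slope = {W[0]:.4f}")
```

With a single feature the fitted slope is the sum of x·y divided by (the sum of x² plus λ), so each increase in λ pulls the slope closer to zero.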
To choose an appropriate value for lambda, I suggest you perform cross-validation over different values of lambda and see which one gives you the lowest validation error.
Now that we understand the essential concept behind regularization, let’s implement this in Python on a randomized data sample.
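A minimal sketch of that cross-validation idea, using only NumPy, might look like the following. It is one possible setup rather than a prescribed one: the helpers `fit_ridge` and `cross_validated_mse`, the five folds, the candidate λ values and the closed-form fit are all assumptions made for illustration. As in this tutorial’s gradient descent code, the bias weight is penalized together with the other weights.

```python
import numpy as np


def fit_ridge(X, y, lambda_value):
    """Closed-form ridge fit; X is expected to already include a bias column."""
    A = X.T.dot(X) + lambda_value * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T.dot(y))


def cross_validated_mse(X, y, lambda_value, n_folds=5, seed=0):
    """Average validation MSE over n_folds shuffled folds."""
    indices = np.random.RandomState(seed).permutation(X.shape[0])
    folds = np.array_split(indices, n_folds)
    fold_errors = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        W = fit_ridge(X[train_idx], y[train_idx], lambda_value)
        residual = X[val_idx].dot(W) - y[val_idx]
        fold_errors.append(np.mean(residual ** 2))
    return float(np.mean(fold_errors))


# hypothetical data: noisy line with slope 2 and intercept -1
rng = np.random.RandomState(1)
x = 15 * rng.rand(50)
y_demo = 2 * x - 1 + rng.randn(50)
X_demo = np.column_stack((np.ones_like(x), x))  # bias column + feature

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 5000.0]
scores = {lam: cross_validated_mse(X_demo, y_demo, lam) for lam in candidate_lambdas}
best_lambda = min(scores, key=scores.get)
for lam, score in scores.items():
    print(f"lambda = {lam:>7}: validation MSE = {score:.3f}")
print(f"best lambda: {best_lambda}")
```

The same fold split is reused for every candidate (fixed `seed`), so the scores are compared on identical validation sets.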
Open up a brand new file, name it ridge_regression_gd.py, and insert the following code:
# import the necessary packages
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
sns.set(style='darkgrid')
Let’s begin by importing the Python libraries we need: NumPy, Seaborn and Matplotlib.
def ridge_regression(X, y, alpha=0.01, lambda_value=1, epochs=30):
    """
    :param X: feature matrix
    :param y: target vector
    :param alpha: learning rate (default: 0.01)
    :param lambda_value: lambda (default: 1)
    :param epochs: maximum number of iterations of the
           linear regression algorithm for a single run (default: 30)
    :return: weights, list of the cost function changing over time
    """
    m = np.shape(X)[0]  # total number of samples
    n = np.shape(X)[1]  # total number of features

    # prepend a column of ones so that W[0] acts as the bias term
    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    W = np.random.randn(n + 1, )

    # stores the updates on the cost function (loss function)
    cost_history_list = []

    # iterate until the maximum number of epochs
    for current_iteration in np.arange(epochs):  # begin the process
        # compute the dot product between our feature 'X' and weight 'W'
        y_estimated = X.dot(W)

        # calculate the difference between the actual and predicted value
        error = y_estimated - y

        # regularization term: lambda / (2m) * sum of squared weights
        # (note: this penalizes every entry of W, including the bias W[0])
        ridge_reg_term = (lambda_value / (2 * m)) * np.sum(np.square(W))

        # calculate the cost (MSE) + regularization term
        cost = (1 / (2 * m)) * np.sum(error ** 2) + ridge_reg_term

        # Update our gradient by the dot product between
        # the transpose of 'X' and our error + lambda value * W
        # divided by the total number of samples
        gradient = (1 / m) * (X.T.dot(error) + (lambda_value * W))

        # Now we have to update our weights
        W = W - alpha * gradient

        # Let's print out the cost to see how these values
        # change after every iteration
        print(f"cost:{cost} \t iteration: {current_iteration}")

        # keep track of the cost as it changes in each iteration
        cost_history_list.append(cost)

    return W, cost_history_list
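Within the ridge_regression function, we performed some initialization: we read the number of samples m and the number of features n, prepended a column of ones to X for the bias term, and drew the initial weights W at random.
For a more thorough walkthrough of this part, please see this tutorial.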
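To make the initialization step concrete, here is a small sketch on a toy array. The names `X_raw`, `X_with_bias` and the 4×2 shape are assumptions chosen for illustration; the two NumPy calls mirror the ones used inside ridge_regression.

```python
import numpy as np

rng = np.random.RandomState(0)
X_raw = rng.rand(4, 2)                 # 4 samples, 2 features

m, n = X_raw.shape
X_with_bias = np.concatenate((np.ones((m, 1)), X_raw), axis=1)
W = rng.randn(n + 1)                   # one weight per feature plus the bias

print(X_with_bias.shape)               # (4, 3)
print(W.shape)                         # (3,)

# With the column of ones in place, X.dot(W) equals bias + features . weights
predictions = X_with_bias.dot(W)
manual = W[0] + X_raw.dot(W[1:])
print(np.allclose(predictions, manual))  # True
```

The column of ones is what lets a single dot product handle both the intercept and the feature weights.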
This snippet’s major difference is the section inside the loop that computes the regularization term, the cost and the gradient. It includes the regularization term to penalize large weights, improving the ability of our model to generalize and reducing overfitting (variance).
def main():
    # generate a noisy linear regression problem
    rng = np.random.RandomState(1)
    x = 15 * rng.rand(50)
    X = x.reshape(-1, 1)
    y = 2 * x - 1 + rng.randn(50)

    lambda_list = [0.01, 5000]
    for lambda_ in lambda_list:
        # calls ridge regression function with different values of lambda
        weight, _ = ridge_regression(X, y, alpha=0.01,
                                     lambda_value=lambda_, epochs=5)

        # weight[0] is the bias, weight[1] is the slope
        fitted_line = np.dot(X, weight[1]) + weight[0]

        plt.scatter(X, y, label='data points')
        plt.plot(X, fitted_line, color='r', label='Fitted line')
        plt.xlabel("X")
        plt.ylabel("y")
        plt.title(f"Ridge Regression (lambda : {lambda_})")
        plt.legend()
        plt.show()


if __name__ == '__main__':
    main()
For the final step, let’s walk through what goes on within the main function. We first generated a regression problem: 50 random x values, reshaped into the feature matrix X, and a noisy linear target y.
We then created a list of lambda values, each of which is passed as an argument to ridge_regression inside the loop. The last block of code in the loop helps in visualizing how the line fits the data points with different values of lambda.
To visualize the plot, you can execute the following command:
python ridge_regression_gd.py
To summarize the difference between the two plots above: the value of lambda determines how large the penalty will be. As we can see from the second plot, with a large value of lambda our model tends to under-fit the training set.
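The under-fitting with a large lambda can also be put into numbers. The sketch below reruns gradient descent on the same synthetic data for both lambda values and reports the fitted slope and the training error. It is a compact stand-in for ridge_regression, not a copy of it: the helper names `ridge_gd` and `training_mse`, the zero initialization, and the values `alpha=0.005` and `epochs=500` are assumptions chosen so that both runs settle.

```python
import numpy as np


def ridge_gd(X, y, alpha, lambda_value, epochs):
    """Gradient descent on the regularized cost; returns [bias, slope, ...]."""
    m = X.shape[0]
    X_b = np.concatenate((np.ones((m, 1)), X), axis=1)
    W = np.zeros(X_b.shape[1])          # assumption: start from zeros
    for _ in range(epochs):
        error = X_b.dot(W) - y
        gradient = (X_b.T.dot(error) + lambda_value * W) / m
        W = W - alpha * gradient
    return W


def training_mse(X, y, W):
    """Mean squared error of the fitted line on the training data."""
    predictions = W[0] + X.dot(W[1:])
    return float(np.mean((predictions - y) ** 2))


# same synthetic data as in main()
rng = np.random.RandomState(1)
x = 15 * rng.rand(50)
X_demo = x.reshape(-1, 1)
y_demo = 2 * x - 1 + rng.randn(50)

results = {}
for lambda_ in [0.01, 5000]:
    # alpha and epochs are illustrative choices, not the tutorial's settings
    W = ridge_gd(X_demo, y_demo, alpha=0.005, lambda_value=lambda_, epochs=500)
    results[lambda_] = (W[1], training_mse(X_demo, y_demo, W))
    print(f"lambda = {lambda_}: slope = {W[1]:.3f}, "
          f"training MSE = {results[lambda_][1]:.3f}")
```

A much smaller slope together with a much larger training error for lambda = 5000 is the numerical counterpart of the flattened line in the second plot.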
Here are three types of regularization techniques you will commonly see applied directly to the loss function:
In this post, you discovered the underlying concept behind regularization and how to implement it yourself from scratch to understand how the algorithm works. You now know that:
Do you have any questions about Regularization or this post? Leave a comment and ask your question. I’ll do my best to answer.
He's an entrepreneur who loves Computer Vision and Machine Learning.
Get weekly data science tips from David Praise that keeps you more informed. It’s data science school in bite-sized chunks!