How to Implement L2 Regularization with Python

Figure 1: How different values of lambda affect the regression line.

One of the most common regularization techniques, and one that works well in practice, is L2 Regularization.

In today’s tutorial, we will cover the fundamentals of this technique and how it prevents our model from overfitting. Once you finish reading this blog post, you will know:

  • How L2 Regularization adds 𝜆 (read as lambda) * the sum of the squared weights to the sum of squared residuals.
  • Essential concepts and terminology you must know. 
  • How to implement the regularization term from scratch.
  • Finally, other types of regularization techniques. 

To get a better idea of what this means, continue reading.

What is Regularization and Why Do We Need It?

Over the past weeks, we have discussed in previous blog posts how gradient descent works, linear regression using gradient descent, and stochastic gradient descent. We have seen firsthand how these algorithms learn the relationships within our data by iteratively updating their weight parameters.

While the weight parameters are updated after each iteration, they need to be appropriately tuned so the trained model generalizes, that is, it models the correct relationship and makes reliable predictions on unseen data.

Most importantly, besides modeling the correct relationship, we also need to prevent the model from memorizing the training set. One critical technique that has been shown to keep our model from overfitting is regularization. (Another important knob is the learning rate; however, we focus mainly on regularization in this tutorial.)

Note: If you don’t understand the logic behind overfitting, refer to this tutorial.

We also have to be careful about how we use the regularization technique. If too much regularization is applied, we can fall into the trap of underfitting.

Ridge Regression

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_{j}^{2}$

 

Here’s the equation of our cost function with the regularization term added. By taking the derivative of the regularized cost function with respect to the weights we get:

 

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} + \frac{\lambda}{m} \theta_j$

It’s essential to know that Ridge Regression is defined by a cost function with two terms, as shown in the equation above:

  • The first term should look familiar: it is the average cost/loss over the training set (the sum of squared residuals).
  • The second term is new: this is our regularization penalty term, which includes 𝜆 and the squared weights (the slope squared, in the single-feature case).

The squared weights in the second term add a penalty to our cost/loss function, and 𝜆 determines how strong that penalty will be.
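To make this concrete, here’s a tiny sketch (using made-up residuals, weights, and lambda values rather than the tutorial’s dataset) that evaluates the regularized cost from the equation above for a few values of 𝜆:

# a minimal worked example of the regularized cost (all values are made up)
import numpy as np

m = 4                                            # number of training samples
residuals = np.array([0.5, -1.0, 0.25, 0.75])    # h_theta(x_i) - y_i for each sample
theta = np.array([3.0, -2.0])                    # example weights (bias excluded)

for lambda_ in [0.0, 1.0, 100.0]:
    mse_term = (1 / (2 * m)) * np.sum(residuals ** 2)
    penalty = (lambda_ / (2 * m)) * np.sum(theta ** 2)
    print(f"lambda={lambda_:>5}: {mse_term:.4f} + {penalty:.4f} = {mse_term + penalty:.4f}")

Notice that the first term never changes, while the penalty grows linearly with 𝜆 and quadratically with the weights; this is exactly why a larger 𝜆 pushes the model toward smaller weights.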

For the lambda value, it’s important to keep the following in mind:

  • If 𝜆 is too large, the penalty dominates, the weights shrink toward zero, and the line becomes less sensitive to the training data (it can underfit).
  • If 𝜆 = 0, we are only minimizing the first term, which is just ordinary (unregularized) linear regression.
  • If 𝜆 is small, the penalty has little effect, and the line can still overfit the training data.

To choose an appropriate value for lambda, I suggest performing cross-validation over a range of lambda values and picking the one that gives the lowest validation error.
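If you’d like a quick way to run this comparison, here is a minimal sketch using scikit-learn (a library we don’t use anywhere else in this tutorial); note that Ridge’s alpha parameter plays the role of our lambda:

# a minimal cross-validation sketch for picking lambda (uses scikit-learn)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)
x = 15 * rng.rand(50)
X = x.reshape(-1, 1)
y = 2 * x - 1 + rng.randn(50)

for lambda_ in [0.01, 0.1, 1, 10, 100, 1000]:
    model = Ridge(alpha=lambda_)
    # 5-fold cross-validation scored by negative mean squared error
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"lambda={lambda_:>7}: mean CV MSE = {-scores.mean():.4f}")

The lambda with the lowest mean cross-validated error is the one you would keep.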

Applying Ridge Regression with Python

Now that we understand the essential concept behind regularization, let’s implement it in Python on a randomized data sample.

Open up a brand new file, name it ridge_regression_gd.py, and insert the following code:

# import the necessary packages
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

sns.set(style='darkgrid')

Let’s begin by importing the Python libraries we need: NumPy, Seaborn, and Matplotlib.

def ridge_regression(X, y, alpha=0.01, lambda_value=1, epochs=30):
    """
    :param X: feature matrix
    :param y: target vector
    :param alpha: learning rate (default:0.01)
    :param lambda_value: lambda (default:1)
    :param epochs: maximum number of iterations of the
           linear regression algorithm for a single run (default=30)
    :return: weights, list of the cost function changing overtime
    """

    m = np.shape(X)[0]  # total number of samples
    n = np.shape(X)[1]  # total number of features

    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    W = np.random.randn(n + 1, )

    # stores the updates on the cost function (loss function)
    cost_history_list = []

    # iterate until the maximum number of epochs
    for current_iteration in np.arange(epochs):  # begin the process

        # compute the dot product between our feature 'X' and weight 'W'
        y_estimated = X.dot(W)

        # calculate the difference between the actual and predicted value
        error = y_estimated - y

        # regularization term
        ridge_reg_term = (lambda_value / (2 * m)) * np.sum(np.square(W))

        # calculate the cost (MSE) + regularization term
        cost = (1 / (2 * m)) * np.sum(error ** 2) + ridge_reg_term

        # Update our gradient by the dot product between
        # the transpose of 'X' and our error + lambda value * W
        # divided by the total number of samples
        gradient = (1 / m) * (X.T.dot(error) + (lambda_value * W))

        # Now we have to update our weights
        W = W - alpha * gradient

        # Let's print out the cost to see how these values
        # changes after every iteration
        print(f"cost:{cost} \t iteration: {current_iteration}")

        # keep track the cost as it changes in each iteration
        cost_history_list.append(cost)

    return W, cost_history_list

Within the ridge_regression function, we first perform some initialization: we add a column of ones to X for the bias term and randomly initialize the weight vector W.

For an extra thorough evaluation of this area, please see this tutorial.

This snippet’s major difference from plain linear regression is the regularization term added to both the cost and the gradient: it penalizes large weights, improving our model’s ability to generalize and reducing overfitting (variance).
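As an optional sanity check (my own addition, not part of the original tutorial), the same regularized cost has a closed-form minimizer, W = (XᵀX + 𝜆I)⁻¹Xᵀy, which you can compare against the weights returned by ridge_regression once it has run for enough epochs:

# closed-form solution of the same regularized cost (optional sanity check)
import numpy as np

def closed_form_ridge(X, y, lambda_value=1):
    m = X.shape[0]
    Xb = np.concatenate((np.ones((m, 1)), X), axis=1)   # add the bias column
    identity = np.eye(Xb.shape[1])
    # solve (Xb^T Xb + lambda * I) W = Xb^T y
    return np.linalg.solve(Xb.T.dot(Xb) + lambda_value * identity, Xb.T.dot(y))

With a suitable learning rate and enough epochs, the gradient-descent weights should approach this solution (note that, like the code above, this version also penalizes the bias term).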

def main():
    rng = np.random.RandomState(1)
    x = 15 * rng.rand(50)
    X = x.reshape(-1, 1)

    y = 2 * x - 1 + rng.randn(50)

    lambda_list = [0.01, 5000]

    for lambda_ in lambda_list:
        
        # calls ridge regression function with different values of lambda
        weight, _ = ridge_regression(X, y, alpha=0.01,
                                     lambda_value=lambda_, epochs=5)

        fitted_line = np.dot(X, weight[1]) + weight[0]
        plt.scatter(X, y, label='data points')
        plt.plot(X, fitted_line, color='r', label='Fitted line')
        plt.xlabel("X")
        plt.ylabel("y")
        plt.title(f"Ridge Regression (lambda : {lambda_})")
        plt.legend()
        plt.show()


if __name__ == '__main__':
    main()

For the final step, let’s walk through what goes on within the main function. First, we generate a simple regression problem: 50 random x values scaled to the range [0, 15] and targets y = 2x - 1 with added Gaussian noise.

We then create a list of lambda values and pass each one as an argument to ridge_regression. The last block of code inside the loop plots the data points together with the fitted line for each value of lambda, so we can see how the fit changes.

To visualize the plot, you can execute the following command:

python ridge_regression_gd.py
Figure 2: Fitted line with lambda = 0.01
Figure 3: Fitted line with lambda = 5000

To summarize the difference between the two plots above: the value of lambda determines how strong the penalty is. As we can see from the second plot, with a large value of lambda our model tends to underfit the training set.

Types of Regularization Techniques

Here are three common types of regularization techniques you will see applied directly to the loss function:

  • L1 Regularization (Lasso): adds 𝜆 * the sum of the absolute values of the weights, which can drive some weights exactly to zero.
  • L2 Regularization (Ridge): adds 𝜆 * the sum of the squared weights, which is what we implemented in this tutorial.
  • Elastic Net: combines the L1 and L2 penalties.
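To make these penalties concrete, here is a minimal sketch (with a made-up weight vector and a mixing ratio of my own choosing) that computes each penalty term in NumPy:

# computing the three penalty terms for an example weight vector (made up)
import numpy as np

theta = np.array([3.0, -2.0, 0.5])    # example weights
lambda_ = 1.0
l1_ratio = 0.5                        # how much L1 vs L2 to mix in Elastic Net

l1_penalty = lambda_ * np.sum(np.abs(theta))        # Lasso penalty
l2_penalty = lambda_ * np.sum(np.square(theta))     # Ridge penalty
elastic_net_penalty = lambda_ * (l1_ratio * np.sum(np.abs(theta))
                                 + (1 - l1_ratio) * np.sum(np.square(theta)))

print(f"L1: {l1_penalty}, L2: {l2_penalty}, Elastic Net: {elastic_net_penalty}")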

Conclusion

In this post, you discovered the underlying concept behind regularization and how to implement it yourself from scratch to understand how the algorithm works. You now know:

  • That L2 Regularization adds lambda * the sum of the squared weights to the sum of squared residuals.
  • How to choose an appropriate lambda value.
  • How to implement the regularization term from scratch in Python.
  • A brief overview of other regularization techniques.

Do you have any questions about Regularization or this post? Leave a comment and ask your question. I’ll do my best to answer.


