A question you often hear is: is Logistic Regression really a regression algorithm, as the name suggests?
Last week I ran a Twitter poll about the Logistic Regression algorithm, and around 64.1% of the audience answered correctly.
What is Logistic Regression? 👀#Neuraspike #MachineLearning #DeepLearning #DataScience #Python
— David Praise Chukwuma Kalu (@DavidPraiseKalu) October 10, 2020
In today’s tutorial, we will cover what Logistic Regression is and how to think about it. We will work through the mathematical formulas and derivations, then walk through an implementation of the algorithm in Python from scratch. Finally, we will look at some pros and cons of the algorithm.
To get a better understanding, continue reading.
Before continuing with the tutorial: yesterday was my birthday, and I would love to thank everyone who sent birthday wishes. If you would like to support me on this cheerful day and share the joy, I accept all kinds of gifts. God bless.
Logistic Regression is one of the most poorly named algorithms in the field of machine learning. From the name, you might assume it’s a regression method. It isn’t: it’s a classification method.
Don’t let the name confuse you. It’s one of the most widely used classification algorithms. In medicine, for instance, it is used to predict whether a patient is likely to die from a particular pathological state; in finance, whether a bank will lose a customer because of the services provided.
Lastly, what I love about Logistic Regression is that it is not only a classification method but also a solid foundation for understanding neural networks, which we will discuss in future tutorials.
Before going into the tech talk behind the algorithm, let’s walk through an example. Let’s say you want to build a machine learning model to predict whether a customer is likely to cancel their monthly subscription based on the services provided to them. Such a model is vital for large subscription businesses that need to identify the customers most at risk of churning.
Now we want to model $p(\text{churn} \mid x)$, “the probability that the customer churns given some data about the person.”
Let’s say the feature matrix $X$ consists of $x_{1}$, the person’s age, $x_{2}$, the monthly charges billed to the person, and $x_{3}$, the internet speed plan offered. Taking a linear combination of these feature variables gives us
$ w_{0}x_{0} + w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} = W^{T} X $
where $x_{0} = 1$, so that in vector form $X = [1, x_{1}, x_{2}, x_{3}]$.
The main issue with the formula we have modeled above is that it isn’t a probability: its output can be any real number rather than a value between 0 and 1. It’s merely a plane, as we have seen in the tutorials related to linear regression.
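To make this concrete, here is a minimal sketch with made-up numbers. The weights and the two customers below are hypothetical values chosen only for illustration; the point is that the raw linear combination can land anywhere on the real line, so it cannot be read as a probability.

```python
import numpy as np

# Hypothetical weights [w0, w1, w2, w3]; illustrative values, not learned ones
w = np.array([-1.0, 0.05, 0.02, -0.01])

# Hypothetical customers: [x0 = 1 (bias), age, monthly charges, internet speed plan]
customers = np.array([
    [1.0, 25.0, 30.0, 100.0],
    [1.0, 60.0, 120.0, 10.0],
])

# Linear combination W^T X for each customer
scores = customers @ w
print(scores)
```

With these assumed values, one score comes out negative and the other is greater than 1, and neither is a valid probability.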
The good news is that we can fix this by passing our equation through the sigmoid function (an “S-shaped” curve), which squashes any real number into the range between 0 and 1. Once that’s done, we can model our probability this way:
$p(\text{churn} \mid x) = \sigma(W^{T}X) = \frac{1}{1 + e^{-W^{T}X}}$
Taking a closer look at the weights: if $w_{1}$ is large and positive, an increase in $x_{1}$ raises the probability that the user churns. If $w_{3}$ is large and negative, an increase in $x_{3}$ lowers that probability.
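The effect of the weights can be sketched in a few lines. The customer and the baseline weights here are hypothetical, and because every feature value in this example is positive, increasing a weight pushes the probability up while making it more negative pushes the probability down.

```python
import numpy as np

def sigmoid(z):
    # squash any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical customer: [1 (bias), age, monthly charges, internet speed plan]
x = np.array([1.0, 40.0, 70.0, 50.0])

# Hypothetical baseline weights [w0, w1, w2, w3]
w = np.array([-1.0, 0.02, 0.01, -0.01])
p_base = sigmoid(w @ x)

# Make w1 (age) more positive: the churn probability goes up
w_up = w.copy()
w_up[1] += 0.05
p_up = sigmoid(w_up @ x)

# Make w3 (internet speed plan) more negative: the churn probability goes down
w_down = w.copy()
w_down[3] -= 0.05
p_down = sigmoid(w_down @ x)

print(f"baseline: {p_base:.3f}, larger w1: {p_up:.3f}, more negative w3: {p_down:.3f}")
```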
Here’s the code snippet used to visualize the sigmoid function.
# import matplotlib, numpy
from matplotlib import pyplot as plt
import numpy as np
# create evenly spaced numbers over a specified interval
x = np.linspace(-5, 5, 100)
z = 1/(1 + np.exp(-x))
# visualizing the plot
plt.plot(x, z)
plt.xlabel("x")
plt.ylabel("Sigmoid(x)")
plt.show()
The question now is: given our training data, how do we find the best parameters $w$? We must define a cost function that measures how good or bad a chosen $w$ is, and for this, logistic regression uses maximum likelihood estimation.
The likelihood is $p(y \mid X, w)$, which reads as “the probability of the observed churn labels given the data and a set of parameters”.
$p(y \mid X, w) = \prod_{i=1}^{m} \hat{y}_{i}^{\,y_{i}} \left( 1 - \hat{y}_{i} \right)^{1 - y_{i}}$
where $\hat{y}_{i} = \sigma(w^{T}x_{i})$.
Next, we want to maximize this function. Setting its derivative to zero gives no closed-form solution for $w$, because $w$ is wrapped inside a non-linear function. Instead, we take the negative log-likelihood, which turns the product into a sum, and minimize it. Minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
$L(w) = -\log p(y \mid X, w) = -\sum_{i=1}^{m} \left[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right]$
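As a quick sanity check, the sketch below computes both the likelihood (as a product) and the negative log-likelihood (as a sum) and confirms that they agree. The labels and predicted probabilities are made-up values for illustration only.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for five customers
y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Likelihood: y_hat where the label is 1, (1 - y_hat) where the label is 0
likelihood = np.prod(y_hat ** y * (1 - y_hat) ** (1 - y))

# Negative log-likelihood written as a sum over the samples
nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(likelihood, nll, -np.log(likelihood))
```

The last two printed values match, which is all that taking the logarithm of a product promises.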
To take the derivative of the equation above, let’s first compute two intermediate results and then plug them into the equation.
Let’s solve for $\log(\hat{y})$:
$\log \hat{y} = \log\left( \sigma(w^{T}x) \right)$
$= \log\left( \frac{1}{1+e^{-w^{T}x}} \right)$
$= \log\left( 1 \right) - \log\left( 1+e^{-w^{T}x} \right)$
$= -\log\left( 1+e^{-w^{T}x} \right)$
$\frac{\partial}{\partial w_{j}} \log \hat{y} = \frac{x_{j} e^{-w^{T}x}}{1 + e^{-w^{T}x}} = x_{j}\left( 1 - \hat{y} \right)$
Let’s solve for $\log(1 - \hat{y})$:
$\log(1 - \hat{y}) = \log\left( \frac{e^{-w^{T}x}}{1+e^{-w^{T}x}} \right) = -w^{T}x - \log\left( 1+e^{-w^{T}x} \right)$
$\frac{\partial}{\partial w_{j}} \log\left( 1-\hat{y} \right) = -x_{j} + x_{j}\left( 1-\hat{y} \right) = -\hat{y}\,x_{j}$
Now that we have $\log(\hat{y})$ and $\log(1 - \hat{y})$ along with their partial derivatives, let’s compute the derivative of our cost/loss function $L(w)$ by plugging these results into the equation.
$L(w) = -\sum_{i=1}^{m} \left[ y_{i} \log(\hat{y}_{i}) + (1-y_{i}) \log(1 - \hat{y}_{i}) \right]$
$\frac{\partial}{\partial w_{j}} L(w) = -\sum_{i=1}^{m} \left[ y_{i}\,x_{ij}\,(1 - \hat{y}_{i}) - (1-y_{i})\,x_{ij}\,\hat{y}_{i} \right]$
$\frac{\partial}{\partial w_{j}} L(w) = -\sum_{i=1}^{m} \left( y_{i}\,x_{ij} - y_{i}\,x_{ij}\,\hat{y}_{i} - x_{ij}\,\hat{y}_{i} + y_{i}\,x_{ij}\,\hat{y}_{i} \right)$
$\frac{\partial}{\partial w_{j}} L(w) = \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right) x_{ij}$
Writing the result above in vectorized form gives:
$\frac{\partial}{\partial w} L(w) = X^{T}\left( \hat{y} - y \right)$
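If you would like to convince yourself that the vectorized gradient is right, one option is a finite-difference check. The sketch below is only an illustration: the random data, the helper names (`loss`, `analytic_gradient`, `numerical_gradient`), and the step size `eps` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, X, y):
    # negative log-likelihood L(w), summed over the samples
    y_hat = sigmoid(X @ w)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_gradient(w, X, y):
    # vectorized result of the derivation: X^T (y_hat - y)
    return X.T @ (sigmoid(X @ w) - y)

def numerical_gradient(w, X, y, eps=1e-6):
    # central finite differences, one weight at a time
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return grad

# Small random problem: 20 samples, a bias column plus 3 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)

grad_a = analytic_gradient(w, X, y)
grad_n = numerical_gradient(w, X, y)
print("largest difference:", np.max(np.abs(grad_a - grad_n)))
```

The two gradients should agree up to a small numerical error, which is what we expect if the derivation holds.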
Now that we understand the essential concepts behind logistic regression, let’s implement it in Python on a randomized data sample.
Open up a brand new file, name it logistic_regression_gd.py, and insert the following code:
# import the necessary packages
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
import seaborn as sns
sns.set(style='darkgrid')
Let’s begin by importing the Python libraries we need: NumPy, Seaborn, scikit-learn, and Matplotlib.
def sigmoid(z):
    """
    :param z: input value
    :return: the sigmoid activation value for a given input value
    """
    return 1 / (1 + np.exp(-z))
Next, let’s define the sigmoid activation function we have discussed above.
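One practical caveat: for very negative inputs, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the final result is still correct. The variant below is a sketch of one common workaround, not part of the original script; the name `stable_sigmoid` is made up for this example, and it expects array input.

```python
import numpy as np

def stable_sigmoid(z):
    # hypothetical helper: evaluates the sigmoid without overflowing np.exp
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # for z >= 0, exp(-z) is at most 1, so this form is safe
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    # for z < 0, use the equivalent form exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1 + exp_z)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = stable_sigmoid(z)
print(probs)
```

If SciPy is available, `scipy.special.expit` computes the same function.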
The logistic_regression function below follows the gradient descent procedure, which we have evaluated more thoroughly in a separate tutorial.
def logistic_regression(X, y, alpha=0.01, epochs=30):
    """
    :param X: feature matrix
    :param y: target vector
    :param alpha: learning rate (default: 0.01)
    :param epochs: maximum number of iterations of the
           logistic regression algorithm for a single run (default: 30)
    :return: weights, list of the cost function changing over time
    """
    m = np.shape(X)[0]  # total number of samples
    n = np.shape(X)[1]  # total number of features
    # prepend a column of ones so that W[0] acts as the bias term
    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    # initialize the weights randomly
    W = np.random.randn(n + 1, )
    # stores the updates on the cost function (loss function)
    cost_history_list = []
    # iterate until the maximum number of epochs
    for current_iteration in np.arange(epochs):  # begin the process
        # compute the dot product between our feature 'X' and weight 'W'
        # then pass the value into our sigmoid activation function
        y_estimated = sigmoid(X.dot(W))
        # calculate the difference between the predicted and actual value
        error = y_estimated - y
        # calculate the cost (negative log-likelihood, averaged
        # over the samples)
        cost = np.mean(-y * np.log(y_estimated) - (1 - y) *
                       np.log(1 - y_estimated))
        # compute the gradient as the dot product between
        # the transpose of 'X' and our error, divided by the
        # total number of samples
        gradient = (1 / m) * X.T.dot(error)
        # now we have to update our weights
        W = W - alpha * gradient
        # let's print out the cost to see how these values
        # change after every 10th iteration
        if current_iteration % 10 == 0:
            print(f"cost:{cost} \t iteration: {current_iteration}")
        # keep track of the cost as it changes in each iteration
        cost_history_list.append(cost)
    return W, cost_history_list
Compared with the Linear Regression implementation, the differences in this section of the code are that the predictions pass through the sigmoid function and that the cost is computed differently. Rather than using the mean squared error, we use the negative log-likelihood that comes from maximum likelihood estimation.
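To get a rough feel for how the two costs behave, the sketch below compares the squared error and the log loss for a single positive example as the prediction gets worse. The predicted probabilities are made-up values, and this is only an illustration rather than part of the derivation.

```python
import numpy as np

# One positive example (y = 1) and increasingly wrong predictions
y = 1.0
y_hat = np.array([0.9, 0.5, 0.1, 0.01])

# Squared error can never exceed 1 for probabilities in [0, 1]
squared_error = (y - y_hat) ** 2

# Log loss keeps growing as the prediction approaches the wrong label
log_loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

for p, se, ll in zip(y_hat, squared_error, log_loss):
    print(f"y_hat={p:.2f}  squared error={se:.4f}  log loss={ll:.4f}")
```

The squared error stays below 1 however wrong the prediction is, while the log loss penalizes confident mistakes far more heavily.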
def main():
    # generate a binary classification problem with 150 samples,
    # where each of the samples is a 2D feature vector
    (X, y) = make_blobs(n_samples=150, centers=2, n_features=2,
                        random_state=20)
    # calls the logistic regression function
    weight, cost_history_list = logistic_regression(X, y, alpha=0.01,
                                                    epochs=100)
    # compute the decision boundary, where the predicted probability
    # is 0.5, i.e. 0 = w0 + w1*x1 + w2*x2, and solve for x2
    # in terms of x1 ==> x2 = (-w0 - (w1*x1)) / w2
    (W0, W1, W2) = weight
    Y = (-W0 - (W1 * X[:, 0])) / W2
    # plot the original data along with the decision boundary
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cm.jet)
    plt.plot(X[:, 0], Y, "r-")
    plt.xlabel('feature 1')
    plt.ylabel('feature 2')
    plt.show()
    # visualize how our cost decreases over time
    plt.plot(np.arange(len(cost_history_list)), cost_history_list)
    plt.xlabel("Number of iterations (Epochs)")
    plt.ylabel("Cost function J(Θ)")
    plt.title("Training Loss")
    plt.show()
if __name__ == '__main__':
    main()
For the final step, let’s walk through what goes on within the main function. On line 4, we generate a 2D classification problem.
On line 7, we call the logistic regression function, passing in the learning rate (alpha) and the number of iterations (epochs) as arguments.
The last block of code, lines 12 to 27, visualizes how the line fits the data points and how the cost function changes with each iteration.
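The script stops at plotting, but the learned weights can also be used to classify new points. The sketch below shows one way to do that, assuming a 0.5 threshold; the `predict` helper, the weights, and the sample points are all hypothetical and are not values produced by the training run.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, W, threshold=0.5):
    # hypothetical helper: add the bias column, then threshold the probability
    m = X.shape[0]
    X_bias = np.concatenate((np.ones((m, 1)), X), axis=1)
    probabilities = sigmoid(X_bias.dot(W))
    return (probabilities >= threshold).astype(int), probabilities

# Illustrative weights [W0, W1, W2]; not the values learned above
W = np.array([0.0, 1.0, -1.0])

# Three new 2D points: one on each side of the boundary and one exactly on it
X_new = np.array([[3.0, 1.0], [1.0, 3.0], [2.0, 2.0]])

labels, probabilities = predict(X_new, W)
print(labels, probabilities)
```

A point exactly on the line gets a probability of 0.5, which is why the line plotted above separates the two classes.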
To visualize the plots, you can execute the following command:
python logistic_regression_gd.py
cost:8.717283298964574 iteration: 0
cost:6.150359538880984 iteration: 10
cost:3.5851526065213877 iteration: 20
cost:1.1520496009688626 iteration: 30
cost:0.2428097273192674 iteration: 40
cost:0.1574861180320667 iteration: 50
cost:0.13914352445338024 iteration: 60
cost:0.12951331940791816 iteration: 70
cost:0.12207152058668873 iteration: 80
cost:0.11568313090569869 iteration: 90
Some interesting things I find fascinating about this algorithm are:
However, alongside the benefits of the algorithm, there are always some drawbacks, such as:
In this post, you discovered the basic concept behind logistic regression, worked through an example along with the formulas and equations, implemented the algorithm in a Python script, and saw some of its pros and cons.
Do you have any questions about Logistic Regression or this post? Leave a comment and ask your question. I’ll do my best to answer.