
Back to Basics, Part Tres: Logistic Regression | by Shreya Rao | Mar, 2023



Welcome back to the final installment of our Back to Basics series, where we’ll delve into another fundamental machine learning algorithm: Logistic Regression. In the previous two articles, we helped our friend Mark determine the ideal selling price for his 2400 feet² house using Linear Regression and Gradient Descent.

Today, Mark comes back to us again for help. He lives in a fancy neighborhood where he thinks houses below a certain size don’t sell, and he is worried that his house might not sell either. He asked us to help him determine how likely it is that his house will sell. This is where Logistic Regression comes into play.

Logistic Regression is a type of algorithm that predicts the probability of a binary outcome, such as whether a house will sell or not. Unlike Linear Regression, whose predictions are not constrained to any particular range, Logistic Regression predicts probabilities that always fall between 0% and 100%. Note the difference between the predictions a linear regression model and a logistic regression model make:

Image by author

Let’s delve deeper into how logistic regression works by determining the probability of selling houses with varying sizes.

We start our process again by collecting data about house sizes in Mark’s neighborhood and seeing if they sold or not.

Image by author

Now let’s plot these points:

Image by author

Rather than representing the outcome of the plot as a binary output, it may be more informative to represent it using probabilities since that is the quantity we are trying to predict.

We represent 100% probability as 1 and 0% probability as 0.

Image by author

In our previous article, we learned about linear regression and its ability to fit a line to our data. But can it work for our problem where the desired output is a probability? Let’s find out by attempting to fit a line using linear regression.

We know that the formula for the best-fitting line is:

Image by author
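For reference, in the notation used later in the article (z = β₀ + β₁·size), the line being fit here predicts the probability directly from the house size:

p = β₀ + β₁·size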

By following the steps outlined in linear regression, we can obtain optimal values for β₀ and β₁, which will result in the best-fitting line. Assuming we have done so, let’s take a look at the line that we have obtained:

Image by author

Based on this line, we can see that a house with a size just below 2700 feet² is predicted to have a 100% probability of being sold:

Image by author

…and a 2200 feet² house is predicted to have a 0% chance of being sold:

Image by author

…and a 2300 feet² house is predicted to have about a 20% probability of being sold:

Image by author

Alright, so far so good. But what if we have a house that is 2800 feet² in size?

Image by author

Uh… what does a probability above 100% mean? Would a house of this size be predicted to sell with a probability of 150%??

Weird. What about a house that’s 2100 feet²?

Image by author

Okay, clearly we have run into a problem: the predicted probability for a house with a size of 2100 feet² appears to be negative. This definitely does not make sense, and it points to an issue with using a standard linear regression line.

As we know, the range of probabilities is from 0 to 1, and we cannot exceed this range. So we need to find a way to constrain our predicted output to this range.

To solve this issue, we can pass our linear regression equation through a super cool machine called a *sigmoid function*. This machine transforms our predicted values to fall between 0 and 1. We input our z value (where z = β₀ + β₁·size) into the machine…

Image by author

…and out comes a fancy-looking new equation.

NOTE: The e in the output is Euler’s number, a constant approximately equal to 2.718.

A math-ier way of representing the sigmoid function:

Image by author
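Spelled out in the same notation, the sigmoid takes any real number z and squeezes it into the interval (0, 1), so the predicted probability becomes:

sigmoid(z) = 1 / (1 + e^(−z)), where z = β₀ + β₁·size

p = 1 / (1 + e^(−(β₀ + β₁·size)))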

If we plot this, we see that the sigmoid function squeezes the straight line into an s-shaped curve confined between 0 and 1.

Image by author
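Here is a tiny Python sketch of that squashing behavior (the z values are arbitrary and just for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# A straight line can output anything from -infinity to +infinity...
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# ...but after passing through the sigmoid, every value lands strictly between 0 and 1
print(sigmoid(z))  # approximately [0.00005, 0.12, 0.5, 0.88, 0.99995]
```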

Optional note for all my math-heads: You might be wondering why and how we used the sigmoid function to get our desired output. Let’s break it down.

We started with the incorrect assumption that using the linear regression formula will give us our desired probability.

The issue with this assumption is that (β₀ + β₁·size) has a range of (-∞,+∞), while p has a range of [0,1]. So we need to find a quantity, expressed in terms of p, whose range matches that of (β₀ + β₁·size).

To overcome this issue, we can set the line equal to the “log odds”, log(p / (1 − p)), because we know that the log odds has a range of (-∞,+∞).

Now that we have done that, it’s just a matter of rearranging this equation to solve for p.
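Sketching out that rearrangement step by step (using z = β₀ + β₁·size as shorthand):

log(p / (1 − p)) = z
p / (1 − p) = e^z
p = (1 − p)·e^z
p·(1 + e^z) = e^z
p = e^z / (1 + e^z) = 1 / (1 + e^(−z))

which is exactly the sigmoid form from above.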

Now that we know how to modify the linear regression line so that it fits our output constraints, we can return to our original problem.

We need to determine the optimal curve for our dataset. To achieve this, we need to identify the optimal values for β₀ and β₁ (because these are the only values in the predicted probability equation that will change the shape of the curve).

Similar to linear regression, we will leverage a cost function and the gradient descent algorithm to obtain suitable values for these coefficients. The key distinction, however, is that we will not be employing the MSE cost function used in linear regression. Instead, we will be using a different cost function called Log Loss, which we will explore in greater detail shortly.

Say we use gradient descent and the Log Loss cost function and find that our optimal values are β₀ = -120.6 and β₁ = 0.051. Then our predicted probability equation will be:

Image by author

And the corresponding optimal curve is:

Image by author

With this new curve, we can now tackle Mark’s problem. By looking at it, we can see that a house with a size of 2400 feet²…

Image by author

…has a predicted probability of approximately 78%. Therefore, we can tell Mark not to worry because it looks like his house is pretty likely to sell.

We can further enhance our approach by developing a Classification Algorithm. A classification algorithm is commonly used in machine learning to assign data points to discrete categories. In our case, we have two categories: houses that will sell and houses that will not sell.

To develop a classification algorithm, we need to define a threshold probability value. This threshold probability value separates the predicted probabilities into two categories, “yes, the house will sell” and “no, the house will not sell.” Typically, 50% (or 0.5) is used as the threshold value.

If the predicted probability for a house size is above 50%, it will be classified as “will sell,” and if it’s below 50%, it will be classified as “won’t sell.”

Image by author
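A minimal sketch of this predict-then-threshold step in Python, using the rounded coefficients from above (because they are rounded, the computed probability can differ a bit from the ~78% read off the curve):

```python
import math

# Rounded coefficients from the fitted curve above
beta_0, beta_1 = -120.6, 0.051

def predicted_probability(size):
    # Logistic regression prediction: sigmoid of the linear combination
    z = beta_0 + beta_1 * size
    return 1 / (1 + math.exp(-z))

def will_sell(size, threshold=0.5):
    # Classification: compare the predicted probability to the 50% threshold
    return predicted_probability(size) >= threshold

p = predicted_probability(2400)  # Mark's house
label = "will sell" if will_sell(2400) else "won't sell"
print(f"P(sell | 2400 ft²) = {p:.2f} -> {label}")
```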

And that’s about it. That’s how we can use logistic regression to solve our problem. Now let’s understand the cost function we used to find optimal values for logistic regression.

Cost Function

In linear regression, the cost was based on how far off the line was from our data points. In logistic regression, the cost function similarly depends on how far off our predicted probabilities are from the actual outcomes.

If we used the MSE cost function (like we did in linear regression) in logistic regression, we would end up with a non-convex (fancy term for a not-so-pretty-curve-that-can’t-be-used-effectively-in-gradient-descent) cost function curve that can be difficult to optimize.

Image by author

And as you may recall from our discussion on gradient descent, it is much easier to optimize a convex curve (i.e., a curve with a single, distinct minimum point) like this than a non-convex one.

Image by author

To achieve a convex cost function curve, we use a cost function called Log Loss.

To break down the Log Loss cost function, we need to define separate costs for when the house actually sold (y=1) and when it did not (y=0).

If y = 1 and we predicted 1 (i.e., a 100% probability that it sold), there is no penalty. However, if we predicted 0 (i.e., a 0% probability that it sold), then we get penalized heavily.

Image by author

Similarly, if y = 0 and we predicted a high probability of the house selling, we should be penalized heavily, and if we predicted a low probability of the house selling, there should be a lower penalty. The more off we are, the more it costs us.

Image by author
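In symbols, using p for the predicted probability of selling, these two penalties are the standard per-example log-loss terms:

cost = −log(p)   if the house sold (y = 1)
cost = −log(1 − p)   if it did not sell (y = 0)

Either way, the cost is 0 when the prediction is exactly right and grows without bound the further off it is.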

To compute the cost for all houses in our dataset, we can average the costs of all the individual predictions like this:

Image by author

By cleverly rewriting the two equations, we can combine them into one to give us our Log Loss cost function.

Image by author
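Written out, the combined, averaged cost over all n houses in the dataset is the standard Log Loss (also known as binary cross-entropy):

J(β₀, β₁) = −(1/n) · Σᵢ [ yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ) ]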

This works because one of the two terms will always be zero (since y is either 0 or 1), so only the other term contributes to the cost.

By combining both cost graphs, we get a convex graph with only one minimum point, making it easy to use gradient descent to find the optimal values of β₀ and β₁…

Image by author

…and in turn, the optimal curve for our logistic regression.

Now that we have a good understanding of the math and intuition behind logistic regression, let’s see how Mark’s house size problem can be implemented in Python.
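As a rough sketch of what that implementation could look like with scikit-learn (the neighborhood data below is made up purely for illustration, so the fitted coefficients and probabilities won’t match the article’s figures exactly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: house sizes (ft²) and whether each one sold (1) or not (0)
sizes = np.array([[2100], [2200], [2250], [2300], [2350],
                  [2450], [2500], [2600], [2700], [2800]])
sold = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Fit the logistic regression model; weak regularization and extra iterations
# because the single feature is left unscaled here
model = LogisticRegression(C=1000, max_iter=10_000)
model.fit(sizes, sold)

# Predicted probability that Mark's 2400 ft² house sells, plus the 50%-threshold class
prob_sell = model.predict_proba([[2400]])[0, 1]
label = "will sell" if model.predict([[2400]])[0] == 1 else "won't sell"
print(f"P(sell) ≈ {prob_sell:.2f} -> {label}")
```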

