
How to implement multiclass Logistic Regression



An introduction to multiclass logistic regression with theory and Python implementation

Decision boundary of Logistic Regression. Image by author.

Contents

This post is a part of a series of posts that I will be making. Underneath you can see an overview of the series.

1. Introduction to machine learning

2. Regression

3. Classification

Setup and objective

So far we’ve gone over generative classifiers (QDA, LDA, and Naive Bayes), but now we’ll turn our attention to a discriminative classifier: logistic regression. As mentioned in 3(a), the overview of classifiers, logistic regression is a discriminative classifier, meaning that it models the conditional probability distribution of the target, P(t|x), directly.

But why is it called logistic regression and not logistic classification if it’s a classification model? Well, the answer is simply that we’re regressing the conditional probability — I know, it’s confusing.

Before moving on to logistic regression, I recommend that you have a good grasp of linear regression, as the two models are very similar. If you’ve read my post about linear regression, you’ll have an easy time, as I’m using the same terminology and notation.

Given a training dataset of N input variables x with corresponding target variables t, logistic regression assumes that

where c is any integer from 1 to C, with C denoting the number of classes.
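The equation image for (1) did not survive in this copy; assuming one weight vector w꜀ per class, which matches the notation used in the rest of the post, the standard softmax form it describes is

P(t = c \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^{\top}\mathbf{x})}{\sum_{j=1}^{C} \exp(\mathbf{w}_j^{\top}\mathbf{x})} \qquad (1)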

The right-hand side of (1) is the softmax function. It is the function we use to transform the linear combinations wᵀx into probabilities between 0 and 1 that sum to 1 over the classes.

To understand what (1) really means, let’s look at the special case where C=2, i.e., we have 2 classes. We would now typically rewrite (1) as

where σ refers to the logistic sigmoid function, which is where the name logistic regression comes from.
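The rewritten two-class equation is also missing here; in its standard form (with a single weight vector w, which can be taken as the difference of the two class weight vectors), it reads

P(t = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^{\top}\mathbf{x})}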

Now, the logistic sigmoid function looks like this:

[Figure: plot of the logistic sigmoid function]
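Since the figure itself did not come through, here is a minimal Python sketch (using numpy and matplotlib; the original post may have produced the plot differently) that reproduces the curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    # Logistic sigmoid: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-10, 10, 200)
plt.plot(a, sigmoid(a))
plt.xlabel("a = wᵀx")
plt.ylabel("σ(a)")
plt.title("Logistic sigmoid function")
plt.show()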

As you can see, the y-values lie between 0 and 1, and the larger the input, the closer the output is to 1. This means that the larger wᵀx is, the higher the probability that the point (x, t) belongs to class 1. The same idea carries over to more than 2 classes: whichever class c gives the highest probability in (1) is the class we predict for t.
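The equation image for this decision rule is missing from this copy; in standard notation it reads

\hat{t} = \arg\max_{c \in \{1, \dots, C\}} P(t = c \mid \mathbf{x})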

Derivation and training

So, how do we find the values of W? Well, just like with linear regression, we’ll use maximum likelihood estimation to find the values of our parameters. This works by writing down the likelihood function, taking its derivative, and finding for which values of the parameters the derivative is equal to 0, as this is where the likelihood is maximized.

To make notation easier, let tₙ denote a C-dimensional one-hot vector, where the component tₙ꜀ is 1 if observation n belongs to class c, and all other components are 0. We can now write the likelihood as
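(the likelihood equation did not survive in this copy; given the one-hot targets defined above, its standard form is)

L(\mathbf{W}) = \prod_{n=1}^{N} \prod_{c=1}^{C} P(t = c \mid \mathbf{x}_n)^{t_{nc}}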

Just like with linear regression, we’ll now take the negative logarithm of the likelihood to help our derivation, since minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
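Equation (3), which is missing from this copy, takes the standard form

E(\mathbf{W}) = -\ln L(\mathbf{W}) = -\sum_{n=1}^{N} \sum_{c=1}^{C} t_{nc} \ln P(t = c \mid \mathbf{x}_n) \qquad (3)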

Now that we have the negative log-likelihood, we’re going to compute its gradient (its derivative) and find for which values of w it is equal to 0. The function in (3) is also called the cross-entropy loss function.

Before we begin the derivation, let’s define some terms to make our notation simpler. Firstly, let us denote the softmax function as
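(The defining equation is missing here; presumably it introduces the shorthand below, where aₙ꜀ = w꜀ᵀxₙ is the linear combination for observation n and class c.)

y_{nc} = \mathrm{softmax}(\mathbf{a}_n)_c = \frac{\exp(a_{nc})}{\sum_{j=1}^{C} \exp(a_{nj})} = P(t = c \mid \mathbf{x}_n)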

Secondly, let E denote the function from (3)

Now, using the chain rule, we can determine the gradient of E
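Equation (4) is also missing; in one standard arrangement (the exact ordering in the original may differ), the chain rule gives

\frac{\partial E}{\partial \mathbf{w}_c} = \sum_{n=1}^{N} \sum_{i=1}^{C} \frac{\partial E}{\partial y_{ni}} \, \frac{\partial y_{ni}}{\partial a_{nc}} \, \frac{\partial a_{nc}}{\partial \mathbf{w}_c} \qquad (4)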

Starting from the right, we have

Next up, we have the derivative of the softmax function

but this is only when i=c. In the case where they are not equal, we get

We can combine (6) and (7) to
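(reconstructed here with the Kronecker delta δᵢ꜀, which is 1 if i = c and 0 otherwise)

\frac{\partial y_{ni}}{\partial a_{nc}} = y_{ni}\,(\delta_{ic} - y_{nc}) \qquad (8)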

Lastly, we have

and now we finally have all the pieces of (4) to put together. Putting (5), (8), and (9) into (4) gives us the following
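The resulting gradient (reconstructed here; it is the well-known gradient of the cross-entropy loss for softmax regression) is

\frac{\partial E}{\partial \mathbf{w}_c} = \sum_{n=1}^{N} (y_{nc} - t_{nc}) \, \mathbf{x}_n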

After all that work, we’ve finally found the gradient of the negative log-likelihood. What we need to do now is figure out for which values of w it is equal to 0, because the maxima and minima of a function are found where its gradient is 0.

The problem is that we cannot find a closed-form solution for this, so we’ll need an algorithm that finds where the gradient is equal to 0 for us. The algorithm we’ll use is called gradient descent. It works by using the following equation:
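(The update equation is missing from this copy; its standard form is)

\mathbf{w}_c^{(\text{new})} = \mathbf{w}_c^{(\text{old})} - \eta \, \frac{\partial E}{\partial \mathbf{w}_c}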

where η is called the learning rate and is a hyperparameter. I’ll write a post detailing gradient descent in the future, but briefly: if the learning rate is too high you’ll overshoot the minimum of the loss function, and if it’s too low it will take too long to reach the minimum.

Python implementation

The code underneath is a simple implementation of the multiclass logistic regression with gradient descent that we just went over.
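The original code embed did not come through in this copy. Below is a minimal sketch of what such an implementation could look like, using only numpy; the function names, the synthetic data, and the hyperparameter values are my own illustrative choices, not necessarily the author’s.

import numpy as np

def softmax(A):
    # Row-wise softmax; subtracting the row maximum keeps exp() numerically stable
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def one_hot(t, C):
    # Encode integer class labels 0..C-1 as one-hot rows
    T = np.zeros((t.shape[0], C))
    T[np.arange(t.shape[0]), t] = 1.0
    return T

def fit_logistic_regression(X, t, C, lr=0.5, n_iters=2000):
    # X: (N, D) inputs (prepend a column of ones for a bias term)
    # t: (N,) integer class labels, C: number of classes
    N, D = X.shape
    T = one_hot(t, C)
    W = np.zeros((D, C))
    for _ in range(n_iters):
        Y = softmax(X @ W)          # (N, C) predicted class probabilities
        grad = X.T @ (Y - T) / N    # cross-entropy gradient, averaged over N
        W -= lr * grad              # gradient descent update
    return W

def predict(X, W):
    # Pick the class with the highest predicted probability for each input
    return np.argmax(softmax(X @ W), axis=1)

# Usage example on synthetic, Gaussian-distributed clusters (purely illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in ([0, 0], [4, 0], [2, 4])])
t = np.repeat([0, 1, 2], 50)
X_bias = np.hstack([np.ones((X.shape[0], 1)), X])   # bias column

W = fit_logistic_regression(X_bias, t, C=3)
print("training accuracy:", np.mean(predict(X_bias, W) == t))

Averaging the gradient over N simply rescales the learning rate; it keeps the step size comparable across dataset sizes and matches the summed gradient derived above up to a constant factor.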

Underneath is a chart with the data points (color coded to match their respective classes) and the decision boundaries generated by the logistic regression.

Decision boundary produced by Logistic Regression with Gradient Descent optimisation. Image by author.

Summary

  • Logistic regression is a classification model.
  • Logistic regression is a discriminative classifier.
  • If we have 2 classes, we use the logistic sigmoid function to transform our linear function into probabilities.
  • The softmax function is the generalisation of the logistic sigmoid function to multiple classes.
  • The negative log-likelihood in logistic regression can also be referred to as the cross-entropy loss function.
  • There is no closed-form solution to logistic regression, hence we use gradient descent.
  • The learning rate and number of iterations are hyperparameters that you will have to tweak.

