
Ace your Machine Learning Interview – Part 3 | by Marcello Politi | Oct, 2022


Dive into Naive Bayes Classifier using Python

This is the third article in my series “Ace your Machine Learning Interview”, in which I go over the foundations of Machine Learning. If you missed the first two articles, you can find them here:

Introduction

Naive Bayes is a Machine Learning algorithm used to solve classification problems, and it is so-called because it is based on Bayes’ theorem.

An algorithm referred to as a classifier assigns a class to each instance of data: for example, classifying whether an email is spam or not spam.

Bayes Theorem

Bayes’ Theorem is used to calculate the probability of a cause, given that the event it may have produced has been observed. The formula we have all studied in probability courses is the following.
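The original figure with the formula is not reproduced here, so I restate the standard form of the theorem:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```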

So this theorem answers the question: ‘What is the probability that event A will occur, given that event B has occurred?’ The interesting thing is that this formula turns the question around: we can estimate this probability by counting, in the data, how often B occurred whenever A had occurred. In other words, we can answer the original question by looking at the past (the data).

Naive Bayes Classifier

But how then do we apply this theorem to create a Machine Learning classifier? Suppose we have a dataset consisting of n features and a target.

Therefore, our question now is ‘What is the probability of having a certain label y given that those features occurred?’

For example if y = spam/not-spam, x1 = len(email), x2 = number_of_attachments we might ask :

‘What is the probability that y is spam given that x1 = 100 chars and x2 = 2 attachments?’

To answer this question we need only apply Bayes’ theorem, with A = {y} and B = {x1, x2, …, xn}.

But the classifier is not called Bayes Classifier but Naive Bayes Classifier. This is because a naive assumption is made to simplify the calculations, that is, the features are assumed to be independent of each other. This allows us to simplify the formula.
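Since the original figure with the simplified formula is not reproduced, here is a restored version: under the independence assumption, both the likelihood and the evidence factorize over the features, giving

```latex
P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1) \cdot P(x_2) \cdots P(x_n)}
```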

In this way, we can calculate the probability that y = spam. Next, we calculate the probability that y = not-spam and see which one is more likely. But if you think about it, between the two labels, the one with the higher probability will be the one with the larger numerator, since the denominator is always the same: P(x1) · P(x2) · …

For simplicity, we can therefore drop the denominator, since it does not affect the comparison.

Now we choose the class that maximizes this probability, which is exactly what argmax does.
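Restoring the decision rule from the missing figure, it reads:

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```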

Naive Bayes Classifier for Text Data

This algorithm is often used in the field of NLP for textual data. This is because we can treat the individual words that appear in a text as features, and the naive assumption is that these words are independent of each other (which, of course, is not actually true).

Suppose we have a dataset in which each row holds a single sentence, and each column tells us whether or not a given word appears in that sentence. We have eliminated uninformative words such as articles (stop words).

Now we can calculate the probability that a new sentence is good or bad in the following way.
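As a sketch of this setup (the sentences and labels below are made up for illustration, not taken from the article), a binary bag-of-words representation paired with a Bernoulli Naive Bayes model can be built in a few lines of scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy dataset: one sentence per row, each labeled good or bad
sentences = ["good great movie", "great fun", "bad awful movie", "awful boring"]
labels = ["good", "good", "bad", "bad"]

# binary=True: each column records whether a word appears, not how often
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)

model = BernoulliNB()  # uses Laplace smoothing (alpha=1.0) by default
model.fit(X, labels)

# Probability that a new sentence is good or bad
new = vectorizer.transform(["great movie"])
print(model.predict(new))        # -> ['good']
print(model.predict_proba(new))  # per-class probabilities, summing to 1
```

BernoulliNB fits the binary “does this word appear?” features described above; for raw word counts, MultinomialNB would be the usual choice.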

Let’s code!

Implementing the Naive Bayes algorithm in sklearn is very simple: just a few lines of code. We will use the well-known Iris dataset, which consists of four features: sepal length, sepal width, petal length, and petal width.
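The original code screenshot is not reproduced, so here is a minimal sketch. GaussianNB is used since the Iris features are continuous; the particular split and metric are my choices, not necessarily the author’s:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset: 150 samples, 4 continuous features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# GaussianNB models P(x_i | y) as a per-feature, per-class Gaussian
clf = GaussianNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))  # typically well above 0.9 on Iris
```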

Advantages

The main benefit of the Naive Bayes algorithm is its simplicity of use. Although it is a basic and dated algorithm, it still solves some classification problems excellently and efficiently, even if its applicability is limited to specific cases. Summarizing:

  • Works well with many features
  • Works well with large training datasets
  • Converges fast during training
  • Also performs well on categorical features
  • Robust to outliers

Disadvantages

On the drawbacks side, the following should be specially mentioned. The algorithm requires knowledge of all the probabilities involved in the problem, in particular the prior and conditional probabilities, and this is often difficult and expensive information to obtain. Moreover, the algorithm provides only a “naive” approximation of the problem, because it ignores the correlations between the features of an instance.

If a probability is zero because a feature value was never observed in the training data, you have to apply Laplace smoothing.
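With Laplace (add-one) smoothing, a zero count no longer zeroes out the whole product. Using my notation (not the article’s), the smoothed conditional estimate is

```latex
P(x_i \mid y) = \frac{\operatorname{count}(x_i, y) + \alpha}{\operatorname{count}(y) + \alpha K}
```

where alpha = 1 for classic Laplace smoothing and K is the number of possible values of the feature x_i.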

Handle Missing Values

You can simply skip (marginalize over) missing values. Suppose we toss a coin 3 times, but we forgot the result of the second toss. We can sum over all the possibilities for that second toss.
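Formally, “skipping” the missing value means marginalizing over it. For the three-toss example, with the second toss x2 unknown:

```latex
P(x_1, x_3) = \sum_{x_2 \in \{H, T\}} P(x_1, x_2, x_3)
```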

Naive Bayes is one of the main algorithms to know when approaching Machine Learning. It has been used heavily, especially on problems with text data, such as spam email recognition. As we have seen, it still has its advantages and disadvantages, but when you are asked about basic Machine Learning, certainly expect a question about it!

Marcello Politi

Linkedin, Twitter, CV



