Why Do Naive Bayes Classifiers Perform So Well?

The assumption of independence underlying naive Bayes classification may be unrealistic, but it doesn’t always hinder performance.

Naive Bayes classifiers work surprisingly well despite their naive assumption of conditional independence amongst input variables.

The assumed independence rarely holds in practice, so the probability estimates produced by naive Bayes (which lead to classification decisions) are rarely very good.

So, why are the classification decisions produced by naive Bayes so good when the probability estimates are not?

There are two reasons:

What matters is the relative size of the probability estimates rather than their absolute values¹. In naive Bayes, the relative values are what drive classification predictions.
The relative probability estimates lead to good classification predictions even when dependencies exist between the input variables (i.e. when input variables are not independent).

Why do naive Bayes classifiers work well even when dependencies exist?

Recent research² suggests a possible explanation for the second point above, i.e. why naive Bayes works well even when there are dependencies present.

The key is how the dependencies are distributed.

If the dependencies tend to work together, supporting certain classification decisions, then this can lead to good results.

Alternatively, if the dependencies tend to cancel each other out, then this reduces their net impact and also leads to good results.

This can apply even in the presence of strong dependencies. As long as the dependencies are either working together or cancelling each other out, naive Bayes will produce good results.

Why does naive Bayes classification assume independence?

Naive Bayes classification works by assuming conditional independence amongst input variables.

Why does this matter?

Without this assumption, the calculations for naive Bayes would be very difficult to do.

Unfortunately, the assumption of independence rarely holds in practice.

Consider applying naive Bayes to a document classification task. Here, we wish to classify documents into defined categories based on the words (input variables) in each document. Naive Bayes assumes that the words are independent of each other, but this is often not true.

Documents may contain words like “Hong” and “Kong”, for instance, or “London” and “English”. These words have strong associations with each other and are not independent.

Hence, naive Bayes incorrectly assumes independence even when it doesn’t apply in practice.

Nevertheless, as a practical compromise, naive Bayes assumes independence to make implementation easier.

And as we’ve discussed, naive Bayes produces very good results despite this unrealistic assumption.

The benefits of naive Bayes

Naive Bayes classification is a popular choice for classification and it performs well in a number of real-world applications.

Its key benefits are its simplicity, efficiency, ability to handle noisy data and support for multiple classes³. It also doesn’t require a large amount of data to work well.

Another important benefit of naive Bayes is that it is robust to missing data. This is because it is a generative probabilistic model.

Generative models work by “generalizing” the observed data, i.e. they draw conclusions based on a model of how the observed data was generated. This contrasts with a discriminative approach, which draws conclusions directly from the observed data without modeling how it arose. This “generalize-ability” makes generative models quite versatile and robust to missing data: in naive Bayes, an input whose value is missing can simply be left out of the product of probabilities.
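As a rough sketch of how missing data can be handled, the snippet below skips any input whose value is unknown when scoring a class. The feature names (x1, x2), the class labels (A, B), the probabilities and the function name class_score are all hypothetical, chosen only for illustration.

```python
def class_score(observed, p_given_class, prior):
    """Prior times the product of per-feature probabilities.

    `observed` maps feature name -> True/False, or None when the value
    is missing. Missing features are skipped, which is the same as
    summing over both of their possible values (the two terms add to 1).
    """
    score = prior
    for feature, value in observed.items():
        if value is None:
            continue  # missing value: contributes a factor of 1
        p = p_given_class[feature]
        score *= p if value else 1.0 - p
    return score


# Hypothetical per-class probabilities that each feature is present.
p_given_a = {"x1": 0.7, "x2": 0.2}
p_given_b = {"x1": 0.3, "x2": 0.6}

# x1 was observed as present; the value of x2 is missing.
observed = {"x1": True, "x2": None}

score_a = class_score(observed, p_given_a, prior=0.5)
score_b = class_score(observed, p_given_b, prior=0.5)
print("A" if score_a > score_b else "B")  # A
```

Because each input contributes its own separate factor, dropping one leaves the rest of the calculation untouched.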

A classic application of naive Bayes—Email spam filtering

Let’s look at a simple example to illustrate how naive Bayes works and how the independence assumption plays a role.

Email spam filtering is a well-known application of naive Bayes classification. Naive Bayes has been used for this since the 1990s and it’s still a popular choice due to its simplicity and effectiveness.

Bayes classification is a form of supervised learning, so a Bayes classifier learns (and improves) as more and more labeled emails pass through it. These emails would be labeled as either spam or ham (where a ham email is a “good” email, one that we wish to keep), and these labels are what “supervise” the learning of the classifier.

A simple illustration

Say we wish to classify an email as either spam or ham.

A basic naive Bayes classifier works by repeatedly applying Bayes rule to each of the words in the email (and the broader vocabulary) that we want to classify.

To learn more about Bayes rule, here’s a straightforward introduction to how Bayes rule works.

We’ll assume that our classifier has been trained on a history of emails and we wish to classify a new email.

Let’s assume:

Our data is based on an email history of 100 emails, of which 90 emails are ham and 10 emails are spam
We have a simple vocabulary consisting of only 5 words: “half”, “price”, “hot”, “coffee” and “mug”
The new email that we wish to classify consists of a single 3-word phrase: “half price mug”

Based on our email history, assume we have the following frequency counts:

Word	Number of spam emails containing the word	Number of ham emails containing the word
half	7	30
price	8	20
mug	5	60
hot	3	80
coffee	5	50

Email history frequency counts

Let’s first look at a single word, say “mug”.

Here, we’ll calculate the probability that an email is spam if it contains the word mug. We can use Bayes rule to do this.

By Bayes rule, for a given word in our email, the probability that our email is spam is:

P(spam | word) = [ P(word | spam) * P(spam) ] / P(word)

So, for the word mug:

P(spam | mug) = [ P(mug | spam) * P(spam) ] / P(mug)

This is a conditional probability, i.e. the probability that our email is spam given that the word mug appears in it.

Using the above data:

P(mug | spam) = number of spam emails containing the word mug / number of spam emails = 5 / 10 = 0.5
P(spam) = number of spam emails / total number of emails = 10 / 100 = 0.1
P(mug) = number of emails (spam or ham) containing the word mug / total number of emails = (5 + 60) / 100 = 0.65

So, P(spam | mug) = ( 0.5 * 0.1 ) / 0.65 = 0.077

So, based on our calculations, the probability that an email is spam, given that it contains the word mug, is 0.077.
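The single-word calculation can be reproduced in a few lines of Python. This is only a sketch using the counts from the email history table; the variable names are our own.

```python
# Frequency counts from the email history table.
N_SPAM, N_HAM = 10, 90
spam_with_mug, ham_with_mug = 5, 60

total = N_SPAM + N_HAM
p_mug_given_spam = spam_with_mug / N_SPAM        # 5 / 10
p_spam = N_SPAM / total                          # 10 / 100
p_mug = (spam_with_mug + ham_with_mug) / total   # 65 / 100

# Bayes rule: P(spam | mug) = P(mug | spam) * P(spam) / P(mug)
p_spam_given_mug = p_mug_given_spam * p_spam / p_mug
print(round(p_spam_given_mug, 3))  # 0.077
```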

3-word phrase

Now let’s look at the 3-word phrase in our new email—half price mug.

We wish to classify our email as spam or ham given that it contains the phrase “half price mug”.

Again, we’ll use Bayes rule and apply it to the whole phrase—half price mug—as follows:

Firstly, we’ll calculate the probability that our email is spam
Then, we’ll calculate the probability that our email is ham
We can then classify our email based on whether it has a higher probability of being spam or ham

By Bayes rule, we can calculate the probability that our email is spam (given that it contains the phrase half price mug) as follows:

P(spam | half price mug) = [ P(half price mug | spam) * P(spam) ] / P(half price mug)

Similarly, the probability that our email is ham is given by:

P(ham | half price mug) = [ P(half price mug | ham) * P(ham) ] / P(half price mug)

Calculating conditional probabilities

Consider P(half price mug | spam). This is the probability of seeing the phrase half price mug in a spam email.

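We can calculate this using the independence assumption of naive Bayes:

P(half price mug | spam) = P(half | spam) * P(price | spam) * P(mug | spam) * (1 – P(hot | spam)) * (1 – P(coffee | spam))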

The above equation calculates the probability of seeing the words half and price and mug, and not seeing the words hot and coffee, in a spam email. Recall that hot and coffee are the remaining words in our 5-word simple vocabulary.

Using our data:

P(half | spam) = 7 / 10 = 0.7
P(price | spam) = 8 / 10 = 0.8
P(mug | spam) = 5 / 10 = 0.5
P(hot | spam) = 3 / 10 = 0.3
P(coffee | spam) = 5 / 10 = 0.5

So, P(half price mug | spam) = 0.7 * 0.8 * 0.5 * (1 – 0.3) * (1 – 0.5) = 0.098

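Similarly, the probability of seeing half price mug in a ham email is:

P(half price mug | ham) = P(half | ham) * P(price | ham) * P(mug | ham) * (1 – P(hot | ham)) * (1 – P(coffee | ham))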

And using our data:

P(half | ham) = 30 / 90 = 0.33
P(price | ham) = 20 / 90 = 0.22
P(mug | ham) = 60 / 90 = 0.67
P(hot | ham) = 80 / 90 = 0.89
P(coffee | ham) = 50 / 90 = 0.56

So, P(half price mug | ham) = 0.33 * 0.22 * 0.67 * (1 – 0.89) * (1 – 0.56) = 0.0024
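The two phrase probabilities can be computed with one small helper. This is a minimal sketch, assuming the counts from the email history table; the function name phrase_likelihood is our own. It multiplies P(word | class) for vocabulary words that appear in the phrase and (1 – P(word | class)) for vocabulary words that do not.

```python
# Number of emails containing each word, from the email history table.
SPAM_COUNTS = {"half": 7, "price": 8, "mug": 5, "hot": 3, "coffee": 5}
HAM_COUNTS = {"half": 30, "price": 20, "mug": 60, "hot": 80, "coffee": 50}
N_SPAM, N_HAM = 10, 90


def phrase_likelihood(words, counts, n_emails):
    """P(phrase | class) under the naive independence assumption."""
    present = set(words)
    likelihood = 1.0
    for word, count in counts.items():
        p_word = count / n_emails
        # Present words contribute p, absent vocabulary words contribute 1 - p.
        likelihood *= p_word if word in present else 1.0 - p_word
    return likelihood


phrase = ["half", "price", "mug"]
lik_spam = phrase_likelihood(phrase, SPAM_COUNTS, N_SPAM)
lik_ham = phrase_likelihood(phrase, HAM_COUNTS, N_HAM)
print(round(lik_spam, 3), round(lik_ham, 4))  # 0.098 0.0024
```

Working from the raw counts avoids the small rounding differences that creep in when the intermediate probabilities are rounded to two decimal places.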

We also know:

P(spam) = 10 / 100 = 0.1
P(ham) = 90 / 100 = 0.9

Next, how do we calculate P(half price mug)?

This is the probability of seeing our 3-word phrase—half price mug—in any email.

Since we know from our history of emails that each email is either spam or ham, we can use conditional probabilities as follows:

P(half price mug) = P(half price mug | spam) * P(spam) + P(half price mug | ham) * P(ham) = 0.098 * 0.1 + 0.002 * 0.9 = 0.012
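As a quick check of this step, the sketch below recomputes P(half price mug) from unrounded values. The variable names are our own; the numbers come from the email history table.

```python
N_SPAM, N_HAM = 10, 90
p_spam = N_SPAM / (N_SPAM + N_HAM)
p_ham = N_HAM / (N_SPAM + N_HAM)

# Phrase probabilities for each class: present words contribute p,
# absent vocabulary words (hot, coffee) contribute 1 - p.
lik_spam = (7 / 10) * (8 / 10) * (5 / 10) * (1 - 3 / 10) * (1 - 5 / 10)
lik_ham = (30 / 90) * (20 / 90) * (60 / 90) * (1 - 80 / 90) * (1 - 50 / 90)

# Every email is either spam or ham, so sum over the two cases.
p_phrase = lik_spam * p_spam + lik_ham * p_ham
print(round(p_phrase, 3))  # 0.012
```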

Classifying

We now have all of the calculations that we need to classify our email.

The probability that our email is spam is:

P(spam | half price mug) = [ P(half price mug | spam) * P(spam) ] / P(half price mug) = [ 0.098 * 0.1 ] / 0.012 = 0.82

And, the probability that our email is ham is:

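P(ham | half price mug) = [ P(half price mug | ham) * P(ham) ] / P(half price mug) = [ 0.0024 * 0.9 ] / 0.012 = 0.18

Here P(half price mug | ham) is taken to two significant figures (0.0024). As a check, the spam and ham probabilities sum to 1 (0.82 + 0.18), as they should since every email is either spam or ham.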

We can see that the probability that our email is spam is higher than the probability that it’s ham. So, we classify our email as spam.
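The classification step can be sketched as follows. The function name classify is our own, and the ham phrase probability is entered as 0.0024 (two significant figures), which gives a ham probability of about 0.18. Note that the predicted class is picked from the unnormalized products of phrase probability and prior; dividing by P(half price mug) rescales both classes equally and so does not change which one is larger.

```python
def classify(likelihoods, priors):
    """Return (best_class, posteriors) given per-class likelihoods and priors."""
    # Unnormalized scores: P(phrase | class) * P(class)
    scores = {c: likelihoods[c] * priors[c] for c in likelihoods}
    # The shared denominator P(phrase) is the sum of the scores.
    evidence = sum(scores.values())
    posteriors = {c: s / evidence for c, s in scores.items()}
    # The decision only needs the relative size of the scores.
    best = max(scores, key=scores.get)
    return best, posteriors


likelihoods = {"spam": 0.098, "ham": 0.0024}
priors = {"spam": 0.1, "ham": 0.9}

label, posteriors = classify(likelihoods, priors)
print(label, {c: round(p, 2) for c, p in posteriors.items()})
# spam {'spam': 0.82, 'ham': 0.18}
```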

The independence assumption and why naive Bayes classification can work well

In our simple example we made the crucial assumption that the words in our vocabulary were independent of each other. This made the calculation easy—we simply multiplied our individual word probabilities together.

In reality, many words are not independent.

Think of the words hot and coffee for instance: there’s a clear association between these words in everyday usage. When was the last time you enjoyed a hot coffee, for instance?

As we’ve discussed, naive Bayes works well in lots of situations despite its naive independence assumption.

Let’s look at how this might work for our simple example.

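A key calculation in our example was:

P(half price mug | spam) = P(half | spam) * P(price | spam) * P(mug | spam) * (1 – P(hot | spam)) * (1 – P(coffee | spam))

This is a simple calculation due to the independence assumption. It would have been more complicated (or extremely difficult, depending on available data) if the independence assumption were not made.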

If we were to include the possible dependencies between words in this calculation (i.e. account for the extent to which the words are not independent), we would need to include some additional terms in the equation.

In many situations, these additional terms could cancel out.

The words hot and coffee, for instance, may interact positively. Other words, such as half and mug, may interact negatively.

If this were to occur, the interactions would tend to cancel out, and the classification decision would be largely unaffected. This is an illustration of why naive Bayes can work well in practice.

Further, since classification depends on relative probabilities between the classification classes, any distortions due to the independence assumption would need to be large enough to change the relativities between classes.

In our example, the probability that our email is spam was 0.82, far higher than the probability that it is ham. Since the two probabilities must sum to 1, the email is classified as spam as long as its spam probability stays above 0.5.

So, to change the classification decision, the spam probability would need to fall by more than 0.3. This would require a fairly significant distortion in the relative probabilities of spam vs ham, which is not always likely.

This illustrates how classification decisions can be fairly robust under a naive Bayes framework.
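To make this concrete, here is a small experiment under an assumption that is not part of the example above: suppose the vocabulary had a sixth word that always appears together with half, so it has exactly the same counts (7 of 10 spam emails, 30 of 90 ham emails). The two words are then completely dependent, and naive Bayes counts the same evidence twice. The sketch below (function and variable names are our own) shows that this distorts the probability estimate but leaves the decision unchanged.

```python
def posterior_spam(lik_spam, lik_ham, p_spam=0.1, p_ham=0.9):
    """P(spam | phrase) from the two phrase probabilities and the priors."""
    numerator = lik_spam * p_spam
    return numerator / (numerator + lik_ham * p_ham)


# Phrase probabilities from the worked example (unrounded).
lik_spam = (7 / 10) * (8 / 10) * (5 / 10) * (1 - 3 / 10) * (1 - 5 / 10)
lik_ham = (30 / 90) * (20 / 90) * (60 / 90) * (1 - 80 / 90) * (1 - 50 / 90)

base = posterior_spam(lik_spam, lik_ham)

# Hypothetical sixth word that duplicates "half": its factor is
# multiplied in a second time for each class.
duplicated = posterior_spam(lik_spam * (7 / 10), lik_ham * (30 / 90))

print(round(base, 2), round(duplicated, 2))  # 0.82 0.9
```

The estimate moves from about 0.82 to about 0.90, so the absolute probability is off, but both values are well above 0.5 and the email is classified as spam either way.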

In summary

Naive Bayes classifiers work well despite their underlying independence assumption rarely holding in practice
They work well due to (i) the importance of their relative, rather than absolute, probability estimates, and (ii) the way in which dependencies, when they do exist, are distributed
The dependencies can work together or cancel each other out in a way that does not impact classification decisions
The independence assumption underlying naive Bayes helps to make calculations easier—without it, naive Bayes would be very difficult to implement in many situations
A particular benefit of naive Bayes is its robustness to missing data: it is a generative probabilistic model that can accommodate missing data due to its ability to “generalize”

References

[1] C. D. Manning, P. Raghavan, H. Schutze, An introduction to information retrieval, Cambridge University Press, April 2009.

[2] H. Zhang, The optimality of naive Bayes, American Association for Artificial Intelligence, 2004.

[3] S. S. Y. Ng, Y. Xing and K. L. Tsui, A naive Bayes model for robust remaining useful life prediction of lithium-ion battery, Applied Energy, 118, p. 115, 2014.
