Is Naive Bayes Classification Supervised Or Unsupervised?
What kind of learning is naive Bayes classification—and why?
Naive Bayes classification is a form of supervised learning. It is supervised because naive Bayes classifiers are trained on labeled data, i.e., data that has been pre-categorized into the classes available for classification.
This contrasts with unsupervised learning, where there is no pre-labeled data available. This type of learning seeks to find natural structures that may be present in a data set, with no pre-determined knowledge about how to classify the data.
Training a simple naive Bayes classifier involves learning the probabilities that underpin the classification tasks the classifier performs. Using pre-labeled data, these probabilities can be estimated from frequency counts of the data points, broken down by the classes they fall into (i.e., by their pre-labeled classes).
More complex Bayes classification models assume generative distributions for the data being classified. Their training may involve learning the parameters of the assumed distributions rather than raw frequency counts.
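To make the frequency-count idea concrete, here is a minimal sketch in Python, using a tiny made-up labeled corpus (the emails, labels and variable names are illustrative assumptions, not part of the worked example that follows):

```python
from collections import Counter

# A minimal sketch of frequency-based training on a tiny, made-up labeled corpus.
labeled_emails = [
    ({"chores", "pending"}, "WIP"),
    ({"pending", "action"}, "WIP"),
    ({"action", "this", "morning"}, "Personal"),
]

# Count emails per class and, within each class, emails containing each word.
class_counts = Counter(label for _, label in labeled_emails)
word_counts = {label: Counter() for label in class_counts}
for words, label in labeled_emails:
    word_counts[label].update(words)

# P(class): fraction of training emails carrying each label.
p_class = {c: n / len(labeled_emails) for c, n in class_counts.items()}

# P(word | class): fraction of emails of that class that contain the word.
p_word_given_class = {
    c: {w: n / class_counts[c] for w, n in word_counts[c].items()}
    for c in class_counts
}

print(p_class)             # {'WIP': 0.667, 'Personal': 0.333} (approximately)
print(p_word_given_class)
```

The labels supplied with the training emails are what make this supervised: every estimated probability is conditioned on a class that a human (or some other labeling process) has already assigned.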
Simple naive Bayes classification
Naive Bayes classifiers are often used for text classification tasks. A popular application of naive Bayes, for instance, is classifying emails into spam (i.e., emails that you don’t wish to read) and ham (i.e., emails that you do wish to read).
Naive Bayes classifiers assume conditional independence amongst the features (or input variables) of the data. This makes calculations much easier. In the case of simple text classification, this translates to assuming that the words in the text documents being analyzed are independent of each other within each class.
The conditional independence assumption, however, rarely holds true in the real world. Yet, naive Bayes classifiers perform surprisingly well in practice. To understand why this is the case, here’s a common-sense explanatory article.
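As a rough sketch of what the independence assumption buys us computationally, the joint likelihood of several words appearing together in an email of a given class reduces to a simple product of per-word probabilities (the values below are placeholders; the worked example later also multiplies in 1 − P(word | class) for vocabulary words that are absent from the email):

```python
from math import prod

# Illustrative per-word conditional probabilities for one class (placeholder values).
p_word_given_class = {"chores": 0.45, "pending": 0.25, "action": 0.15}

# Under conditional independence, the likelihood of seeing these words together
# in an email of this class factorizes into a product of per-word probabilities.
words_in_email = ["chores", "pending", "action"]
likelihood = prod(p_word_given_class[w] for w in words_in_email)
print(likelihood)  # 0.45 * 0.25 * 0.15 ≈ 0.0169
```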
Naive Bayes supervised learning in action—A simple illustration
To illustrate how naive Bayes classifiers learn through supervision, let’s look at a simple example.
Naive Bayes text classification
We’ll consider a text classification task. Say we want to classify emails into one of two classes: Work-Related (WIP) or Personal.
Naive Bayes classifiers can learn as more and more (labeled) data is passed to them for training. To keep things simple, in our example we’ll assume that there are only 5 words in our vocabulary: “chores”, “pending”, “action”, “this” and “morning”.
Our goal will be to classify an email containing the single phrase “chores pending action” as either WIP or Personal.
This is an interesting task because this phrase could easily be interpreted as either work-related or personal—have you ever had chores to do either at work or at home, for instance?
In this example we’ll see how this phrase (or more specifically, the email containing this phrase) can be classified based on the supervised training of the classifier.
We’ll also see how this classification can change with updated data—as new training data is received, the classification decision can change to reflect the updated supervised learning of the classifier.
Training data—First batch
Let’s consider a first batch of 100 emails for training our classifier.
Since the learning is supervised, each email is labeled with a class, either WIP or Personal.
Let’s assume there are 40 WIP emails and 60 Personal emails in the first batch.
We’ll also assume we have the following frequency counts, i.e., the number of emails in each class (WIP or Personal) that contain each word from our vocabulary:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 18 | 15 | 33 |
| pending | 10 | 10 | 20 |
| action | 6 | 12 | 18 |
| this | 4 | 3 | 7 |
| morning | 2 | 20 | 22 |
| TOTAL | 40 | 60 | 100 |
Using Bayes theorem (see this article for an introductory explanation of Bayes theorem), we can find the probability that an email belongs to a certain class—given that it contains a certain word—as follows:
P(class | word) = [ P(word | class) * P(class) ] / P(word)
So, the probability that an email containing the phrase “chores pending action” (and nothing else) belongs to a certain class is:
P(class | chores pending action) = [ P(chores pending action | class) * P(class) ] / P(chores pending action)
This can be applied to each class, hence:
P(WIP | chores pending action) = [ P(chores pending action | WIP) * P(WIP) ] / P(chores pending action)
And,
P(Personal | chores pending action) = [ P(chores pending action | Personal) * P(Personal) ] / P(chores pending action)
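Note that the denominator, P(chores pending action), is the same in both expressions and can be obtained by summing P(chores pending action | class) * P(class) over the two classes. Here is a minimal sketch of that posterior calculation in code (the function name and input values are illustrative placeholders):

```python
# A minimal sketch of the two-class posterior calculation. The denominator
# P(chores pending action) is expanded over the classes:
# P(phrase) = P(phrase | WIP) * P(WIP) + P(phrase | Personal) * P(Personal).
def class_posteriors(likelihoods, priors):
    joint = {c: likelihoods[c] * priors[c] for c in priors}   # P(phrase | class) * P(class)
    evidence = sum(joint.values())                            # P(phrase)
    return {c: j / evidence for c, j in joint.items()}

# Placeholder inputs; the batch-1 calculation below produces similar numbers.
print(class_posteriors({"WIP": 0.014, "Personal": 0.005}, {"WIP": 0.40, "Personal": 0.60}))
# {'WIP': 0.65..., 'Personal': 0.34...}
```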
We can calculate these using the data from our first training batch, as follows (for a detailed step-through of how this process works, see this example):
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.45 (=18/40) | 0.25 (=15/60) |
| pending | 0.25 (=10/40) | 0.17 (=10/60) |
| action | 0.15 (=6/40) | 0.20 (=12/60) |
| this | 0.10 (=4/40) | 0.05 (=3/60) |
| morning | 0.05 (=2/40) | 0.33 (=20/60) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.014 (=0.45*0.25*0.15*[1-0.10]*[1-0.05]) |
| P(chores pending action \| Personal) | 0.005 (=0.25*0.17*0.20*[1-0.05]*[1-0.33]) |
| P(WIP) | 0.40 (=40/[40+60]) |
| P(Personal) | 0.60 (=60/[40+60]) |
| P(WIP \| chores pending action) | 0.65 (=[0.014*0.40]/[(0.014*0.40)+(0.005*0.60)]) |
| P(Personal \| chores pending action) | 0.35 (=[0.005*0.60]/[(0.014*0.40)+(0.005*0.60)]) |
Note the role that the conditional independence assumption plays—without it, calculating P(chores pending action | class) would be far more difficult than the simple multiplications shown above.
We can now classify our email as either WIP or Personal depending on which has the higher probability, given that it contains the phrase “chores pending action” and nothing else.
Based on our calculations:
P(WIP | chores pending action) = 0.65
And,
P(Personal | chores pending action) = 0.35
So, we can classify our email as WIP.
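For readers who prefer code, here is a sketch that reproduces the batch-1 calculation end to end from the frequency counts (variable and function names are illustrative; small differences from the tables are due to rounding of the displayed intermediate values):

```python
# Counts from the first training batch: number of emails of each class containing each word.
word_counts = {
    "WIP":      {"chores": 18, "pending": 10, "action": 6, "this": 4, "morning": 2},
    "Personal": {"chores": 15, "pending": 10, "action": 12, "this": 3, "morning": 20},
}
class_totals = {"WIP": 40, "Personal": 60}
vocab = ["chores", "pending", "action", "this", "morning"]
email_words = {"chores", "pending", "action"}

def posteriors(word_counts, class_totals, email_words):
    n_total = sum(class_totals.values())
    joint = {}
    for c, total in class_totals.items():
        likelihood = 1.0
        for w in vocab:
            p = word_counts[c][w] / total
            # Present words contribute P(word | class); absent words contribute 1 - P(word | class).
            likelihood *= p if w in email_words else (1 - p)
        joint[c] = likelihood * (total / n_total)   # multiply by the class prior P(class)
    evidence = sum(joint.values())                  # P(chores pending action)
    return {c: j / evidence for c, j in joint.items()}

print(posteriors(word_counts, class_totals, email_words))
# ≈ {'WIP': 0.65, 'Personal': 0.35}, matching the tables above up to rounding
```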
Training data—Second batch
Let’s consider adding another batch of 100 emails to our training data—batch 2. Assume there are 55 WIP emails and 45 Personal emails in this second batch.
Assume we have the following frequency counts for batch 2:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 10 | 11 | 21 |
| pending | 8 | 8 | 16 |
| action | 11 | 9 | 20 |
| this | 6 | 8 | 14 |
| morning | 20 | 9 | 29 |
| TOTAL | 55 | 45 | 100 |
If we add the frequency counts for batches 1 and 2 together, or in other words if we update our batch 1 frequency counts with the new data in batch 2, we’ll get the following frequency counts:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 28 | 26 | 54 |
| pending | 18 | 18 | 36 |
| action | 17 | 21 | 38 |
| this | 10 | 11 | 21 |
| morning | 22 | 29 | 51 |
| TOTAL | 95 | 105 | 200 |
Using the same process as before, we can now calculate updated probabilities for the classification of our email as follows (probabilities are shown rounded to two decimal places; the final posteriors are computed from the unrounded values):
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.29 (=28/95) | 0.25 (=26/105) |
| pending | 0.19 (=18/95) | 0.17 (=18/105) |
| action | 0.18 (=17/95) | 0.20 (=21/105) |
| this | 0.11 (=10/95) | 0.10 (=11/105) |
| morning | 0.23 (=22/95) | 0.28 (=29/105) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.007 (=0.29*0.19*0.18*[1-0.11]*[1-0.23]) |
| P(chores pending action \| Personal) | 0.006 (=0.25*0.17*0.20*[1-0.10]*[1-0.28]) |
| P(WIP) | 0.48 (=95/[95+105]) |
| P(Personal) | 0.53 (=105/[95+105]) |
| P(WIP \| chores pending action) | 0.53 (=[0.007*0.48]/[(0.007*0.48)+(0.006*0.53)]) |
| P(Personal \| chores pending action) | 0.47 (=[0.006*0.53]/[(0.007*0.48)+(0.006*0.53)]) |
Based on our updated calculations:
P(WIP | chores pending action) = 0.53
And,
P(Personal | chores pending action) = 0.47
So, we still classify our email as WIP.
Notice, however, that the probabilities for each class have changed—this is the process of learning using labeled data, i.e., supervised learning.
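A sketch of the same update in code: the batch-2 counts are simply added to the batch-1 counts, and the posteriors are recomputed with exactly the same procedure (counts are taken from the tables above; variable names are illustrative):

```python
batch1 = {"WIP": {"chores": 18, "pending": 10, "action": 6, "this": 4, "morning": 2},
          "Personal": {"chores": 15, "pending": 10, "action": 12, "this": 3, "morning": 20}}
batch2 = {"WIP": {"chores": 10, "pending": 8, "action": 11, "this": 6, "morning": 20},
          "Personal": {"chores": 11, "pending": 8, "action": 9, "this": 8, "morning": 9}}
class_totals = {"WIP": 40 + 55, "Personal": 60 + 45}

# Updated word counts: element-wise addition of the two batches.
word_counts = {c: {w: batch1[c][w] + batch2[c][w] for w in batch1[c]} for c in batch1}

email_words = {"chores", "pending", "action"}
n_total = sum(class_totals.values())
joint = {}
for c, total in class_totals.items():
    likelihood = 1.0
    for w, n in word_counts[c].items():
        p = n / total
        likelihood *= p if w in email_words else (1 - p)
    joint[c] = likelihood * (total / n_total)
posteriors = {c: j / sum(joint.values()) for c, j in joint.items()}
print(posteriors)  # ≈ {'WIP': 0.53, 'Personal': 0.47}, matching the updated table up to rounding
```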
Training data—Third batch
Let’s now add one more batch of 100 emails to our training data—batch 3. Assume there are 45 WIP emails and 55 Personal emails in this third batch.
Assume we have the following frequency counts for batch 3:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 14 | 15 | 29 |
| pending | 9 | 11 | 20 |
| action | 8 | 14 | 22 |
| this | 8 | 6 | 14 |
| morning | 6 | 9 | 15 |
| TOTAL | 45 | 55 | 100 |
As before, we’ll add the frequency counts for batch 3 to those of batches 1 and 2, i.e., we’ll update our frequency counts with the new data in batch 3.
Once we do this, we’ll get the following updated frequency counts:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 42 | 41 | 83 |
| pending | 27 | 29 | 56 |
| action | 25 | 35 | 60 |
| this | 18 | 17 | 35 |
| morning | 28 | 38 | 66 |
| TOTAL | 140 | 160 | 300 |
Again, using the same process as before, we can calculate updated probabilities for the classification of our email:
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.30 (=42/140) | 0.26 (=41/160) |
| pending | 0.19 (=27/140) | 0.18 (=29/160) |
| action | 0.18 (=25/140) | 0.22 (=35/160) |
| this | 0.13 (=18/140) | 0.11 (=17/160) |
| morning | 0.20 (=28/140) | 0.24 (=38/160) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.007 (=0.30*0.19*0.18*[1-0.13]*[1-0.20]) |
| P(chores pending action \| Personal) | 0.007 (=0.26*0.18*0.22*[1-0.11]*[1-0.24]) |
| P(WIP) | 0.47 (=140/[140+160]) |
| P(Personal) | 0.53 (=160/[140+160]) |
| P(WIP \| chores pending action) | 0.48 (=[0.007*0.47]/[(0.007*0.47)+(0.007*0.53)]) |
| P(Personal \| chores pending action) | 0.52 (=[0.007*0.53]/[(0.007*0.47)+(0.007*0.53)]) |
Based on our updated calculations:
P(WIP | chores pending action) = 0.48
And,
P(Personal | chores pending action) = 0.52
We now see that the classification for our email changes—from WIP to Personal—based on the updated supervised learning of our classifier.
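The batches above were combined by hand, but library implementations can perform the same kind of incremental, supervised updating. Below is a minimal sketch using scikit-learn’s BernoulliNB, which supports partial_fit for adding labeled batches over time. The tiny document vectors are made up purely for illustration (they are not the counts from the tables above), and BernoulliNB applies Laplace smoothing by default, so its probabilities will not exactly match the hand calculations:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

vocab = ["chores", "pending", "action", "this", "morning"]

# A few made-up binary document vectors (1 = word present), purely for illustration.
X_batch1 = np.array([
    [1, 1, 0, 0, 0],   # "chores pending"
    [0, 1, 1, 0, 0],   # "pending action"
    [1, 0, 0, 1, 1],   # "chores this morning"
    [0, 0, 0, 1, 1],   # "this morning"
])
y_batch1 = np.array(["WIP", "WIP", "Personal", "Personal"])

X_batch2 = np.array([
    [0, 0, 1, 1, 1],   # "action this morning"
    [1, 1, 1, 0, 0],   # "chores pending action"
])
y_batch2 = np.array(["Personal", "WIP"])

clf = BernoulliNB()  # Laplace smoothing (alpha=1.0) by default
clf.partial_fit(X_batch1, y_batch1, classes=np.array(["Personal", "WIP"]))
clf.partial_fit(X_batch2, y_batch2)  # incremental update with new labeled data

query = np.array([[1, 1, 1, 0, 0]])  # "chores pending action"
print(dict(zip(clf.classes_, clf.predict_proba(query)[0])))
```

As with the hand-worked example, each call to partial_fit updates the classifier’s probability estimates from newly labeled data, so its classification decisions can change as more supervision arrives.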
Labeled training data for supervised learning
This example shows how supervised learning works for simple naive Bayes classification.
The important thing to note is that the training data is labeled with the classes that each data point belongs to. This labeling guides—or supervises—the training of the classifier, since the classifier will learn to classify new data based on these labels.
In summary
- Naive Bayes classification is a form of supervised learning—it uses training data that is pre-labeled with the available classifications
- Unsupervised learning, in contrast to naive Bayes classification, does not use pre-labeled data and has no pre-determined knowledge about how to classify the data
- Naive Bayes classification assumes conditional independence between input variables—this makes calculations much simpler
- Through supervised learning, simple naive Bayes classifiers learn the probabilities that underpin classification decisions (or the parameters of assumed generative distributions in more complex versions of naive Bayes classification)
- As more pre-labeled training data becomes available, naive Bayes classifiers can update their probability estimates through continued supervised learning