Is Naive Bayes Classification Supervised Or Unsupervised?
What kind of learning is naive Bayes classification—and why?
Naive Bayes classification is a form of supervised learning. It is supervised because naive Bayes classifiers are trained on labeled data, i.e., data that has been pre-categorized into the classes available for classification.
This contrasts with unsupervised learning, where there is no pre-labeled data available. This type of learning seeks to find natural structures that may be present in a data set, with no pre-determined knowledge about how to classify the data.
Training a simple naive Bayes classifier involves learning the probabilities that underpin the classification tasks the classifier performs. Using pre-labeled data, these probabilities can be estimated from frequency counts of the data points, broken down by the classes they fall into (i.e., by their pre-labeled classes).
More complex Bayes classification models assume generative distributions for the data being classified. Their training may involve learning the parameters of the assumed distributions rather than raw frequency counts.
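To make the frequency-count idea concrete, here is a minimal sketch in Python, using a tiny made-up labeled corpus (the emails, labels and variable names are illustrative assumptions, not part of the worked example that follows):

```python
from collections import Counter

# A minimal sketch of frequency-based training on a tiny, made-up labeled corpus.
labeled_emails = [
    ({"chores", "pending"}, "WIP"),
    ({"pending", "action"}, "WIP"),
    ({"action", "this", "morning"}, "Personal"),
]

# Count emails per class and, within each class, emails containing each word.
class_counts = Counter(label for _, label in labeled_emails)
word_counts = {label: Counter() for label in class_counts}
for words, label in labeled_emails:
    word_counts[label].update(words)

# P(class): fraction of training emails carrying each label.
p_class = {c: n / len(labeled_emails) for c, n in class_counts.items()}

# P(word | class): fraction of emails of that class that contain the word.
p_word_given_class = {
    c: {w: n / class_counts[c] for w, n in word_counts[c].items()}
    for c in class_counts
}

print(p_class)             # {'WIP': 0.667, 'Personal': 0.333} (approximately)
print(p_word_given_class)
```

The labels supplied with the training emails are what make this supervised: every estimated probability is conditioned on a class that a human (or some other labeling process) has already assigned.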
Simple naive Bayes classification
Naive Bayes classifiers are often used for text classification tasks. A popular application of naive Bayes, for instance, is classifying emails into spam (i.e., emails that you don’t wish to read) and ham (i.e., emails that you do wish to read).
Naive Bayes classifiers assume conditional independence amongst the features (or input variables) of the data. This makes calculations much easier. In the case of simple text classification, this translates to assuming that the words in the text documents being analyzed are independent of each other within each class.
The conditional independence assumption, however, rarely holds true in the real world. Yet, naive Bayes classifiers perform surprisingly well in practice. To understand why this is the case, here’s a common-sense explanatory article.
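As a rough sketch of what the independence assumption buys us computationally, the joint likelihood of several words appearing together in an email of a given class reduces to a simple product of per-word probabilities (the values below are placeholders; the worked example later also multiplies in 1 − P(word | class) for vocabulary words that are absent from the email):

```python
from math import prod

# Illustrative per-word conditional probabilities for one class (placeholder values).
p_word_given_class = {"chores": 0.45, "pending": 0.25, "action": 0.15}

# Under conditional independence, the likelihood of seeing these words together
# in an email of this class factorizes into a product of per-word probabilities.
words_in_email = ["chores", "pending", "action"]
likelihood = prod(p_word_given_class[w] for w in words_in_email)
print(likelihood)  # 0.45 * 0.25 * 0.15 ≈ 0.0169
```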
Naive Bayes supervised learning in action—A simple illustration
To illustrate how naive Bayes classifiers learn through supervision, let’s look at a simple example.
Naive Bayes text classification
We’ll consider a text classification task. Say we want to classify emails into one of two classes: Work-Related (WIP) or Personal.
Naive Bayes classifiers can learn as more and more (labeled) data is passed to them for training. To keep things simple, in our example we’ll assume that there are only 5 words in our vocabulary: “chores”, “pending”, “action”, “this” and “morning”.
Our goal will be to classify an email containing the single phrase “chores pending action” as either WIP or Personal.
This is an interesting task because this phrase could easily be interpreted as either work-related or personal—have you ever had chores to do either at work or at home, for instance?
In this example we’ll see how this phrase (or more specifically, the email containing this phrase) can be classified based on the supervised training of the classifier.
We’ll also see how this classification can change with updated data—as new training data is received, the classification decision can change to reflect the updated supervised learning of the classifier.
Training data—First batch
Let’s consider a first batch of 100 emails for training our classifier.
Since the learning is supervised, each email is labeled with a class, either WIP or Personal.
Let’s assume there are 40 WIP emails and 60 Personal emails in the first batch.
We’ll also assume we have the following frequency counts, i.e., the number of emails in each class (WIP or Personal) that contain each word from our vocabulary:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 18 | 15 | 33 |
| pending | 10 | 10 | 20 |
| action | 6 | 12 | 18 |
| this | 4 | 3 | 7 |
| morning | 2 | 20 | 22 |
| TOTAL | 40 | 60 | 100 |
Using Bayes theorem (see this article for an introductory explanation of Bayes theorem), we can find the probability that an email belongs to a certain class—given that it contains a certain word—as follows:
P(class | word) = [ P(word | class) * P(class) ] / P(word)
So, the probability that an email containing the phrase “chores pending action” (and nothing else) belongs to a certain class is:
P(class | chores pending action) = [ P(chores pending action | class) * P(class) ] / P(chores pending action)
This can be applied to each class, hence:
P(WIP | chores pending action) = [ P(chores pending action | WIP) * P(WIP) ] / P(chores pending action)
And,
P(Personal | chores pending action) = [ P(chores pending action | Personal) * P(Personal) ] / P(chores pending action)
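Note that the denominator, P(chores pending action), is the same in both expressions and can be obtained by summing P(chores pending action | class) * P(class) over the two classes. Here is a minimal sketch of that posterior calculation in code (the function name and input values are illustrative placeholders):

```python
# A minimal sketch of the two-class posterior calculation. The denominator
# P(chores pending action) is expanded over the classes:
# P(phrase) = P(phrase | WIP) * P(WIP) + P(phrase | Personal) * P(Personal).
def class_posteriors(likelihoods, priors):
    joint = {c: likelihoods[c] * priors[c] for c in priors}   # P(phrase | class) * P(class)
    evidence = sum(joint.values())                            # P(phrase)
    return {c: j / evidence for c, j in joint.items()}

# Placeholder inputs; the batch-1 calculation below produces similar numbers.
print(class_posteriors({"WIP": 0.014, "Personal": 0.005}, {"WIP": 0.40, "Personal": 0.60}))
# {'WIP': 0.65..., 'Personal': 0.34...}
```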
We can calculate these using the data from our first training batch, as follows (for a detailed step-through of how this process works, see this example):
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.45 (=18/40) | 0.25 (=15/60) |
| pending | 0.25 (=10/40) | 0.17 (=10/60) |
| action | 0.15 (=6/40) | 0.20 (=12/60) |
| this | 0.10 (=4/40) | 0.05 (=3/60) |
| morning | 0.05 (=2/40) | 0.33 (=20/60) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.014 (=0.45*0.25*0.15*[1-0.10]*[1-0.05]) |
| P(chores pending action \| Personal) | 0.005 (=0.25*0.17*0.20*[1-0.05]*[1-0.33]) |
| P(WIP) | 0.40 (=40/[40+60]) |
| P(Personal) | 0.60 (=60/[40+60]) |
| P(WIP \| chores pending action) | 0.65 (=[0.014*0.40]/[(0.014*0.40)+(0.005*0.60)]) |
| P(Personal \| chores pending action) | 0.35 (=[0.005*0.60]/[(0.014*0.40)+(0.005*0.60)]) |
Note the role that the conditional independence assumption plays—without it, calculating P(chores pending action | class) would be far more difficult than the simple multiplications shown above.
We can now classify our email as either WIP or Personal depending on which has the higher probability, given that it contains the phrase “chores pending action” and nothing else.
Based on our calculations:
P(WIP | chores pending action) = 0.65
And,
P(Personal | chores pending action) = 0.35
So, we can classify our email as WIP.
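For readers who prefer code, here is a sketch that reproduces the batch-1 calculation end to end from the frequency counts (variable and function names are illustrative; small differences from the tables are due to rounding of the displayed intermediate values):

```python
# Counts from the first training batch: number of emails of each class containing each word.
word_counts = {
    "WIP":      {"chores": 18, "pending": 10, "action": 6, "this": 4, "morning": 2},
    "Personal": {"chores": 15, "pending": 10, "action": 12, "this": 3, "morning": 20},
}
class_totals = {"WIP": 40, "Personal": 60}
vocab = ["chores", "pending", "action", "this", "morning"]
email_words = {"chores", "pending", "action"}

def posteriors(word_counts, class_totals, email_words):
    n_total = sum(class_totals.values())
    joint = {}
    for c, total in class_totals.items():
        likelihood = 1.0
        for w in vocab:
            p = word_counts[c][w] / total
            # Present words contribute P(word | class); absent words contribute 1 - P(word | class).
            likelihood *= p if w in email_words else (1 - p)
        joint[c] = likelihood * (total / n_total)   # multiply by the class prior P(class)
    evidence = sum(joint.values())                  # P(chores pending action)
    return {c: j / evidence for c, j in joint.items()}

print(posteriors(word_counts, class_totals, email_words))
# ≈ {'WIP': 0.65, 'Personal': 0.35}, matching the tables above up to rounding
```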
Training data—Second batch
Let’s consider adding another batch of 100 emails to our training data—batch 2. Assume there are 55 WIP emails and 45 Personal emails in this second batch.
Assume we have the following frequency counts for batch 2:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 10 | 11 | 21 |
| pending | 8 | 8 | 16 |
| action | 11 | 9 | 20 |
| this | 6 | 8 | 14 |
| morning | 20 | 9 | 29 |
| TOTAL | 55 | 45 | 100 |
If we add the frequency counts for batches 1 and 2 together, or in other words if we update our batch 1 frequency counts with the new data in batch 2, we’ll get the following frequency counts:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 28 | 26 | 54 |
| pending | 18 | 18 | 36 |
| action | 17 | 21 | 38 |
| this | 10 | 11 | 21 |
| morning | 22 | 29 | 51 |
| TOTAL | 95 | 105 | 200 |
Using the same process as before, we can now calculate updated probabilities for the classification of our email as follows (probabilities are shown rounded to two decimal places; the final posteriors are computed from the unrounded values):
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.29 (=28/95) | 0.25 (=26/105) |
| pending | 0.19 (=18/95) | 0.17 (=18/105) |
| action | 0.18 (=17/95) | 0.20 (=21/105) |
| this | 0.11 (=10/95) | 0.10 (=11/105) |
| morning | 0.23 (=22/95) | 0.28 (=29/105) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.007 (=0.29*0.19*0.18*[1-0.11]*[1-0.23]) |
| P(chores pending action \| Personal) | 0.006 (=0.25*0.17*0.20*[1-0.10]*[1-0.28]) |
| P(WIP) | 0.48 (=95/[95+105]) |
| P(Personal) | 0.53 (=105/[95+105]) |
| P(WIP \| chores pending action) | 0.53 (=[0.007*0.48]/[(0.007*0.48)+(0.006*0.53)]) |
| P(Personal \| chores pending action) | 0.47 (=[0.006*0.53]/[(0.007*0.48)+(0.006*0.53)]) |
Based on our updated calculations:
P(WIP | chores pending action) = 0.53
And,
P(Personal | chores pending action) = 0.47
So, we still classify our email as WIP.
Notice, however, that the probabilities for each class have changed—this is the process of learning using labeled data, i.e., supervised learning.
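A sketch of the same update in code: the batch-2 counts are simply added to the batch-1 counts, and the posteriors are recomputed with exactly the same procedure (counts are taken from the tables above; variable names are illustrative):

```python
batch1 = {"WIP": {"chores": 18, "pending": 10, "action": 6, "this": 4, "morning": 2},
          "Personal": {"chores": 15, "pending": 10, "action": 12, "this": 3, "morning": 20}}
batch2 = {"WIP": {"chores": 10, "pending": 8, "action": 11, "this": 6, "morning": 20},
          "Personal": {"chores": 11, "pending": 8, "action": 9, "this": 8, "morning": 9}}
class_totals = {"WIP": 40 + 55, "Personal": 60 + 45}

# Updated word counts: element-wise addition of the two batches.
word_counts = {c: {w: batch1[c][w] + batch2[c][w] for w in batch1[c]} for c in batch1}

email_words = {"chores", "pending", "action"}
n_total = sum(class_totals.values())
joint = {}
for c, total in class_totals.items():
    likelihood = 1.0
    for w, n in word_counts[c].items():
        p = n / total
        likelihood *= p if w in email_words else (1 - p)
    joint[c] = likelihood * (total / n_total)
posteriors = {c: j / sum(joint.values()) for c, j in joint.items()}
print(posteriors)  # ≈ {'WIP': 0.53, 'Personal': 0.47}, matching the updated table up to rounding
```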
Training data—Third batch
Let’s now add one more batch of 100 emails to our training data—batch 3. Assume there are 45 WIP emails and 55 Personal emails in this third batch.
Assume we have the following frequency counts for batch 3:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 14 | 15 | 29 |
| pending | 9 | 11 | 20 |
| action | 8 | 14 | 22 |
| this | 8 | 6 | 14 |
| morning | 6 | 9 | 15 |
| TOTAL | 45 | 55 | 100 |
As before, we’ll add the frequency counts for batch 3 to those of batches 1 and 2, i.e., we’ll update our frequency counts with the new data in batch 3.
Once we do this, we’ll get the following updated frequency counts:
| Word | No. of WIP emails in which the word appears | No. of Personal emails in which the word appears | TOTAL |
|---|---|---|---|
| chores | 42 | 41 | 83 |
| pending | 27 | 29 | 56 |
| action | 25 | 35 | 60 |
| this | 18 | 17 | 35 |
| morning | 28 | 38 | 66 |
| TOTAL | 140 | 160 | 300 |
Again, using the same process as before, we can calculate updated probabilities for the classification of our email:
| Word | P(word \| WIP) | P(word \| Personal) |
|---|---|---|
| chores | 0.30 (=42/140) | 0.26 (=41/160) |
| pending | 0.19 (=27/140) | 0.18 (=29/160) |
| action | 0.18 (=25/140) | 0.22 (=35/160) |
| this | 0.13 (=18/140) | 0.11 (=17/160) |
| morning | 0.20 (=28/140) | 0.24 (=38/160) |
| Probability | Value (calculation) |
|---|---|
| P(chores pending action \| WIP) | 0.007 (=0.30*0.19*0.18*[1-0.13]*[1-0.20]) |
| P(chores pending action \| Personal) | 0.007 (=0.26*0.18*0.22*[1-0.11]*[1-0.24]) |
| P(WIP) | 0.47 (=140/[140+160]) |
| P(Personal) | 0.53 (=160/[140+160]) |
| P(WIP \| chores pending action) | 0.48 (=[0.007*0.47]/[(0.007*0.47)+(0.007*0.53)]) |
| P(Personal \| chores pending action) | 0.52 (=[0.007*0.53]/[(0.007*0.47)+(0.007*0.53)]) |
Based on our updated calculations:
P(WIP | chores pending action) = 0.48
And,
P(Personal | chores pending action) = 0.52
We now see that the classification for our email changes—from WIP to Personal—based on the updated supervised learning of our classifier.
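The batches above were combined by hand, but library implementations can perform the same kind of incremental, supervised updating. Below is a minimal sketch using scikit-learn’s BernoulliNB, which supports partial_fit for adding labeled batches over time. The tiny document vectors are made up purely for illustration (they are not the counts from the tables above), and BernoulliNB applies Laplace smoothing by default, so its probabilities will not exactly match the hand calculations:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

vocab = ["chores", "pending", "action", "this", "morning"]

# A few made-up binary document vectors (1 = word present), purely for illustration.
X_batch1 = np.array([
    [1, 1, 0, 0, 0],   # "chores pending"
    [0, 1, 1, 0, 0],   # "pending action"
    [1, 0, 0, 1, 1],   # "chores this morning"
    [0, 0, 0, 1, 1],   # "this morning"
])
y_batch1 = np.array(["WIP", "WIP", "Personal", "Personal"])

X_batch2 = np.array([
    [0, 0, 1, 1, 1],   # "action this morning"
    [1, 1, 1, 0, 0],   # "chores pending action"
])
y_batch2 = np.array(["Personal", "WIP"])

clf = BernoulliNB()  # Laplace smoothing (alpha=1.0) by default
clf.partial_fit(X_batch1, y_batch1, classes=np.array(["Personal", "WIP"]))
clf.partial_fit(X_batch2, y_batch2)  # incremental update with new labeled data

query = np.array([[1, 1, 1, 0, 0]])  # "chores pending action"
print(dict(zip(clf.classes_, clf.predict_proba(query)[0])))
```

As with the hand-worked example, each call to partial_fit updates the classifier’s probability estimates from newly labeled data, so its classification decisions can change as more supervision arrives.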
Labeled training data for supervised learning
This example shows how supervised learning works for simple naive Bayes classification.
The important thing to note is that the training data is labeled with the classes that each data point belongs to. This labeling guides—or supervises—the training of the classifier, since the classifier will learn to classify new data based on these labels.
In summary
- Naive Bayes classification is a form of supervised learning—it uses training data that is pre-labeled with the available classifications
- Unsupervised learning, in contrast to naive Bayes classification, does not use pre-labeled data and has no pre-determined knowledge about how to classify the data
- Naive Bayes classification assumes conditional independence between input variables—this makes calculations much simpler
- Through supervised learning, simple naive Bayes classifiers learn the probabilities that underpin classification decisions (or the parameters of assumed generative distributions in more complex versions of naive Bayes classification)
- As more pre-labeled training data becomes available, naive Bayes classifiers can update their probability estimates through continued supervised learning