Is Topic Modeling Unsupervised?
A clear explanation of whether topic modeling is a form of supervised or unsupervised learning.
Topic modeling is a form of unsupervised learning.
It’s a branch of natural language processing that’s used for exploring unstructured data, typically text.
Topic modeling can be applied directly to the data being analyzed. It does not require labeled data or pre-training for its learning algorithm. This is why it is a form of unsupervised learning.
Being unsupervised, topic modeling is useful when annotated (labeled) data isn’t available. This is a major advantage of topic modeling, as most of the data that we encounter isn’t labeled, and labeling is time-consuming and expensive to do.
What is unsupervised learning?
Unsupervised learning refers to learning directly from unlabeled data (which is often unstructured, such as raw text), without the need for additional guidance (labels) or pre-training.
Topic modeling is one of many examples of unsupervised learning. Other examples include clustering, anomaly detection, dimensionality reduction and association rule learning.
In contrast to unsupervised learning, supervised learning requires labeled data for training the learning algorithm. Typical examples of supervised learning include classification and regression.
How do topic models work (without supervision)?
To understand how topic modeling learns without supervision, let’s look at an example.
Consider applying topic modeling to earnings call transcripts.
Earnings calls are hosted each quarter by US-listed corporations and they’re an important feature of the US financial calendar.
Financial analysts are keen to make sense of earnings call transcripts in a timely manner—it can give them an edge in their investment activities.
Earnings call transcripts generate a large volume of text, however, and in their raw form they have no annotation or labeling. Supervised machine learning approaches can’t be used to analyze them unless someone labels the transcripts first.
But topic modeling can be used directly on the transcripts, since it’s unsupervised, and it’s a useful way to automate the analysis of earnings calls.
Assuming a Latent Dirichlet Allocation (LDA) approach—a popular topic modeling algorithm—we can analyze a set of earnings call transcripts as follows:
- Collect the set of transcripts being analyzed (let’s call this the ‘corpus’) and apply text pre-processing such as cleaning (removing special characters, extra whitespace, stop-words and punctuation), lemmatizing, and selecting the parts of speech that you wish to retain.
- Select the number of topics you wish to end up with—this is a requirement of LDA—let’s call this K.
- Run the LDA algorithm—it works through an iterative process (see below).
- Interpret the generated topics—the algorithm will provide a set of K topics that’s entirely generated from the corpus you provided.
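The cleaning part of the first step can be sketched in a few lines of plain Python. This is a minimal illustration only: the `preprocess` function, the tiny `STOP_WORDS` set and the two sample sentences are all assumptions made for this example, and lemmatizing and part-of-speech selection are left out because they usually rely on an NLP library.

```python
import re

# A deliberately tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "our", "is", "for", "this"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stop-words."""
    # The regular expression discards punctuation, digits and special characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Very short tokens are dropped as well, since they rarely carry a topic.
    return [t for t in tokens if t not in stop_words and len(t) > 2]


# Made-up sentences standing in for earnings call transcripts.
transcripts = [
    "We grew revenue this quarter, and our margins improved!",
    "Hiring slowed in the quarter; headcount is flat.",
]
corpus = [preprocess(t) for t in transcripts]
print(corpus[0])  # ['grew', 'revenue', 'quarter', 'margins', 'improved']
```

Note that nothing here uses labels: the corpus is just a list of token lists, which is all the later steps need.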
The set of K topics that the LDA algorithm generates represents collections of words that go together.
But how does the algorithm know which words go together?
It uses a probabilistic generative approach: it assumes that each document is a mixture of topics and that each topic is a probability distribution over words, and it infers both from the way words occur together across the documents in the corpus.
It does this by considering, with each iteration, how the words in the corpus are distributed in each topic, and how the topics are distributed amongst the documents in the corpus.
Notice that all the algorithm requires is the corpus—it doesn’t need additional data or guidance (labels) to determine what to do—this is why it’s a form of unsupervised learning.
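To make the iterative process concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. Gibbs sampling is one common way to fit LDA, and it is an assumption of this example rather than necessarily the method any particular library uses. The function names, the toy corpus and the default values for `alpha`, `beta` and `iterations` are likewise illustrative.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]  # topic counts per document
    topic_word = [{word: 0 for word in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics  # number of tokens assigned to each topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                new = rng.choices(topics, weights=weights)[0]

                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n] for counts in topic_word]


# Toy, made-up corpus: the only input is the tokenized text itself.
docs = [
    ["revenue", "growth", "margin", "revenue"],
    ["margin", "revenue", "growth"],
    ["hiring", "staff", "headcount", "hiring"],
    ["staff", "headcount", "hiring"],
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(k, words)
```

The two count tables correspond to the two distributions described above: `topic_word` tracks how words are distributed in each topic, and `doc_topic` tracks how topics are distributed in each document. At no point does `fit_lda` receive a label.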
Once a topic model has been run, it can of course be applied to a new set of documents. So, the original corpus can be considered to have trained the topic model in this instance.
But the important thing to remember is that the topic model can be applied directly to a new set of documents with no additional guidance—hence it’s unsupervised.
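As a rough sketch of what applying a fitted model to a new document can look like, the example below estimates the topic mixture of an unseen document while holding the topic-word probabilities fixed. It is a simplified fold-in procedure chosen for illustration, not the exact inference a particular LDA library performs, and the `infer_topic_mix` function, the hand-written `topic_word_probs` and the `floor` value for unseen words are all assumptions.

```python
def infer_topic_mix(tokens, topic_word_probs, iterations=50, floor=1e-9):
    """Estimate a new document's topic proportions with topics held fixed."""
    num_topics = len(topic_word_probs)
    mix = [1.0 / num_topics] * num_topics  # start from an even mixture
    if not tokens:
        return mix

    for _ in range(iterations):
        totals = [0.0] * num_topics
        for word in tokens:
            # How plausible is each topic for this word, given the current mix?
            joint = [
                mix[t] * topic_word_probs[t].get(word, floor)
                for t in range(num_topics)
            ]
            norm = sum(joint)
            for t in range(num_topics):
                totals[t] += joint[t] / norm
        # The new mix is the average share of each topic across the tokens.
        mix = [total / len(tokens) for total in totals]
    return mix


# Hypothetical topic-word probabilities, as if learned from an earlier corpus.
topic_word_probs = [
    {"revenue": 0.5, "margin": 0.5},
    {"hiring": 0.5, "staff": 0.5},
]
new_doc = ["revenue", "margin", "revenue"]
mix = infer_topic_mix(new_doc, topic_word_probs)
print([round(p, 3) for p in mix])
```

The new document is passed in as plain tokens, with no label saying which topic it belongs to.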
Curious? To learn more about topic modeling:
- Here’s the full hands-on description (with Python code) of applying topic modeling to earnings call transcripts
- Here’s an intuitive explanation of how LDA topic modeling works—step by step and with no math
- Here’s an explanation of how to evaluate topic models—an important and sometimes overlooked aspect of topic modeling
In summary
- Topic modeling is a form of unsupervised learning
- It can be applied directly to a new set of text documents without pre-training or guidance (labels)
- Topic modeling works by inferring the relationships that exist—between words in topics and topics in documents—within the set of text documents being analyzed
- Topic modeling is useful in situations where pre-labeled data doesn’t exist, which is the case for most newly generated data, so it is a versatile way to analyze unstructured text data