Natural Language Processing

Is Latent Dirichlet Allocation (LDA) A Clustering Algorithm?

How LDA is different—and similar—to clustering algorithms.

Strictly speaking, Latent Dirichlet Allocation (LDA) is not a clustering algorithm. This is because clustering algorithms produce one grouping per item being clustered, whereas LDA produces a distribution of groupings over the items being clustered.

Consider k-means, for instance, a popular clustering algorithm. If we use k-means to cluster some items, say a set of documents, then the algorithm will produce one cluster group per document. That is, each document will be assigned to a single cluster.

If, instead, we were to apply LDA to a set of documents, we would end up with a probability distribution of groupings—or topics—for each document. In other words, each document will be assigned to distribution of topics, with each topic having a probability associated with it.

Since LDA produces more than one topic per document, it is not considered to be a true—or hard—clustering algorithm. It is, however, sometimes referred to as a soft clustering algorithm since it does produce topic groupings after all.

What is clustering?

Clustering is an approach for grouping a set of items based on some sort of similarity measure. The items within a group are considered to be more similar to each other than the items in other groups.

K-means is a popular clustering algorithm that groups data into k specific clusters. The number of clusters—k—needs to be specified in advance.

K-means clustering uses a distance calculation for determining the similarity between data items. Technically, k-means produces clusters that minimize the squared Euclidean distances between the items within clusters.

What is LDA?

Latent Dirichlet Allocation a form of unsupervised learning that’s used for discovering latent (hidden) groupings within data. The discovered groupings are often referred to as topics. It’s a versatile algorithm that’s widely used in data analysis, particularly in natural language processing and text analysis.

LDA uses Dirichlet distributions to help discover the topics in a data set.

There are two Dirichlet distributions used in the LDA algorithm. In the case of document analysis, there’s a distribution over the topics in each document and a distribution over the words in each topic.

The number of topics needs to be specified in advance when using LDA, just as the number of clusters (k) needs to be specified when using k-means.

The LDA topic and word distributions are based on frequency counts of the topics and words in the set of documents being analyzed.

How does LDA differ to (hard) clustering?

As mentioned, LDA assigns a distribution of topics over data rather than a single topic or grouping—this is the key difference between LDA and strict (or hard) clustering algorithms such as k-means.

Also, while clustering algorithms use some sort of distance measure to determine similarity within a grouping, LDA does not. In document analysis, LDA uses frequency counts of the words in the documents to determine similarity within topics.

LDA does, however, have some similarities with clustering algorithms such as k-means. Both require the number of topics (or groupings) to be specified in advance. Also, both are forms of unsupervised learning since there are no pre-assigned data groupings (labels) used for training the algorithms—they both discover groupings in unstructured data.

In summary

  • LDA is not a clustering algorithm in the strictest sense
  • Clustering algorithms typically produce one grouping per item being clustered whereas LDA produces a probability distribution of groupings—in document analysis, clustering algorithms will produce one grouping per document while LDA will produce a probability distribution of groupings (topics) per document
  • Clustering algorithms determine similarity between data items using some sort of distance measure (eg. Euclidean distance) whereas LDA uses frequency counts to determine similarity
  • Clustering algorithms and LDA both require the number of groupings to be specified in advance and both are forms of unsupervised learning since they do not require pre-grouped data examples for training—they discover groupings in unstructured data

Similar Posts