# Is Latent Dirichlet Allocation (LDA) A Clustering Algorithm?

*How LDA is different—and similar—to clustering algorithms.*

Strictly speaking, Latent Dirichlet Allocation (LDA) is **not** a clustering algorithm. This is because clustering algorithms produce *one grouping* per item being clustered, whereas LDA produces a *distribution of groupings* over the items being clustered.

Consider k-means, for instance, a popular clustering algorithm. If we use k-means to cluster some items, say a set of documents, then the algorithm will produce *one cluster group per document*. That is, each document will be assigned to a single cluster.

If, instead, we were to apply LDA to a set of documents, we would end up with a *probability distribution of groupings*—or *topics*—for each document. In other words, each document will be assigned to distribution of topics, with each topic having a probability associated with it.

Since LDA produces *more than one topic *per document, it is not considered to be a true—or *hard*—clustering algorithm. It is, however, sometimes referred to as a **soft** clustering algorithm since it does produce topic groupings after all.

**What is clustering?**

Clustering is an approach for grouping a set of items based on some sort of **similarity measure**. The items *within* a group are considered to be more *similar* to each other than the items in other groups.

K-means is a popular clustering algorithm that groups data into *k* specific clusters. The number of clusters—*k*—needs to be specified in advance.

K-means clustering uses a *distance calculation* for determining the similarity between data items. Technically, k-means produces clusters that **minimize** the *squared Euclidean distances* between the items within clusters.

**What is LDA?**

Latent Dirichlet Allocation a form of *unsupervised learning* that’s used for discovering **latent** (hidden) groupings within data. The discovered groupings are often referred to as **topics**. It’s a versatile algorithm that’s widely used in data analysis, particularly in natural language processing and text analysis.

LDA uses *Dirichlet distributions* to help discover the topics in a data set.

There are two Dirichlet distributions used in the LDA algorithm. In the case of document analysis, there’s a distribution over the **topics in each document** and a distribution over the **words in each topic**.

The number of topics needs to be specified in advance when using LDA, just as the number of clusters (k) needs to be specified when using k-means.

The LDA topic and word distributions are based on **frequency counts** of the topics and words in the set of documents being analyzed.

**How does LDA differ to (hard) clustering?**

As mentioned, LDA assigns a **distribution of topics** over data rather than a single topic or grouping—this is the *key difference* between LDA and strict (or *hard*) clustering algorithms such as k-means.

Also, while clustering algorithms use some sort of *distance measure* to determine similarity within a grouping, LDA does not. In document analysis, LDA uses *frequency counts* of the words in the documents to determine *similarity* within topics.

LDA does, however, have some similarities with clustering algorithms such as k-means. Both require the *number of topics* (or groupings) to be *specified in advance*. Also, both are forms of *unsupervised learning* since there are no pre-assigned data groupings (labels) used for training the algorithms—they both *discover groupings* in unstructured data.

**In summary**

- LDA is
**not**a clustering algorithm in the strictest sense - Clustering algorithms typically produce
**one grouping**per item being clustered whereas LDA produces a**probability distribution of groupings**—in document analysis, clustering algorithms will produce*one grouping per document*while LDA will produce a*probability distribution of groupings (topics) per document* - Clustering algorithms determine
**similarity**between data items using some sort of**distance measure**(eg.*Euclidean*distance) whereas LDA uses**frequency counts**to determine similarity - Clustering algorithms and LDA both require the
**number of groupings to be specified in advance**and both are forms of**unsupervised learning**since they do not require pre-grouped data examples for training—they**discover groupings**in*unstructured*data