Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison

By Nicolo Cosimo Albanese, September 2022
A comparison between different topic modeling strategies including practical Python examples

Image by author.
  1. Introduction
  2. Topic Modeling Strategies
    2.1 Introduction
    2.2 Latent Semantic Analysis (LSA)
    2.3 Probabilistic Latent Semantic Analysis (pLSA)
    2.4 Latent Dirichlet Allocation (LDA)
    2.5 Non-negative Matrix Factorization (NMF)
    2.6 BERTopic and Top2Vec
  3. Comparison
  4. Additional remarks
    4.1 A topic is not (necessarily) what we think it is
    4.2 Topics are not easy to evaluate
  5. Conclusions
  6. References

1. Introduction

In Natural Language Processing (NLP), the term topic modeling encompasses a series of statistical and Deep Learning techniques to find hidden semantic structures in sets of documents.

Topic modeling is an unsupervised Machine Learning problem. Unsupervised means that the algorithm learns patterns in the absence of tags or labels.

Most of the information we generate and exchange as human beings has a textual nature. Documents, conversations, phone calls, messages, emails, notes, social media posts. The ability to automatically extract value from these sources in the absence of (or with limited) a priori knowledge is an everlasting and ubiquitous problem in Data Science.

In this post, we discuss popular approaches to topic modeling, from conventional algorithms to the most recent techniques based on Deep Learning. We aim to share a friendly introduction to these models and to compare their advantages and disadvantages in practical applications.

We also provide end-to-end Python examples for the predominant approaches.

Finally, we share some considerations on two of the most challenging aspects of unsupervised textual data analysis: the discrepancy between the human definition of “topic” and its statistical counterpart, and the difficulties associated with a quantitative assessment of topic model performance.

2. Topic Modeling Strategies

2.1 Introduction

Latent Semantic Analysis (LSA) (Deerwester¹ et al., 1990), probabilistic Latent Semantic Analysis (pLSA) (Hofmann², 1999), Latent Dirichlet Allocation (LDA) (Blei³ et al., 2003) and Non-Negative Matrix Factorization (NMF) (Lee⁴ et al., 1999) are conventional and well-known approaches to topic modeling.

They represent a document as a bag-of-words and assume that each document is a mixture of latent topics.

They all start with the conversion of a textual corpus into a Document-Term Matrix (DTM), a table where each row is a document, and each column is a distinct word:

Document-Term Matrix from a sample set of documents. Image by author.

Note: implementations/research papers may also use/refer to the Term-Document Matrix (TDM), the transpose of the DTM.

Each cell <i, j> contains a count, i.e. how many times the word j appears in document i. A common alternative to the word count is the TF-IDF score. It considers both term frequency (TF) and inverse document frequency (IDF) to penalize the weight of terms that appear very often in the corpus and increase the weight of rarer terms:

TF-IDF. Image by author.
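To make this concrete, here is a minimal sketch (with a toy corpus invented for illustration) that builds both a count-based and a TF-IDF weighted DTM with scikit-learn:

# A minimal sketch of building a Document-Term Matrix with raw counts
# and with TF-IDF weights. The toy corpus is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Raw counts: cell <i, j> holds how many times word j appears in document i.
count_vectorizer = CountVectorizer()
count_dtm = count_vectorizer.fit_transform(docs)
print(count_vectorizer.get_feature_names_out())
print(count_dtm.toarray())

# TF-IDF: down-weights terms frequent across the corpus, up-weights rarer ones.
tfidf_dtm = TfidfVectorizer().fit_transform(docs)
print(tfidf_dtm.toarray().round(2))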

The basic principle behind the search of latent topics is the decomposition of the DTM into a document-topic and a topic-term matrix. The following methods differ in how they define and reach this goal.

2.2 Latent Semantic Analysis (LSA)

To decompose the DTM and extract topics, Latent Semantic Analysis (LSA) applies a matrix factorization technique called Singular Value Decomposition (SVD).

SVD decomposes the DTM into the product of three distinct matrices: DTM = U ∙ Σ ∙ Vᵗ, where

  • U and V are of size m x m and n x n respectively, where m is the number of documents and n the number of words in the corpus.
  • Σ is m x n and only its main diagonal is populated: it contains the singular values of the DTM.

LSA selects the t largest singular values of the DTM, with t <= min(m, n), and thus discards the last m - t and n - t columns of U and V, respectively. This procedure is known as truncated SVD. The resulting approximation of the DTM has rank t, as sketched in the image below.

The rank t approximation of the DTM is optimal in the sense that it is the closest rank t matrix to the DTM in terms of L₂ norm. The retained columns of U and V can be interpreted as document-topic and word-topic matrices, and t indicates the number of topics.

Truncated SVD on the Document-Term Matrix (DTM) to extract the latent variables (topics). Image by author.
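The following minimal sketch performs LSA with scikit-learn's TruncatedSVD; the toy corpus and the number of topics t are illustrative assumptions:

# A minimal sketch of LSA: truncated SVD on a TF-IDF weighted DTM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the bank approved the loan and the credit line",
    "interest rates affect mortgage and credit markets",
    "the team won the championship match",
    "the players and the coach celebrate the season",
]

vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

t = 2  # number of topics, to be chosen a priori
svd = TruncatedSVD(n_components=t, random_state=42)
doc_topic = svd.fit_transform(dtm)  # ~ U ∙ Σ: document-topic representation
topic_term = svd.components_        # ~ Vᵗ: topic-term matrix

# Topics are open to interpretation through the top words of the V matrix.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_term):
    top = weights.argsort()[::-1][:4]
    print(f"Topic {k}:", ", ".join(terms[i] for i in top))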

Pros:

  • Intuitive.
  • It can be applied to both short and long documents.
  • Topics are open to human interpretation through the V matrix.

Cons:

  • The DTM disregards the semantic representation of words in a corpus. Similar concepts are treated as different matrix elements. Pre-processing techniques may help, but only to some extent. For example, stemming may help in treating “Italy” and “Italian” as similar terms (as they should be), but close words with a different stem like “money” and “cash” would still be considered as different. Moreover, stemming may also lead to less interpretable topics.
  • LSA requires an extensive pre-processing phase to obtain a significant representation from the textual input data.
  • The number of singular values t (topics) to maintain in the truncated SVD must be known a priori.
  • U and V may contain negative values. This poses a problem for interpretability (more on that in Paragraph 2.5).

2.3 Probabilistic Latent Semantic Analysis (pLSA)

Hofmann² (1999) proposed a variation of LSA where the topics are estimated using a probabilistic model instead of SVD: hence the name, probabilistic Latent Semantic Analysis (pLSA).

In particular, pLSA models the joint probability P(d, w) of seeing a word w and a document d together as a mixture of conditionally independent multinomial distributions:

P(d, w) = P(d) ∙ Σ_z P(z|d) ∙ P(w|z) (from Hofmann², 1999)

where:

  • w indicates a word.
  • d indicates a document.
  • z indicates a topic.
  • P(z|d) is the probability of topic z being present in a document d.
  • P(w|z) is the probability of word w being present in a topic z.
  • We assume P(w|z, d) = P(w|z).

The previous expression can be re-written as:

P(d, w) = Σ_z P(z) ∙ P(d|z) ∙ P(w|z) (from Hofmann², 1999)

We can draw an analogy between this expression and the previous formulation of the DTM decomposition, where:

  • P(d, w) corresponds to the DTM.
  • P(z) is analogous to the main diagonal of Σ.
  • P(d|z) and P(w|z) correspond to U and V, respectively.

The model can be fit using the expectation-maximization algorithm (EM). In brief, EM performs maximum likelihood estimation in the presence of latent variables (in this case, the topics).
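As a rough illustration, the following numpy sketch fits the symmetric pLSA formulation P(d, w) = Σ_z P(z) ∙ P(d|z) ∙ P(w|z) with EM on a randomly generated toy DTM (the sizes, the number of topics and the iteration count are arbitrary assumptions):

# A minimal numpy sketch of pLSA fitted with EM, in the symmetric
# formulation P(d, w) = Σ_z P(z) ∙ P(d|z) ∙ P(w|z). The toy DTM, the
# number of topics and the iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(8, 12)).astype(float)  # toy DTM: 8 docs, 12 words
m, n = X.shape
K = 2  # number of topics, chosen a priori

# Random initialization of the multinomial parameters.
Pz = np.full(K, 1.0 / K)    # P(z)
Pd_z = rng.random((m, K))
Pd_z /= Pd_z.sum(axis=0)    # P(d|z): each column sums to 1
Pw_z = rng.random((n, K))
Pw_z /= Pw_z.sum(axis=0)    # P(w|z): each column sums to 1

for _ in range(100):
    # E-step: responsibilities P(z|d, w) for every cell of the DTM.
    joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]  # (m, n, K)
    post = joint / joint.sum(axis=2, keepdims=True)

    # M-step: re-estimate the parameters from the expected counts.
    weighted = X[:, :, None] * post   # expected counts per (d, w, z)
    Nz = weighted.sum(axis=(0, 1))    # expected counts per topic
    Pd_z = weighted.sum(axis=1) / Nz
    Pw_z = weighted.sum(axis=0) / Nz
    Pz = Nz / Nz.sum()

print("P(z):", Pz.round(3))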

Notably, the two models rely on different objective functions for the decomposition of the DTM. For LSA, it is the L₂ norm, while for pLSA it is the likelihood function. The latter aims at an explicit maximization of the predictive power of the model.

pLSA shares most of its advantages and drawbacks with the LSA model, with some notable differences:

Pros:

  • pLSA showed better performance than LSA (Hofmann², 1999).

Cons:

pLSA provides no probabilistic model at the level of documents. This implies that:

  • The number of parameters grows linearly with the number of documents, leading to problems with scalability and overfitting.
  • It cannot assign probabilities to new documents.

2.4 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) (Blei³ et al., 2003) improves on pLSA by using Dirichlet priors to estimate the document-topic and term-topic distributions in a Bayesian approach.

The Dirichlet distribution Dir(α) is a family of continuous multivariate probability distributions parameterized by a vector α of positive reals.

Let us imagine a newspaper with three sections: politics, sports and arts, each section also representing a topic. The hypothetical topic mixtures of the newspaper sections can be seen as samples from a Dirichlet distribution:

Section 1 (politics):

  • Topics mixture: politics 0.99, sports 0.005, arts 0.005.

Section 2 (sports):

  • Topics mixture: politics 0.005, sports 0.99, arts 0.005.

Section 3 (arts):

  • Topics mixture: politics 0.005, sports 0.005, arts 0.99.

Let us observe the plate notation (a conventional method to represent variables in a graphical model) of the LDA to explain the use of Dirichlet priors:

Plate notation of LDA. From Barbieri⁵ et al. (2013). Grey circles indicate observed variables (words in the corpus), while white circles denote latent variables.

M indicates the number of documents and N the number of words in a document. From the top, we observe α, the parameter of the Dirichlet prior on the per-document topic distributions. From the Dirichlet distribution Dir(α), we draw a random sample representing the topic distribution θ for a document. As if, in our newspaper example, we were taking a mixture (0.99 politics, 0.005 sports, 0.005 arts) describing the distribution of topics for an article.

From the selected mixture θ, we draw a topic z based on the distribution (in our example, politics). From the bottom, we observe β, the parameters of the Dirichlet prior on the per-topic word distribution. From the Dirichlet distribution Dir(𝛽), we choose a sample representing the word distribution φ given the topic z. And, from φ, we draw a word w.

In the end, we are interested in estimating the probability of a topic z given a document d and the parameters α and 𝛽, i.e. P(z|d, α, 𝛽). The problem is formulated as the calculation of the posterior distribution of the hidden variables given a document:

Image by author.

Since this distribution is intractable to compute, Blei³ et al. (2003) suggested the use of an approximate inference algorithm (variational approximation). The optimizing values are found by minimizing the Kullback-Leibler divergence between the approximate distribution and the true posterior P(θ, z|d, α, 𝛽). Once we have the optimal parameters for our data, we can again compute P(z|d, α, 𝛽), which, in a sense, corresponds to the document-topic matrix U. Each entry of 𝛽₁, 𝛽₂, ..., 𝛽ₜ is P(w|z), which corresponds to the term-topic matrix V. The main difference from LSA is that, as in pLSA, the matrix coefficients have a statistical interpretation.

Pros:

  • It provides better performance than LSA and pLSA.
  • Unlike pLSA, LDA can assign a probability to a new document thanks to the document-topic Dirichlet distribution.
  • It can be applied to both short and long documents.
  • Topics are open to human interpretation.
  • As a probabilistic model, LDA can be embedded in more complex models or extended. Studies that followed the original work by Blei³ et al. (2003) extended LDA and addressed some of its original limitations.

Cons:

  • The number of topics must be known beforehand.
  • The bag-of-words approach disregards the semantic representation of words in a corpus, similarly to LSA and pLSA.
  • The estimation of the Bayes parameters α and β relies on the assumption of exchangeability of the documents.
  • It requires an extensive pre-processing phase to obtain a significant representation from the textual input data.
  • Studies report LDA may yield too general (Rizvi⁶ et al., 2019) or irrelevant (Alnusyan⁷ et al., 2020) topics. Results may also be inconsistent across different executions (Egger⁸ et al., 2021).

Practical example with LDA

Popular LDA implementations are available in the Gensim and scikit-learn packages (Python), and in Mallet (Java).

In the following example, we use the Gensim library with pyLDAvis for a visual topic exploration.
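Below is a minimal sketch of this workflow; the toy corpus, the number of topics and the hyperparameters are illustrative assumptions:

# A minimal sketch of LDA with Gensim and pyLDAvis. The toy corpus,
# the number of topics and the hyperparameters are illustrative assumptions.
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# Pre-tokenized toy documents; real corpora require proper pre-processing
# (tokenization, stopwords removal, lemmatization, ...).
docs = [
    ["bank", "loan", "credit", "interest", "rate"],
    ["team", "match", "score", "league", "player"],
    ["election", "party", "vote", "government", "policy"],
    ["bank", "credit", "loan", "mortgage", "rate"],
    ["player", "team", "coach", "season", "match"],
]

# Map tokens to integer ids and build the bag-of-words corpus.
id2word = corpora.Dictionary(docs)
corpus = [id2word.doc2bow(doc) for doc in docs]

# Fit LDA; the number of topics must be chosen beforehand.
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=3,
                     random_state=42, passes=10)

# Most representative words per topic.
for topic_id, words in lda_model.print_topics():
    print(topic_id, words)

# Interactive exploration of the topics.
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
pyLDAvis.save_html(vis, "lda_topics.html")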

Interactive chart for topics exploration with pyLDAvis generated by the previous code snippet. Image by author.

2.5 Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF), introduced by Lee⁴ et al. (1999), is a variation of LSA.

LSA leverages SVD to decompose the Document-Term Matrix and extract latent information (the topics). A property of SVD is that the basis vectors are orthogonal to each other, forcing some elements in the bases to be negative.

In brief, factorizations with negative matrix coefficients (like SVD) pose a problem for interpretability: subtractive combinations do not allow us to understand how a component contributes to the whole. NMF decomposes the Document-Term Matrix into a document-topic matrix U and a topic-term matrix Vᵗ, much like SVD, but with the additional constraint that U and Vᵗ can only contain non-negative elements.

Moreover, while LSA exploited a decomposition of the form U ∙ Σ ∙ Vᵗ, in the case of non-negative matrix factorization this becomes simply U ∙ Vᵗ.

The decomposition of the DTM can be posed as an optimization problem that aims at minimizing the difference between the DTM and its approximation. Frequently adopted distance measures are the Frobenius norm and the Kullback-Leibler divergence.

NMF shares the main advantages and drawbacks of the other classical models (bag-of-words approach, need for pre-processing, …), but with some peculiar traits:

Pros:

  • The literature argues for the superiority of NMF over SVD (hence LSA) in producing more interpretable and coherent topics (Lee⁴ et al., 1999; Xu⁹ et al., 2003; Casalino¹⁰ et al., 2016).

Cons:

  • The non-negativity constraints make the decomposition more difficult and may lead to inaccurate topics.
  • NMF is a non-convex problem. Different U and Vᵗ may approximate the DTM, leading to potentially inconsistent outcomes for different runs.

Practical example with NMF
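Here is a minimal sketch of NMF-based topic modeling with scikit-learn; the toy corpus and the number of topics are illustrative assumptions:

# A minimal sketch of NMF-based topic modeling with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the bank approved the loan and the credit line",
    "the team won the match in the last minute",
    "the government announced a new economic policy",
    "interest rates affect mortgage and credit markets",
    "the coach praised the players after the season opener",
]

# Build the TF-IDF weighted Document-Term Matrix.
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Factorize DTM ≈ U ∙ Vᵗ with non-negative entries (Frobenius loss by
# default; beta_loss="kullback-leibler" with solver="mu" is the KL option).
nmf = NMF(n_components=2, init="nndsvd", random_state=42)
doc_topic = nmf.fit_transform(dtm)  # U: document-topic weights
topic_term = nmf.components_        # Vᵗ: topic-term weights

# Top words per topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_term):
    top = weights.argsort()[::-1][:5]
    print(f"Topic {k}:", ", ".join(terms[i] for i in top))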

2.6 BERTopic and Top2Vec

Grootendorst¹¹ (2022) and Angelov¹² (2020) proposed novel approaches to topic modeling, BERTopic and Top2Vec respectively. These models address the limitations of conventional strategies discussed so far. We explore them together in the following paragraphs.

2.6.1 Document embedding

BERTopic and Top2Vec manufacture semantic embeddings from input documents.

In the original papers, BERTopic leveraged BERT Sentence Transformers (SBERT) to manufacture high-quality, contextual word and sentence vector representations. Instead, Top2Vec used Doc2Vec to create jointly embedded word, document, and topic vectors.

At the time of this writing, both algorithms support a variety of embedding strategies, although BERTopic has a broader coverage of embedding models:

Embedding models currently supported by BERTopic and Top2Vec. The references inside the table provide more detailed information.

2.6.2 Dimensionality reduction with UMAP

One may apply a clustering algorithm to the embeddings directly, but this would increase computational expenses and lead to poor clustering performance (due to the “curse of dimensionality”).

Therefore, a dimensionality reduction technique is applied before clustering. UMAP (Uniform Manifold Approximation and Projection) (McInnes¹³ et al., 2018) provides several benefits (a minimal usage sketch follows the list below):

  • It preserves more of the local and global features of high-dimensional data in lower projected dimensions (McInnes¹³ et al., 2018).
  • UMAP has no computational restrictions on embedding dimensions (McInnes¹³ et al., 2018). Therefore, it can be used effectively with different document embedding strategies.
  • Reducing embedding dimensionality with UMAP improves the clustering performances of K-Means and HDBSCAN in terms of accuracy and time (Allaoui¹⁴ et al., 2020).
  • UMAP can easily scale to large datasets (Angelov¹², 2020).
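A minimal sketch of the reduction plus clustering pipeline, assuming simulated embeddings of an SBERT-like dimensionality in place of real document vectors:

# A minimal sketch of the reduction + clustering pipeline with umap-learn
# and hdbscan. The embeddings are simulated (SBERT-like dimensionality);
# real document embeddings would come from the previous step.
import numpy as np
import umap
import hdbscan

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # e.g., 384-dimensional vectors

# Project to 5 dimensions; these hyperparameters mirror common choices
# but are assumptions here.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")
reduced = reducer.fit_transform(embeddings)

# Density-based clustering: observations that fit no cluster get label -1.
# On random data most points end up as noise; real embeddings form clusters.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
print(np.unique(labels))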

2.6.3 Clustering

Both BERTopic and Top2Vec originally leveraged HDBSCAN (McInnes¹⁵ et al., 2017) as their clustering algorithm.

Pros:

  • HDBSCAN inherits the benefits of DBSCAN and improves on them (McInnes¹⁵ et al., 2017).
  • HDBSCAN (like DBSCAN) does not force observations into a cluster: it models unrelated observations as outliers. This improves topic representation and coherence.

Cons:

  • Modeling unrelated documents as outliers may result in information loss. Outliers may become a relevant portion of the original corpus in noisy datasets.

BERTopic currently also supports K-Means and agglomerative clustering algorithms, providing flexibility of choice. K-Means allows users to select the desired number of clusters and forces every document into a cluster. This avoids the generation of outliers, but may also result in poorer topic representation and coherence.

2.6.4 Topic representation

BERTopic and Top2Vec differ from each other in how they manufacture a representation for the topics.

BERTopic concatenates all documents within the same cluster (topic) and applies a modified TF-IDF. In brief, it replaces documents with clusters in the original TF-IDF formula. Then, it uses the most important words of each cluster as the topic representation.

This score is called class-based TF-IDF (c-TF-IDF), as it estimates the importance of words in clusters instead of documents.
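To illustrate the idea (not BERTopic's exact implementation, whose weighting differs in the details), one can concatenate each cluster into a single "class document" and apply a standard TF-IDF across classes:

# A conceptual sketch of class-based TF-IDF: concatenate each cluster into
# one "class document" and apply standard TF-IDF across classes. BERTopic's
# exact c-TF-IDF weighting differs in the details; the clusters are toy data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

clusters = {
    0: ["the bank approved the loan", "interest rates and credit markets"],
    1: ["the team won the match", "the players celebrate the season"],
}

# One concatenated document per cluster (class).
class_docs = [" ".join(docs) for docs in clusters.values()]

vectorizer = TfidfVectorizer(stop_words="english")
ctfidf = vectorizer.fit_transform(class_docs).toarray()
terms = vectorizer.get_feature_names_out()

# The most important words per cluster act as the topic representation.
for c, row in zip(clusters, ctfidf):
    top = np.argsort(row)[::-1][:3]
    print(f"Topic {c}:", ", ".join(terms[i] for i in top))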

Top2Vec, instead, manufactures a representation with the words closest to the cluster’s centroid. In particular, for each dense area obtained through HDBSCAN, it calculates the centroid of the document vectors in the original dimension, then selects the word vectors most proximal to it.

Pros of BERTopic and Top2Vec:

  • The number of topics is not necessarily given beforehand. Both BERTopic and Top2Vec support hierarchical topic reduction to optimize the number of topics.
  • High-quality embeddings take into account the semantic relationship between words in a corpus, unlike the bag-of-words approach. This leads to better and more informative topics.
  • Due to the semantic nature of embeddings, textual pre-processing (stemming, lemmatization, stopwords removal, …) is not needed in most cases.
  • BERTopic supports dynamic topic modeling.
  • Modularity. Each step (document embedding, dimensionality reduction, clustering) is virtually self-consistent and can change or evolve depending on the advancements in the field, the peculiarities of a specific project or technical constraints. One may, for example, use BERTopic with Doc2Vec embeddings instead of SBERT, or apply K-Means clustering instead of HDBSCAN.
  • They scale better with larger corpora compared with conventional approaches (Angelov¹², 2020).
  • Both BERTopic and Top2Vec provide advanced built-in search and visualization capabilities. They make it simpler to investigate the quality of the topics and drive further optimization, as well as to produce high-quality charts for presentations.

Cons of BERTopic and Top2Vec:

  • They work better on shorter texts, such as social media posts or news headlines. Most transformers-based embeddings have a limit on the number of tokens they can consider when they build a semantic representation. It is possible to use these algorithms with longer documents: one may, for example, split the documents into sentences or paragraphs before the embedding step. Nevertheless, this may not necessarily benefit the generation of meaningful and representative topics for longer documents.
  • Each document is assigned to one topic only. Traditional approaches like LDA, instead, were built on the assumption that each document contains a mixture of topics.
  • They are slower compared to conventional models (Grootendorst¹¹, 2022). Moreover, faster training and inference may require more expensive hardware accelerators (GPUs).
  • Although BERTopic leverages transformers-based large language models to manufacture document embeddings, the topic representation still uses a bag-of-words approach (c-TF-IDF).
  • They may be less effective for small datasets (<1000 docs) (Egger¹⁶ et al., 2022).

Practical example with BERTopic
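A minimal sketch follows; the dataset choice (the classic 20 Newsgroups corpus) is an illustrative assumption, and BERTopic falls back to a default SBERT model for the embeddings:

# A minimal sketch with BERTopic. The dataset (20 Newsgroups) is an
# illustrative assumption; by default BERTopic embeds documents with an
# SBERT model, then applies UMAP, HDBSCAN and c-TF-IDF under the hood.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)

# Inspect the topics and their c-TF-IDF representations.
print(topic_model.get_topic_info().head())

# Built-in interactive visualizations (Plotly figures).
topic_model.visualize_topics().write_html("intertopic_distance.html")
topic_model.visualize_documents(docs).write_html("document_projections.html")
topic_model.visualize_hierarchy().write_html("topic_dendrogram.html")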

Intertopic distance map generated by the previous code snippet. The visualization is similar to the one obtained by pyLDAvis. Image by author.
Documents projections (detail) generated by the previous code snippet. Image by author.
Topics hierarchical structure (dendrogram) generated by the previous code snippet. Image by author.

Practical example with Top2Vec
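A minimal sketch, again assuming the 20 Newsgroups corpus as an illustrative dataset:

# A minimal sketch with Top2Vec, assuming the 20 Newsgroups corpus.
# By default, Top2Vec trains Doc2Vec embeddings, then applies UMAP + HDBSCAN.
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

model = Top2Vec(documents=docs, speed="learn", workers=4)

# The five topics semantically closest to the query "faith".
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["faith"], num_topics=5
)

# One word cloud per retrieved topic.
for num in topic_nums:
    model.generate_topic_wordcloud(num)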

Word clouds of the closest five topics to the input query “faith” generated by the previous code snippet. Image by author.

3. Comparison

The following table summarizes the salient traits of the different topic modeling strategies considering practical application scenarios:

Comparison between different topic modeling techniques. Note: LSA and pLSA were not included, as LDA overcomes their limitations and is considered the best approach of the three. Table by author.

This summary table provides high-level selection criteria for a given use case.

Let’s share some examples.

Imagine the need to find trending topics in Tweets with little pre-processing effort. In this case, one may choose Top2Vec or BERTopic. They work splendidly on shorter textual sources and do not require much pre-processing.

Instead, imagine a scenario where a customer is interested in finding how a given document may contain a mixture of multiple topics. In this case, approaches like LDA and NMF would be preferable, as BERTopic and Top2Vec assign a document to one topic only. Although the cluster-membership probabilities produced by HDBSCAN may be used as a proxy for a topic distribution, BERTopic and Top2Vec are not mixed-membership models by design.

4. Additional remarks

When discussing topic modeling, there are two major points of attention worth mentioning at the end of our journey.

4.1 A topic is not (necessarily) what we think it is

When we come across a magazine in a waiting room, we know at a glance what genre it belongs to. When we enter a conversation, a few sentences are enough to let us guess the object of discussion. This is a “topic” from the perspective of a human being.

Unfortunately, the term “topic” assumes a completely different meaning in the models discussed so far.

Let us remember the Document-Term Matrix. At a high level, we want to decompose it into the product of a document-topic and a topic-term matrix, and extract the latent dimensions (the topics) in the process. The goal of these strategies (like LSA) is the minimization of the decomposition error.

Probabilistic generative models (like LDA) add an additional layer of statistical formalism with a robust and elegant Bayesian approach, but what they really try to do is to recreate the original document-word distribution with minimal error.

None of these models ensure that the obtained topics will be informative or useful from a human point of view.

In the beautiful words of Blei³ et al. (2003):

“We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words.”

On the other hand, BERTopic and Top2Vec leverage semantic embeddings. Therefore, the vectors used to represent documents carry a proxy (the closest we have so far) of their “meaning” from a “human” perspective. These amazing models assume that clustering a projection of these embeddings may lead to more meaningful and specific topics.

Studies (to cite a few: Grootendorst¹¹ 2022, Angelov¹² 2020, Egger¹⁶ et al. 2022) show that topics obtained leveraging semantic embeddings are more informative and coherent across several domains.

But even in this case, there is still an underlying mismatch between the definition of topic for a human and for such models. What these algorithms produce is “interpretable, well represented and coherent groups of semantically similar documents”.

Make no mistake: this is an outstanding and unique result that opened an entirely new frontier in the field and achieved unprecedented performances.

But we may still debate on how this approximates the human definition of topic, and under what circumstances.

And if you think this is a trivial subtlety, have you ever tried to explain a topic like “mail_post_email_posting” to business stakeholders? Yes, it is coherent and interpretable, but is it what they think of when they imagine a “topic”?

This leads us to the second point of attention.

4.2 Topics are not easy to evaluate

Topic modeling is an unsupervised technique. There are no labels to rely on during evaluation.

Coherence measures have been proposed to evaluate topic quality with respect to interpretability. For example, Normalized Pointwise Mutual Information (NPMI) (Bouma¹⁷, 2009) estimates how much more likely the co-occurrence of two words x and y is than we would expect by chance:

NPMI(x, y) = PMI(x, y) / −ln P(x, y), with PMI(x, y) = ln [ P(x, y) / (P(x) ∙ P(y)) ]

NPMI can vary from -1 (no co-occurrences) to +1 (perfect co-occurrence). Independence between the occurrences of x and y results in NPMI=0.

Lau¹⁸ et al. (2014) suggest that this metric emulates human judgement to a reasonable extent.

Other coherence measures also exist, for example Cv (Röder¹⁹ et al., 2015) and UMass (Mimno²⁰ et al., 2011).
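As a sketch of how such measures are computed in practice, Gensim's CoherenceModel can score a set of topics against a tokenized corpus; the corpus and the candidate topics below are toy assumptions:

# A minimal sketch of scoring topics with Gensim's CoherenceModel.
# The tokenized corpus and the candidate topics are toy assumptions.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

texts = [
    ["bank", "loan", "credit", "interest"],
    ["team", "match", "player", "season"],
    ["bank", "credit", "mortgage", "rate"],
    ["player", "team", "coach", "match"],
]
topics = [["bank", "loan", "credit"], ["team", "match", "player"]]

dictionary = Dictionary(texts)
for measure in ("c_npmi", "u_mass"):
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence=measure)
    print(measure, cm.get_coherence())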

These coherence metrics suffer from a series of shortcomings:

  • There is no shared convention on which metric to use for qualitative performance (Zuo²¹ et al., 2016, Blair²² et al., 2020, Doogan²³ et al., 2021).
  • Blair²² et al., (2020) reported inconsistent results between different coherence measures.
  • Doogan²³ et al. (2021) showed the unreliability of coherence measures for evaluating topic models for specific domains (Twitter data).
  • Hoyle²⁴ et al. (2021) suggest that metrics like NPMI may fail to assess the interpretability of neural topic models.
  • The use of Cv was discouraged²⁵ by its author due to reported inconsistencies.

As Grootendorst¹¹ (2022) writes:

“Validation measures such as topic coherence and topic diversity are proxies of what is essentially a subjective evaluation. One user might judge the coherence and diversity of a topic differently from another user. As a result, although these measures can be used to get an indication of a model’s performance, they are just that, an indication”.

In conclusion, validation measures fail to estimate a topic model’s performance with clarity. They do not provide the unequivocal interpretation that an accuracy or F1 score would for a classification problem. As a consequence, the quantification of a “measure of goodness” for the obtained topics still requires domain knowledge and human evaluation. The assessment of business value (“Will these topics benefit the project?”) is no trivial feat, and may need composite metrics and a holistic approach.

5. Conclusions

In this post, we shared a friendly overview of popular topic modeling algorithms, from generative statistical models to transformers-based approaches.

We also provided a table highlighting advantages and disadvantages of each technique. This could be used for comparison and to aid a preliminary model selection in different scenarios.

Finally, we shared two of the most challenging aspects of unsupervised textual data analysis.

First, the difference, so often overlooked, between the human definition of “topic” and its statistical counterpart as the result of a “topic modeling” algorithm. Understanding this discrepancy is paramount to meeting project goals and guiding the expectations of business stakeholders in NLP endeavours.

Then, we discussed the difficulty of quantitatively assessing topic model performance, introducing popular metrics and their shortcomings.

6. References

[1] Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science, Volume 41, Issue 6, p. 391–407, 1990 (link).

[2] Hofmann, Probabilistic Latent Semantic Analysis, Proceedings of the XV Conference on Uncertainty in Artificial Intelligence (UAI1999), 1999 (link).

[3] Blei et al., Latent dirichlet allocation, The Journal of Machine Learning Research, Volume 3, p. 993–1022, 2003 (link).

[4] Lee et al., Learning the parts of objects by non-negative matrix factorization, Nature, Volume 401, p. 788–791, 1999 (link).

[5] Barbieri et al., Probabilistic topic models for sequence data, Machine Learning, Volume 93, p. 5–29, 2013 (link).

[6] Rizvi et al., Analyzing social media data to understand consumers’ information needs on dietary supplements, Stud. Health Technol. Inform., Volume 264, p. 323–327, 2019 (link).

[7] Alnusyan et al., A semi-supervised approach for user reviews topic modeling and classification, International Conference on Computing and Information Technology, 1–5, 2020 (link).

[8] Egger and Yu, Identifying hidden semantic structures in Instagram data: a topic modelling comparison, Tour. Rev. 2021:244, 2021 (link).

[9] Xu et al., Document clustering based on non-negative matrix factorization, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, p. 267–273, 2003 (link).

[10] Casalino et al., Nonnegative matrix factorizations for intelligent data analysis, Non-negative Matrix Factorization Techniques. Springer, p. 49–74, 2016 (link).

[11] Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, 2022 (link).

[12] Angelov, Top2Vec: Distributed Representations of Topics, 2020 (link).

[13] McInnes et al., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, 2018 (link).

[14] Allaoui et al., Considerably improving clustering algorithms using umap dimensionality reduction technique: A comparative study, International Conference on Image and Signal Processing, Springer, p. 317–325, 2020 (link).

[15] McInnes et al., hdbscan: Hierarchical density based clustering, The Journal of Open Source Software, 2(11):205, 2017 (link).

[16] Egger et al., A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts, Frontiers in Sociology, Volume 7, Article 886498, 2022 (link).

[17] Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL, 30:31–40, 2009 (link).

[18] Lau et al., Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, p. 530–539, 2014 (link).

[19] Röder et al., Exploring the space of topic coherence measures, Proceedings of the eighth ACM international conference on Web search and data mining, p. 399–408. ACM, 2015 (link).

[20] Mimno et al., Optimizing semantic coherence in topic models, Proc. of the Conf. on Empirical Methods in Natural Language Processing, p. 262–272, 2011 (link).

[21] Zuo et al., Word network topic model: a simple but general solution for short and imbalanced texts, Knowledge and Information Systems, 48(2), p. 379–398, 2016 (link).

[22] Blair et al., Aggregated topic models for increasing social media topic coherence, Applied Intelligence, 50(1), p. 138–156, 2020 (link).

[23] Doogan et al., Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p. 3824–3848, 2021 (link).

[24] Hoyle et al., Is automated topic model evaluation broken? the incoherence of coherence, Advances in Neural Information Processing Systems, 34, 2021 (link).

[25] https://github.com/dice-group/Palmetto/issues/13

