Cᵥ Topic Coherence Explained

By Emil Rijcken, January 2023

Understanding the metric that correlates the highest with humans

In natural language processing (NLP), topic modeling is a popular task. The goal is to extract the K hidden topics in a corpus of documents. Each topic is a distribution over words, and typically the N most probable words per topic are used to represent that topic. The idea is that if the topic modeling algorithm works well, these top-N words are semantically related. The difficulty is how to evaluate such sets of words. Just as with any machine learning task, model evaluation is critical. In a large systematic study, Röder et al. (2015) found that Cᵥ coherence, a measure unknown until then, correlates the highest with human interpretation. Since then, this measure has been widely used to evaluate topic models, and it is the default setting in Gensim’s CoherenceModel. I couldn’t find an intuitive description of the algorithm online, so I wrote one. Many data scientists use topic modeling to analyse their texts; understanding the evaluation metric will help in tuning both the model and the evaluation itself.
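
For reference, this is roughly what computing the score looks like with Gensim’s CoherenceModel (a minimal sketch; `texts` and `topics` below are toy placeholders for your own tokenized corpus and top-N word lists):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Toy placeholders: a tokenized corpus and the top-N words per topic.
texts = [["mary", "has", "roses"], ["roses", "are", "red"],
         ["violets", "are", "light", "blue"]]
topics = [["roses", "red"], ["violets", "blue"]]

dictionary = Dictionary(texts)

# coherence='c_v' selects the measure explained in this article.
cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())            # one aggregated score
print(cm.get_coherence_per_topic())  # one score per topic
```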

We will start with some background information, which leads the way to the study by Röder et al. After discussing their approach, we will go through the algorithm mathematically. Then, we will walk through the steps with an example (about roses and violets). Lastly, I will share some observations and considerations when using the Cᵥ coherence.

A long-standing problem, with many potential applications far beyond NLP, is quantifying the coherence of a set of statements. Quantitative approaches in NLP are typically inspired by the distributional hypothesis. Several researchers in the 1950s (Martin Joos, Zellig S. Harris and John Rupert Firth) found that synonyms (e.g. oculist and eye-doctor) tended to occur in the same environment of words (near words such as ‘eye’ or ‘examined’). Hence, the distributional hypothesis states that words in the same contexts tend to have similar meanings. For this reason, the word context in a corpus is often used to assess whether a set of topic words is coherent. However, there are many ways to obtain information about ‘the context’ and reflect it in a single score. For example:

  • Do you compare (all) single topic words or sets?
  • If you consider co-occurrence, do you consider this for a whole document or only a specific window of words? If you choose a window, how many words are in the window?
  • Given a comparison of words/word sets, how do you indicate how strongly these are semantically related?
  • Having collected the scores for all topics, how do you aggregate these numbers to reflect one score only?

Recognizing the considerations from above, Röder et al. (2015) define four dimensions that span the configuration space of coherence measures. Or in layman’s terms, they define coherence scores as a combination of the four considerations listed above. They call these dimensions:

  • Segmentation of word subsets,
  • Probability estimation,
  • Confirmation measure,
  • Aggregation.

Then, they define potential settings for each of these dimensions. Based on these settings, they can make 237,912 combinations. They calculate the correlation between human evaluation scores of given topics and each combination. From this systematic study, a previously unknown measure appeared to have the highest correlation: Cᵥ coherence.

In Cᵥ coherence, each topic word is compared with the set of all top-N words of its topic. A boolean sliding window of size 110 is used to assess whether two words co-occur. The confirmation measure then consists of direct and indirect confirmations. For each of the N most probable words per topic, a word vector of size N is created in which cell i contains the Normalized Pointwise Mutual Information (NPMI) between that word and word i, i ∈ {1, 2, …, N}. Then, all the word vectors in a topic are aggregated into one big topic vector. The average of all the cosine similarities between each topic word’s vector and its topic vector (this pairing of single words with the full word set is the segmentation) is used to calculate the Cᵥ score.

Now comes a formal description. Since Medium does not do so well with mathematical notation, I am pasting some notation in here from another document my group and I created.

The Cᵥ score is heavily based on the NPMI score, a measure of how strongly two words are associated, based on how often they co-occur in a corpus relative to what would be expected by chance.

The formula for Normalized Pointwise Mutual Information:

NPMI(wᵢ, wⱼ) = log( (P(wᵢ, wⱼ) + ε) / (P(wᵢ) · P(wⱼ)) ) / −log( P(wᵢ, wⱼ) + ε )

The epsilon is a small constant used to avoid a logarithm of zero.

These probabilities are estimated with a boolean sliding window of size s (s = 110 for Cᵥ): the window slides over each document one token at a time, and each window position counts as one ‘virtual document’. The probabilities in the NPMI formula are then calculated as follows:

The calculation of the probabilities, based on a sliding window of size s:

P(wᵢ) = (number of windows in which wᵢ occurs) / (total number of windows)

P(wᵢ, wⱼ) = (number of windows in which both wᵢ and wⱼ occur) / (total number of windows)
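
A minimal sketch of these two building blocks in Python (my own illustration of the definitions above, not Gensim’s internal code):

```python
from math import log

EPSILON = 1e-12  # small constant that avoids log(0)

def sliding_windows(tokens, s):
    """All window positions of size s over one document (the 'virtual documents')."""
    if len(tokens) <= s:
        return [tuple(tokens)]
    return [tuple(tokens[j:j + s]) for j in range(len(tokens) - s + 1)]

def probabilities(windows, w1, w2):
    """Boolean window counts: a window either contains a word or it does not."""
    n = len(windows)
    p1 = sum(w1 in win for win in windows) / n
    p2 = sum(w2 in win for win in windows) / n
    p12 = sum(w1 in win and w2 in win for win in windows) / n
    return p1, p2, p12

def npmi(p1, p2, p12):
    """NPMI with epsilon smoothing; assumes both words occur at least once."""
    return log((p12 + EPSILON) / (p1 * p2)) / -log(p12 + EPSILON)
```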

Based on the NPMI, a word vector of length N is created for each topic word (direct confirmation):

The word vector of topic word wᵢ (direct confirmation measure):

v(wᵢ) = ( NPMI(wᵢ, w₁), NPMI(wᵢ, w₂), …, NPMI(wᵢ, w_N) )

For the segmentation, we have the following:

The segmentation of word subsets (each single topic word W′ is paired with the full top-N word set W* of its topic):

S = { (W′, W*) : W′ = {wᵢ}, wᵢ ∈ W; W* = W }

To compare each word with its topic, we create K topic vectors, each the sum of the N word vectors of that topic (mind you that the W* here refers to the same W* as in the segmentation above):

The topic vector:

v(W*) = v(w₁) + v(w₂) + … + v(w_N)

Based on the segmentation, the cosine similarity is then calculated between each topic word’s vector and its topic vector, as follows:

The cosine similarity:

cos(v(wᵢ), v(W*)) = ( Σⱼ vⱼ(wᵢ) · vⱼ(W*) ) / ( ‖v(wᵢ)‖ · ‖v(W*)‖ )

Then, the average of all N×K cosine similarities is taken to calculate the Cᵥ score:

The Cᵥ score is the average of all cosine similarities:

Cᵥ = (1 / (N·K)) · Σ cos(v(wᵢ), v(W*)), summed over all K topics and all N words per topic
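
Putting all the steps together, a compact end-to-end sketch could look as follows (again my own illustration of the definitions above, reusing the `sliding_windows`, `probabilities` and `npmi` helpers from earlier, not Gensim’s actual implementation):

```python
def word_vector(word, top_words, windows):
    """Direct confirmation: NPMI of `word` against each of the N top words."""
    vec = []
    for other in top_words:
        p1, p2, p12 = probabilities(windows, word, other)
        vec.append(npmi(p1, p2, p12))
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5
    return dot / norms

def c_v(topics, documents, s=110):
    """Average cosine similarity between every word vector and its topic vector."""
    windows = [w for doc in documents for w in sliding_windows(doc, s)]
    similarities = []
    for top_words in topics:                                 # K topics
        vectors = [word_vector(w, top_words, windows) for w in top_words]
        topic_vector = [sum(col) for col in zip(*vectors)]   # sum of N word vectors
        similarities += [cosine(v, topic_vector) for v in vectors]
    return sum(similarities) / len(similarities)             # mean of N*K values
```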

The beautiful thing about understanding complex things is that once you understand them, they seem not so complex anymore. Having made it to the end of this blog, I hope this is the case for you. If not, we will go to plan B. Let’s walk through an example.

Suppose our corpus consists of four short documents: ‘mary had a lamp’, ‘mary has roses’, ‘roses are red’ and ‘violets are light blue’. Furthermore, we have three topics, with the two most probable words per topic (hence, K = 3, N = 2).

Using a sliding window of size 3 (s = 3), we have the following windows:

(‘mary’, ‘had’, ‘a’), (‘had’, ‘a’, ‘lamp’), (‘mary’, ‘has’, ‘roses’), (‘roses’, ‘are’, ‘red’), (‘violets’, ‘are’, ‘light’), (‘are’, ‘light’, ‘blue’).
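
To make the counting concrete, here is what these six windows give for one pair of words (the pair ‘roses’/‘red’ is my own pick for illustration; the helpers are the ones sketched above):

```python
windows = [('mary', 'had', 'a'), ('had', 'a', 'lamp'),
           ('mary', 'has', 'roses'), ('roses', 'are', 'red'),
           ('violets', 'are', 'light'), ('are', 'light', 'blue')]

p1, p2, p12 = probabilities(windows, 'roses', 'red')
# p1 = 2/6 ('roses' occurs in two windows), p2 = 1/6, p12 = 1/6
print(npmi(p1, p2, p12))  # log(3) / log(6) ≈ 0.61
```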

From these windows, we first compute the window-based probabilities for each topic word and word pair. Based on these probabilities, we compute the NPMI values, which fill the word vector of each topic word. Summing the word vectors per topic gives the three topic vectors, and comparing each word vector with its topic vector gives the cosine similarities. The Cᵥ score for these topics is then calculated as the average of these N×K = 6 cosine similarities.

Although Cᵥ came out as the best coherence score in Röder et al.’s study, there are some issues in practice. In some experiments, the Cᵥ score correlates negatively with other coherence measures. The author reporting this suspects the issue might be caused by the value of epsilon, but the error is not investigated further in the thread, nor am I aware of any peer-reviewed work discussing the issue. My conclusion is that until proven guilty, Cᵥ remains innocent; still, it is good to be aware of this issue. Also, word-embedding-based topic models are now commonly used, and their authors claim they correlate with human judgment similarly to the Cᵥ score. However, in one of my studies, my group and I found that different (embedding-based) coherence scores favoured different topics. Hence, the last word about topic coherence remains to be spoken.

If you like this work, you are likely interested in topic modeling. In that case, you might be interested in the following as well.

We have created a new topic modeling algorithm called FLSA-W (the official page is here, but you can see the paper here).

FLSA-W outperforms other state-of-the-art algorithms (such as LDA, ProdLDA, NMF, CTM and more) on several open datasets. This work has been submitted but is not peer-reviewed yet.

If you want to use FLSA-W, you can download the FuzzyTM package or the flsamodel in Gensim. For citations, please use this paper.

Let me know if you have any questions or remarks.


