
Dissecting the Reach and Impact of Twitter’s Top Voices | by John Adeojo | Apr, 2023



Mapping the Twitter Influence Landscape with Data Science

A deep-dive into the relationships and patterns that shape the Twitterverse’s most powerful voices

Image by Author generated with Midjourney prompt “a Baroque style masterpiece depicting the state of Twitter in 2023”

Twitter has been at the centre of numerous controversies because of the influence of some of the largest accounts on the platform. The top 100 Twitter accounts (by follower count) have an estimated combined following of around 4.1 billion. We have seen how the top voices of Twitter can sway political opinion, impact financial markets, set trends, and even stir up hatred. Naturally, as a data scientist I was curious about what patterns a deep-dive analysis of their tweets could reveal.

The rest of this post delves into my efforts to understand the nature of these accounts’ influence by examining the relationships between them. I will use clustering algorithms and statistical analysis to uncover patterns within and across clusters, hoping to gain a deeper understanding of the nature of the influence these top voices have.

The source of the top 100 Twitter accounts by number of followers is Social Blade (March 2023).

Disclaimer: This analysis is purely exploratory and should not be considered definitive. The author has no affiliation with the account owners mentioned and the insights provided are not endorsed by them.

I like sharing applications alongside my projects to make things more interactive and engaging for the reader. I have built the clustering and statistical analysis into a web app that you can experiment with. You can adjust the hyperparameters of the clustering algorithm and see for yourself how they impact the analysis. I would recommend reading on first to familiarise yourself with the approach.

💻Account clustering app — This will work on mobile, but it’s best viewed on a computer.

Note: The app doesn’t run on the Streamlit server exactly as it does locally because of library dependency issues. There is a slight mismatch between the cluster labels in the app and those shown in this blog post; the clusters themselves are the same.

I used the Twitter API to extract the latest 100 publicly available tweets from each account. Twitter data is messy so I conducted the usual pre-processing of removing URLs before progressing with the analysis.

Retweets were removed from the analysis so the focus is purely on original tweets from each account. This leaves some accounts with fewer than 100 tweets; limitations of the Twitter API make this difficult to rectify.

Please note that this is within Twitter’s terms of service, which allow analysis and aggregation of publicly available data via the API. The data is permitted for both non-commercial and commercial use.

  1. Defining Influence: How I defined and quantified influence.
  2. Dimension Reduction & Clustering: Using UMAP and HDBSCAN algorithms to reveal hidden structures in the data.
  3. Statistical Analysis: Deep dive statistical analysis into the clusters.
  4. Observations: Commentary on the results of the clustering and statistical analysis.
  5. Limitations & Extensions: A brief discussion of the limitations of this approach and how it could be extended.

The first step was to define exactly what is meant by influence. I conceptualised it as two overarching categories: engagement and impact. Engagement estimates an account’s interaction with its followers through the tweets it shares. Impact estimates the emotion and/or sentiment behind the tweets.

Engagement metrics

All engagement metrics are adjusted by the number of followers the account had at the time they shared the tweet and then normalised.

  • Favourite: The number of times the tweet was favourited.
  • Re-tweet: The number of times the tweet was re-tweeted.
  • Quote: The number of times the tweet was quoted.
  • Reply: The number of times the tweet was replied to.
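The metric-preparation code isn’t embedded in the post, but a minimal sketch of the follower adjustment might look like the following. The raw column names are hypothetical (the `_pf`, per-follower, suffix matches the Dunn’s test output shown later), and min-max scaling is an assumption: the post only states that the metrics are follower-adjusted and normalised to the range 0 to 1.

```python
import pandas as pd

# Hypothetical raw column names; only the "_pf" suffix is taken from the post.
ENGAGEMENT_COLS = ["favorite_count", "retweet_count", "quote_count", "reply_count"]

def follower_adjust(df: pd.DataFrame) -> pd.DataFrame:
    """Divide each engagement count by follower count, then min-max normalise."""
    out = df.copy()
    for col in ENGAGEMENT_COLS:
        pf = out[col] / out["followers_count"]  # adjust by follower count
        rng = pf.max() - pf.min()
        # Min-max normalise to [0, 1]; constant columns collapse to 0.
        out[col + "_pf"] = (pf - pf.min()) / rng if rng else 0.0
    return out
```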

Impact metrics

Impact metrics were subcategorised into sentiment and emotion distributions by running classification models over each tweet. Following this, the distributions across each sub-category were calculated at the account level. For sentiment the sub-categories are: Positive, Negative, and Neutral. For emotion they are: Anger, Joy, Optimism, and Sadness.

Module for running sentiment/emotion models across tweets in a data frame.

Note: TextCleaner is a module I created to remove URLs from tweets.
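TextCleaner itself isn’t shown in the post; a minimal sketch of the URL stripping it performs, applied before tweets are fed to the classifiers, could look like this (the function name and regex are illustrative):

```python
import re

# Matches http(s) links (including t.co short links) and bare www. URLs.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text: str) -> str:
    """Strip links and collapse the whitespace left behind."""
    return " ".join(URL_PATTERN.sub("", text).split())
```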

Run sentiment and emotion analysis on each tweet and create distributions at account level
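A sketch of the account-level aggregation step, assuming a data frame with one row per tweet, a `user_id` column, and a column of predicted labels from the classifier (column names are assumptions based on the output described below):

```python
import pandas as pd

def label_distributions(tweets: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Share of each predicted label per account; each row sums to 1."""
    return (
        tweets.groupby("user_id")[label_col]
              .value_counts(normalize=True)   # within-account label proportions
              .unstack(fill_value=0.0)        # one column per label
    )
```

The same function can be applied once for the sentiment labels and once for the emotion labels, and the two resulting tables joined on `user_id`.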

Output from the sentiment and emotion detection. Each user_id refers to a single account. This is joined back on to the account level metrics table prior to clustering.

Image by Author: Sentiment and emotion distributions at the account level

A note on the use of large language models

I used the Twitter-roBERTa-base models for emotion and sentiment classification. The researchers pretrained the models on around 60M tweets and fine-tuned them separately for emotion recognition and sentiment analysis¹. The training corpus consisted of tweets (in English) automatically labelled by Twitter. The researchers have published performance metrics for the models across all tasks.

Image taken from Barbieri (2020)¹

The purpose of dimension reduction and clustering is to reveal hidden structures, and therefore relationships between accounts with respect to the influence metrics defined. I’ll go on to describe in detail how I did this.

Tallying up all the influence metrics gives a total of 12 dimensions to cluster over. Clustering in a high-dimensional space like this can obscure patterns (see the curse of dimensionality), and it is also impossible to visualise. I addressed this with UMAP, reducing the 12 dimensions to just two.

Dimension reduction with UMAP

At a high level, UMAP uses graph layout algorithms to reduce data from a higher dimensional space to a lower dimensional space while maintaining as much structural similarity as possible.

You can think of this as preserving the ‘information’ from the high dimensional space in a lower dimension.

UMAP won’t perfectly retain all the information from the 12 dimensions, but with the right choice of hyperparameters it retains enough to give us some insight into the structure of the data. Selecting the right UMAP parameters was more of an art than a science: I mainly adjusted them until what appeared to be coherent clusters formed, considering things like sample size. I’ll give a brief explanation of each hyperparameter and its effect on the clustering.

Here’s a cool resource to help you better understand UMAP

  • n_neighbors: Determines the number of nearest neighbours considered for each data point when constructing the high-dimensional graph, balancing local against global structure. A smaller value prioritises local structure and yields more detailed clusters, while a larger value emphasises global structure, leading to more connected and less distinct clusters.
  • min_dist: Controls the minimum distance between data points in the low-dimensional embedding constructed by UMAP, determining how tightly packed the points are. A smaller value generates more compact clusters, preserving local structure. A larger value spreads the clusters out, making global relationships easier to see but potentially losing some finer detail.
  • metric: The distance metric UMAP uses to measure distances between data points. Since the 12 dimensions are continuous and normalised (values ranging between 0 and 1), I chose Euclidean distance, which measures the straight-line distance between two points and captures the relationships in the normalised dataset well.

Clustering with HDBSCAN

After reducing the dimensionality of the metric space, I was able to apply clustering and then visualise the results. For clustering I used the HDBSCAN algorithm, a density-based clustering algorithm that works effectively in the two-dimensional space.

At a high level, HDBSCAN transforms the space between data points according to density, builds the minimum spanning tree of the distance-weighted graph, forms a cluster hierarchy, and then extracts stable clusters. I used HDBSCAN because of its robustness and simplicity: there is only one (important) hyperparameter to adjust, the minimum cluster size.

The creators of the HDBSCAN library provide extensive documentation if you want to learn more about the inner workings of the algorithm.

Module I wrote to generate clusters:

Module for performing UMAP dimension reduction and Clustering
Run cluster analysis

Running the clustering on the account data generated six stable clusters. I’ll use some statistical analysis to investigate these clusters in the next section.

Image by Author: account clusters generated by the cluster analysis

You can experiment with the clustering yourself through the web app I built showcasing the analysis.

I conducted some deep-dive analysis across clusters to uncover hidden relationships. All statistical analysis is assessed at a significance level of 0.05.

Who’s in the clusters?

At this point I think it makes sense to list the individual accounts within each cluster. This should provide some context for the analysis that follows.

Cluster 0: The political cluster — Mainly politicians and world leaders. Surprisingly, Emma Watson and Bill Gates have been included in this cluster… could the model be picking up on future career ambitions? Both are known to be fairly politically active, and this may be coming out in their tweets.

5        Barack Obama
6 Joe Biden
48 Bill Gates
72 Emma Watson
87 PMO India
92 Hillary Clinton
94 Amit Shah
95 President Biden

Cluster 1: News Media (mainly) — This is mainly news outlets but there are also some celebrities in there. Some of which are known for their controversial tweets — could this be why the model has clustered them with news media?

0       CNN Breaking News
1 BBC News (World)
2 CNN
3 Twitter
4 The New York Times
7 Reuters
9 BBC Breaking News
10 The Economist
19 National Geographic
26 Wiz Khalifa
33 Kourtney Kardashian
34 Donald J. Trump
45 Nicki Minaj
46 Elon Musk
62 Conan O'Brien
91 Cardi B

Cluster 2: Mainly Sports Orgs — Interestingly, quite a few sports-related accounts have been included in this cluster: I think all but two are here. However, it’s difficult to classify cluster 2 as just sport, as there are a large number of other types of accounts.

Note: referring to teams and organisations not the athletes themselves.

8                      ESPN
11 YouTube
12 PlayStation
13 NASA
15 Real Madrid C.F.
24 NFL
25 NBA
36 SportsCenter
38 Drizzy
39 Justin Bieber
43 SpaceX
56 FC Barcelona
73 BIGHIT MUSIC
77 Adele
79 Whindersson
80 netflixbrasil
82 Miley Cyrus
90 UEFA Champions League
93 BTS_official

Cluster 3: Pop Stars & Talk Show Hosts — Cluster 3 is dominated by pop stars but also captures some TV personalities, among them Oprah Winfrey and Ellen DeGeneres. One could broadly classify this cluster under entertainment. There is only one organisation in here, Instagram, so we could unofficially call this the “A-lister” cluster.

16         Jimmy Fallon
17 Ellen DeGeneres
20 Taylor Swift
21 PRIYANKA
23 Oprah Winfrey
28 Demi Lovato
29 KATY PERRY
30 LeBron James
32 Selena Gomez
37 Justin Timberlake
42 Khloé
47 Shakira
51 Rihanna
57 Bruno Mars
58 Shah Rukh Khan
61 Hrithik Roshan
63 Lil Wayne WEEZY F
69 Kendall
70 Liam
71 Neymar Jr
74 zayn
75 Instagram
78 One Direction
85 Shawn Mendes

Cluster 4: Athletes — Cluster 4 is primarily athletes, mainly footballers and cricketers. This cluster is almost entirely uniform with the exception of Google, Britney Spears and Neil Patrick Harris.

18      Britney Spears 🌹🚀
22 Narendra Modi
27 Google
49 Kaka
50 Virat Kohli
54 Neil Patrick Harris
55 Andrés Iniesta
66 Sachin Tendulkar
67 Amitabh Bachchan
68 Cristiano Ronaldo
86 Arvind Kejriwal
88 Mesut Özil

Cluster 5: Mainly pop stars — The mode appears to be pop stars but there are also a fair few famous personalities and actors in here. Obvious outliers are Manchester United and Premier League. Similar to cluster three you could broadly classify this as entertainment.

14            Lady Gaga
31 Kevin Hart
35 Kim Kardashian
40 P!nk
41 Akshay Kumar
44 Alicia Keys
52 Louis Tomlinson
53 jlo
59 Deepika Padukone
60 Niall Horan
64 Chris Brown
65 Salman Khan
76 Harry Styles.
81 Kylie Jenner
83 방탄소년단
84 Premier League
89 Manchester United

Chi-square analysis of sentiment & emotion

Are there any associations between clusters and sentiment/emotion distributions? To answer this question I ran chi-square tests for statistically significant associations. The chi-square test compares the sentiment/emotion distribution we would expect if labels were randomly allocated across clusters with what was actually observed in the data. I calculated standardised residuals for each cluster and sentiment/emotion combination to measure how far each observation deviates from its expected value (displayed in the heat maps). You can loosely interpret a higher standardised residual as an indication that a cluster has a greater propensity for a specific emotion or sentiment.
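A sketch of this computation with SciPy. Note the hedge: these are Pearson standardised residuals, (O − E)/√E; fully adjusted residuals would additionally divide by the row and column leverage terms, and the post doesn’t state which variant was used.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi_square_residuals(observed: np.ndarray):
    """Chi-square test of independence plus Pearson standardised residuals."""
    chi2, p, dof, expected = chi2_contingency(observed)
    residuals = (observed - expected) / np.sqrt(expected)  # (O - E) / sqrt(E)
    return chi2, p, residuals
```

Here `observed` would be a clusters-by-labels contingency table of tweet counts; the residual matrix is what the heat maps display.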

Chi-square statistic: 2291.23, p-value less than 0.05

Image by Author: Standard residuals of the sentiment by cluster

Chi-square statistic: 1535.78, p-value less than 0.05

Image by Author: Standard residuals of the emotions by cluster

Statistical analysis of engagement metrics

I wanted to understand how the engagement metrics were distributed across the clusters. I did this by performing a series of statistical tests.

Kruskal-Wallis Test: I conducted an initial Kruskal-Wallis test on each engagement metric to understand whether there were statistically significant differences in the median value of each engagement metric between the clusters.

Note: The Kruskal-Wallis test assumes that all compared distributions have the same shape and that samples are random and independent². A quick examination of the box plots reveals most distributions are long-tailed. The Kruskal-Wallis test was applied to log-transformed, follower-adjusted engagement metrics.
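A sketch of this step with SciPy, taking one list of metric values per cluster:

```python
import numpy as np
from scipy.stats import kruskal

def kw_test_by_cluster(metric_by_cluster):
    """Kruskal-Wallis H-test on log-transformed follower-adjusted metrics."""
    logged = [np.log1p(np.asarray(g)) for g in metric_by_cluster]  # tame long tails
    return kruskal(*logged)  # returns (H-statistic, p-value)
```

One aside: because the test operates on ranks and log1p is monotone, the transform doesn’t change the H-statistic; it mainly helps when visualising the long-tailed distributions.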

Image by Author: shapes of engagement metric distributions

The results of the Kruskal-Wallis tests indicate that there are statistically significant differences in the median values of our engagement metrics across clusters. However, they do not tell us which clusters differ from one another, so I performed some post-hoc statistical tests to determine pairwise cluster differences.

Note: p-values less than 0.05 indicate statistical significance.

favorites: Kruskal-Wallis H = 895.74, p = 2.23e-191
retweets:  Kruskal-Wallis H = 767.61, p = 1.18e-163
quotes:    Kruskal-Wallis H = 440.95, p = 4.41e-93
replies:   Kruskal-Wallis H = 595.26, p = 2.13e-126

Dunn’s Test: Dunn’s test is a non-parametric post-hoc test used to perform pairwise multiple comparisons following a Kruskal-Wallis test. It compares the differences in the ranked data between pairs of groups while adjusting for multiple comparisons³. The main assumptions for Dunn’s test are:

  1. Independent observations: The observations in each group should be independent of each other.
  2. Ordinal or continuous data: The data should be ordinal or continuous in nature.
  3. Homogeneity of variances (dispersion) across groups: Although Dunn’s test is less sensitive to the assumption of homogeneity of variances than parametric tests like ANOVA, it still assumes that the dispersion of the data is similar across groups. If this assumption is violated, the results of the test may be less reliable.
  4. Random sampling: The data should be obtained through random sampling from the populations of interest.
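The post doesn’t show the Dunn’s test code (the scikit-posthocs library provides `posthoc_dunn` for this). As a from-scratch illustration of the mechanics, here is a simplified sketch with Bonferroni correction; it omits the tie-correction term that full implementations include, so treat it as pedagogical rather than a drop-in replacement:

```python
from itertools import combinations

import numpy as np
from scipy.stats import rankdata, norm

def dunn_test(groups):
    """Pairwise Dunn's test after Kruskal-Wallis, Bonferroni-adjusted p-values.

    Simplified: no tie correction. Returns {(i, j): adjusted p-value}.
    """
    pooled = np.concatenate(groups)
    ranks = rankdata(pooled)            # rank all observations jointly
    n_total = len(pooled)
    sizes = [len(g) for g in groups]
    # Mean rank of each group within the pooled ranking.
    mean_ranks, start = [], 0
    for n in sizes:
        mean_ranks.append(ranks[start:start + n].mean())
        start += n
    n_pairs = len(groups) * (len(groups) - 1) // 2  # Bonferroni divisor
    p_values = {}
    for i, j in combinations(range(len(groups)), 2):
        se = np.sqrt(n_total * (n_total + 1) / 12 * (1 / sizes[i] + 1 / sizes[j]))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p_values[(i, j)] = min(1.0, 2 * norm.sf(abs(z)) * n_pairs)
    return p_values
```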

Dunn’s test tells us whether the cluster medians are significantly different. Where the p-value is “NaN”, there was no significant difference in the medians. For brevity I have only shown the results of Dunn’s test for follower-adjusted favourites.

favorite_count_pf
Significant differences in Dunn's test (adjusted p-values):

              0              1             2              3             4              5
0           NaN   8.478780e-39           NaN   1.738409e-10  6.224105e-06   3.376921e-07
1  8.478780e-39            NaN  1.313445e-46  1.228485e-155  1.879287e-87  8.379384e-119
2           NaN   1.313445e-46           NaN   3.520918e-23  6.119853e-12   9.390205e-16
3  1.738409e-10  1.228485e-155  3.520918e-23            NaN           NaN            NaN
4  6.224105e-06   1.879287e-87  6.119853e-12           NaN           NaN            NaN
5  3.376921e-07  8.379384e-119  9.390205e-16           NaN           NaN            NaN

I have generated heatmaps across all the engagement metrics showing the direction of the statistically significant median differences. The differences are in units of count per follower for ease of interpretation, and are expressed as row minus column.

Image by Author: Difference in medians for follower adjusted favorite and retweets (difference expressed as row (i) minus column (j))
Image by Author: Difference in medians for follower adjusted quote and reply (difference expressed as row (i) minus column (j))

Cluster 0 — political cluster

Emotion & Sentiment: What stands out immediately is the strong standardised residual for anger (27). In my opinion this isn’t too surprising given the prominence of politicians in the cluster; notice that optimism is low here too. Might this change depending on the broader economic and political climate? At the time of writing we have seen the collapse of Silicon Valley Bank, mass layoffs in the tech industry, and an increased cost of living. The anger and low optimism could be a reflection of this.

Engagement Metrics: The median values for follower-adjusted quotes and replies are (almost) consistently higher than in the other clusters, with the exception of cluster 3. If I were to guess, I would imagine this observation is largely driven by political debate. From the box plots we can see that outliers are fewer in number for cluster 0, which could be partly due to this cluster’s smaller size relative to the others.

Image by Author: Word cloud for cluster 0

Cluster 1 — News Media:

Emotion & Sentiment: The standard residuals for negative and positive sentiment are 16 and -17 respectively. This makes sense given the cluster is mainly news media outlets. Strangely, there are high positive residuals for both joy and sadness. This might indicate some misalignment between the sentiment model and the emotion model, although it’s hard to say from this analysis alone. Neutrality is also strongly positive, which might be expected when communicating things like financial news. The presence of controversial public personalities might also be contributing to the residuals observed.

Engagement Metrics: Cluster 1 has consistently lower median engagement than the others across the board. This may be due to the strong neutrality, perhaps driven by the news outlets.

Image by Author: Word cloud for cluster 1

Cluster 2 — Mainly Sports Media

Emotion & Sentiment: There is a strong neutral standard residual across this cluster. It also has the lowest negative residual and the second lowest anger residual, counterbalanced by the second highest optimism residual. For sports media, optimism makes sense as I would imagine there is a lot of promotional material.

Engagement Metrics: Similar to cluster 1, and interestingly both share a strong neutral residual. The fact that these clusters are somewhat dominated by news/media organisations rather than people might be driving the lower engagement.

Image by author: Word cloud for cluster 2

Cluster 3 — Pop Stars & Talk Show Hosts

Emotion & Sentiment: Cluster 3 is heavy on entertainment. The residuals are stronger in the direction of positive sentiment and optimism. Perhaps this makes sense, as entertainment is typically supposed to be light-hearted and fun.

Engagement Metrics: Cluster 3 has the highest median engagement across the board. Looking at the box plots, we can see that outliers are greater in number and magnitude than in the other clusters. Maybe this shouldn’t come as a surprise given the names within the cluster: it appears to have the highest concentration of A-list celebrity accounts of all the clusters. The outliers give some indication that tweets from this cluster have a comparatively higher propensity for going “viral”. I would caution that this could simply be a sampling effect given the larger size of cluster 3 relative to the others. However, the larger median differences in engagement are somewhat indicative of virality.

Image by author: Word cloud for cluster 3

Cluster 4 — Athletes

Emotion & Sentiment: Predominantly high positivity, some anger, some optimism, but crucially low sadness. I think this captures the high energy nature of athletes pretty well.

Engagement Metrics: Median follower-adjusted retweets, favourites, and replies were higher than in all clusters except clusters 3 and 0. This might be related to the type of content being shared; there might be more video content in this cluster, for example. Median quotes are also lower than in other clusters, which is perhaps more evidence of video and image content over written content.

Image by author: Word cloud for cluster 4

Cluster 5 — Mainly popstars

Emotion & Sentiment: The cluster has the lowest anger, and sadness along with the highest optimism and positive sentiment residual. It shares similarities with cluster 3.

Engagement Metrics: Cluster 5 is a mixed bag but follows similar patterns to the other entertainment-related clusters, albeit with slightly weaker effects from an engagement point of view. This cluster is analogous to cluster 3 in that there are some comparatively large outliers across the engagement metrics. One could argue that cluster 5 is really just a subset of cluster 3, looking at its engagement and sentiment.

Image by author: Word cloud for cluster 5

Before closing I should address the limitations of this approach, of which there are several.

  1. The clusters are not necessarily temporally stable since we are only looking at the last 100 tweets across each account. The way an account tweets can change over time depending on complex external circumstances. The analysis could be extended to capture temporal relationships.
  2. The approach relies on pre-trained classification models. Although they have performed well against state-of-the-art benchmarks, they are by no means perfect. That said, it would be interesting to include more impact metrics such as irony and hate-speech detection.
  3. Clusters can change depending on the hyperparameters the analyst selects. It’s always good to pair this type of unsupervised machine learning with subject matter expertise to ensure the clusters formed make sense. It would be interesting to collaborate with social media experts to get their view on the insights generated.

Thanks for reading.

Follow me on LinkedIn

[1] Barbieri, F., Camacho-Collados, J., Neves, L., & Espinosa-Anke, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Snap Inc. & School of Computer Science and Informatics, Cardiff University. https://arxiv.org/abs/2010.12421

[2] Ostertagova, E., Ostertag, O., & Kováč, J. (2014). Methodology and Application of the Kruskal-Wallis Test. Applied Mechanics and Materials, 611, 115–120.

[3] Dinno, A. (2015). Nonparametric Pairwise Multiple Comparisons in Independent Groups using Dunn’s Test. Stata Journal, 15, 292–300. https://doi.org/10.1177/1536867X1501500117




Mapping the Twitter Influence Landscape with Data Science

A deep-dive into the relationships and patterns that shape the Twitterverse’s most powerful voices

Image by Author generated with Midjourney prompt “a Baroque style masterpiece depicting the state of Twitter in 2023”

Twitter has been at the centre of numerous controversies because of the influence of some of the largest accounts on the platform. The top 100 twitter accounts (by following) have an estimated total following of around 4.1 billion. We have seen how the top voices of Twitter can sway political opinion, impact financial markets, set trends, and even stir up hatred. Naturally, as a data scientist I was curious about what patterns could be revealed through a deep dive analysis over their tweets.

The rest of this post delves into my efforts to comprehend the nature of the influence of these accounts by examining the relationships between them. I will use clustering algorithms and statistical analysis to uncover patterns within and across clusters. I hope to gain a deeper understand of the nature of the influence these top voices have.

The source of the top 100 twitter accounts by number of followers is Social Blade (March-2023).

Disclaimer: This analysis is purely exploratory and should not be considered definitive. The author has no affiliation with the account owners mentioned and the insights provided are not endorsed by them.

I like sharing applications alongside my projects to make things more interactive and engaging for the reader. I have built the clustering and statistical analysis into a web that the reader can experiment with. You can adjust the hyperparameters of the clustering algorithm and see for ourself how they would impact the analysis. I would recommend reading on before doing this to familiarise yourself with the approach.

💻Account clustering app — This will work on a mobile, but it’s best viewed on a computer.

Note: The app doesn’t compile on the Streamlit server exactly as it does locally because of library dependency issues. There is a slight misalignment between the cluster labels on the app versus what is shown in the blog post, however clusters are still the same.

I used the Twitter API to extract the latest 100 publicly available tweets from each account. Twitter data is messy so I conducted the usual pre-processing of removing URLs before progressing with the analysis.

Retweets were removed from the analysis so the focus is purely original tweets by each account. This does leave some accounts having less than 100 tweets. Limitations around the Twitter API make it difficult to rectify this.

Please note that this is within Twitter’s terms of service. They allow analysis, aggregation of publicly available data via their API. The data is permitted for both non-commercial and commercial use.

  1. Defining Influence: How I defined and quantified influence.
  2. Dimension Reduction & Clustering: Using UMAP and HDBSCAN algorithms to reveal hidden structures in the data.
  3. Statistical Analysis: Deep dive statistical analysis into the clusters.
  4. Observations: Commentary on the results of the clustering and statistical analysis.
  5. Limitations & Extensions: A brief discussions of the limitations of this approach and how it could be extended.

The first step was to define exactly what is meant by influence. I conceptualised it into two overarching categories, engagement and impact. Engagement estimates the interaction of an account with their followers through the tweets shared. Impact estimates the emotion and/or sentiment behind the tweets.

Engagement metrics

All engagement metrics are adjusted by the number of followers the account had at the time they shared the tweet and then normalised.

  • Favourite: The number of times the tweet was favourited.
  • Re-tweet: The number of times the tweet was re-tweeted.
  • Quote: The number of times the tweet was quoted.
  • Reply: The number of times the tweet was replied to.
Engagement metrics

Impact metrics

Impact metrics were subcategorised into sentiment and emotion distributions by running classification models over each tweet. Following this, the distributions across each sub-category were calculated at the account level. For sentiment the sub-categories are; Positive, Negative, and Neutral. For emotion they are; Anger, Joy, Optimism, and Sadness.

Module for running sentiment/emotion models across tweets in a data frame.

Note the module TextCleaner is a module I created to remove URLs from tweets.

Run sentiment and emotion analysis on each tweet and create distributions at account level

Output from the sentiment and emotion detection. Each user_id refers to a single account. This is joined back on to the account level metrics table prior to clustering.

Image by Author: Sentiment and emotion distributions at the account level

A note on the use of large language models

I used the Twitter-roBERTa-base for Emotion and sentiment classification. Researchers have pretrained the models on around 60M tweets and fine-tuned for the emotion recognition and sentiment analysis separately¹. The training corpus consisted of tweets (in English) automatically labelled by Twitter. The researchers have published performance metrics for the model across all the tasks.

Image taken from Barbieri (2020)¹

The purpose of dimension reduction and clustering is to reveal hidden structures, and therefore relationships between accounts with respect to the influence metrics defined. I’ll go on to describe in detail how I did this.

Tallying up all the influence metrics we have a total of 12 dimensions to cluster over. Clustering over a high dimensional space like this can obscure patterns (see the curse of dimensionality), and not to mention that it is impossible to visualise. I addressed this with UMAP reducing the 12 dimensions to just two.

Dimension reduction with UMAP

At a high level, UMAP uses graph layout algorithms to reduce data from a higher dimensional space to a lower dimensional space while maintaining as much structural similarity as possible.

You can think of this as preserving the ‘information’ from the high dimensional space in a lower dimension.

UMAP won’t perfectly retain all the information from the 12 dimensions, but with the right choice of hyperparameters it retains enough to give us some insight into the structure of the data. Selecting the right UMAP parameters was more of an art than a science, mainly I adjusted parameters until I formed what appeared to be coherent clusters considering things like sample size. I’ll give a brief explanation of each hyperparameter and its effect on the clustering.

Here’s a cool resource to help you better understand UMAP

  • n_neighbours: Determines the number of nearest neighbours considered for each data point when constructing a high-dimensional graph. It adjusts local and global structure in the data. A smaller value prioritises the local structure and results in more detailed clusters, while a larger value will emphasise the global structure and leads to more connected and less distinct clusters.
  • min_distance: Controls the minimum distance between data points in the low-dimensional embedding (constructed by the UMAP algo). It determines how tightly clustered the points are. A smaller value generated more compact clusters, preserving local structure. A larger value spreads out the clusters, making it easier to visualise global relationships but potentially losing some finer details.
  • Distance Metric: The UMAP algorithm uses a distance metric to calculate the minimum distance between data points. Since the 12 dimensions are continuous and normalised (with values ranging between 0 and 1), I chose to use the Euclidean distance. This choice is appropriate because Euclidean distance measures the straight-line distance between two points, effectively capturing the relationships in the normalised dataset.

Clustering with HDBSCAN

After reducing the dimensionality of the metric space, I was able to apply clustering and subsequently data visualisation. For clustering I leveraged the HDBSCAN algorithm, a density-based clustering algorithm that works effectively in the reduced two-dimensional space.

At a high level, HDBSCAN transforms the space between data points according to density, builds the minimum spanning tree of the distance-weighted graph, forms a cluster hierarchy, and then extracts stable clusters. I used HDBSCAN because of its robustness and simplicity: there is only one important hyperparameter to adjust, the minimum cluster size.

The creators of the HDBSCAN library provide extensive documentation if you want to learn more about the inner workings of the algorithm.

Module I wrote to generate clusters: it performs the UMAP dimension reduction and the clustering, then runs the cluster analysis (the code is embedded in the original article).

Running the clustering on the account data generated six stable clusters. I’ll use some statistical analysis to investigate these clusters in the next section.

Image by Author: account clusters generated by the cluster analysis

You can experiment with the clustering yourself through the web app I built showcasing the analysis.

I conducted some deep-dive analysis across clusters to uncover hidden relationships. All statistical analysis is assessed at a significance level of 0.05.

Who’s in the clusters?

At this point it makes sense to list the individual accounts within each cluster. This should provide some context for the analysis that follows.

Cluster 0: The political cluster — Mainly politicians and world leaders. Surprisingly, Emma Watson and Bill Gates have been included in this cluster… could the model be predicting something about their future career ambitions? Both are known to be fairly politically active, so this might be coming out in their tweets.

5        Barack Obama
6 Joe Biden
48 Bill Gates
72 Emma Watson
87 PMO India
92 Hillary Clinton
94 Amit Shah
95 President Biden

Cluster 1: News Media (mainly) — This is mainly news outlets but there are also some celebrities in there. Some of which are known for their controversial tweets — could this be why the model has clustered them with news media?

0       CNN Breaking News
1 BBC News (World)
2 CNN
3 Twitter
4 The New York Times
7 Reuters
9 BBC Breaking News
10 The Economist
19 National Geographic
26 Wiz Khalifa
33 Kourtney Kardashian
34 Donald J. Trump
45 Nicki Minaj
46 Elon Musk
62 Conan O'Brien
91 Cardi B

Cluster 2: Mainly Sports Orgs — Interestingly, quite a few sports-related accounts have been included in this cluster: I think all but two are in here. However, it’s difficult to classify cluster 2 as just sport, as there are a large number of other types of account.

Note: referring to teams and organisations not the athletes themselves.

8                      ESPN
11 YouTube
12 PlayStation
13 NASA
15 Real Madrid C.F.
24 NFL
25 NBA
36 SportsCenter
38 Drizzy
39 Justin Bieber
43 SpaceX
56 FC Barcelona
73 BIGHIT MUSIC
77 Adele
79 Whindersson
80 netflixbrasil
82 Miley Cyrus
90 UEFA Champions League
93 BTS_official

Cluster 3: Pop Stars & Talk Show Hosts — Cluster 3 is dominated by pop stars but has also captured some TV personalities, Oprah Winfrey and Ellen DeGeneres among them. One could broadly classify this cluster under entertainment. There is only one organisation in here, Instagram, so we could unofficially classify this as the “A-lister” cluster.

16         Jimmy Fallon
17 Ellen DeGeneres
20 Taylor Swift
21 PRIYANKA
23 Oprah Winfrey
28 Demi Lovato
29 KATY PERRY
30 LeBron James
32 Selena Gomez
37 Justin Timberlake
42 Khloé
47 Shakira
51 Rihanna
57 Bruno Mars
58 Shah Rukh Khan
61 Hrithik Roshan
63 Lil Wayne WEEZY F
69 Kendall
70 Liam
71 Neymar Jr
74 zayn
75 Instagram
78 One Direction
85 Shawn Mendes

Cluster 4: Athletes — Cluster 4 is primarily athletes, mainly footballers and cricketers. This cluster is almost entirely uniform with the exception of Google, Britney Spears and Neil Patrick Harris.

18      Britney Spears 🌹🚀
22 Narendra Modi
27 Google
49 Kaka
50 Virat Kohli
54 Neil Patrick Harris
55 Andrés Iniesta
66 Sachin Tendulkar
67 Amitabh Bachchan
68 Cristiano Ronaldo
86 Arvind Kejriwal
88 Mesut Özil

Cluster 5: Mainly pop stars — Pop stars appear to be the mode, but there are also a fair few famous personalities and actors in here. The obvious outliers are Manchester United and the Premier League. Similar to cluster 3, you could broadly classify this as entertainment.

14            Lady Gaga
31 Kevin Hart
35 Kim Kardashian
40 P!nk
41 Akshay Kumar
44 Alicia Keys
52 Louis Tomlinson
53 jlo
59 Deepika Padukone
60 Niall Horan
64 Chris Brown
65 Salman Khan
76 Harry Styles.
81 Kylie Jenner
83 방탄소년단
84 Premier League
89 Manchester United

Chi-square analysis of sentiment & emotion

Are there any associations between clusters and sentiment/emotion distributions? To answer this question I ran chi-square tests for statistically significant associations. The chi-square test compares the distribution of sentiment/emotion we would expect if they were randomly allocated across clusters with what was actually observed in the data. I calculated standardised residuals for each cluster and sentiment/emotion combination to measure how far each observation deviates from the expected values (displayed in the heat maps). You can loosely interpret a higher standardised residual as an indication that a cluster has a greater propensity for a specific emotion or sentiment.
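The test and the standardised residuals can be reproduced with scipy and numpy. The contingency table below is made up for illustration, and the residuals computed here are the simple Pearson form, which may differ slightly from the adjusted variant some analyses use:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical tweet counts: clusters (rows) x sentiment (columns:
# negative, neutral, positive) -- illustrative numbers only
observed = np.array([
    [120,  80,  50],
    [200, 300, 100],
    [ 60, 150, 190],
])

chi2, p, dof, expected = chi2_contingency(observed)

# Standardised (Pearson) residuals: deviation of each cell from the
# counts expected under random allocation across clusters
residuals = (observed - expected) / np.sqrt(expected)

print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}")
```

A heat map of `residuals` is then a direct analogue of the figures below.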

Chi-square statistic: 2291.23, p-value less than 0.05

Image by Author: Standard residuals of the sentiment by cluster

Chi-square statistic: 1535.78, p-value less than 0.05

Image by Author: Standard residuals of the emotions by cluster

Statistical analysis of engagement metrics

I wanted to understand how the engagement metrics were distributed across the clusters. I did this by performing a series of statistical tests.

Kruskal-Wallis Test: I conducted an initial Kruskal-Wallis test on each engagement metric to understand if there were statistically significant differences in the median value of each engagement metric between the clusters.

Note: The Kruskal-Wallis test assumes that all compared distributions are the same shape and that samples are random and independent². A quick examination of the box plots reveals most distributions are long-tailed. The Kruskal-Wallis test was therefore applied to log-transformed, follower-adjusted engagement metrics.
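A sketch of the test with scipy on synthetic long-tailed data; the lognormal samples and the log1p transform are assumptions for illustration, not the article’s actual preprocessing:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic follower-adjusted engagement metric for three clusters,
# drawn long-tailed with genuinely different medians
clusters = [
    rng.lognormal(mean=-6.0, sigma=1.0, size=200),
    rng.lognormal(mean=-5.0, sigma=1.0, size=200),
    rng.lognormal(mean=-4.5, sigma=1.0, size=200),
]

# Log-transform before testing, as described above
logged = [np.log1p(c) for c in clusters]

h_stat, p_value = kruskal(*logged)
print(f"H={h_stat:.2f}, p={p_value:.3g}")
```

A small p-value only says that at least one cluster’s median differs, which is why the post-hoc tests below are needed.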

Image by Author: shapes of engagement metric distributions

The results of our Kruskal-Wallis tests indicate that there are statistically significant differences in the median values of our engagement metrics across clusters. However, they do not tell us which clusters differ from one another. I performed some post-hoc statistical tests to determine pairwise cluster differences.

Note: p-values less than 0.05 indicate statistical significance.

favorites
Kruskal-Wallis H-test statistic: 895.7382389032183
P-value: 2.225650711196283e-191
retweets
Kruskal-Wallis H-test statistic: 767.6074631334371
P-value: 1.1759288846534128e-163
quotes
Kruskal-Wallis H-test statistic: 440.94518399410543
P-value: 4.4087653321001564e-93
replies
Kruskal-Wallis H-test statistic: 595.2647170442586
P-value: 2.1329465806092327e-126

Dunn’s Test: Dunn’s test is a non-parametric post-hoc test used to perform pairwise multiple comparisons following a Kruskal-Wallis test. It compares the differences in the ranked data between pairs of groups while adjusting for multiple comparisons³. The main assumptions of Dunn’s test are:

  1. Independent observations: The observations in each group should be independent of each other.
  2. Ordinal or continuous data: The data should be ordinal or continuous in nature.
  3. Homogeneity of variances (dispersion) across groups: Although Dunn’s test is less sensitive to the assumption of homogeneity of variances than parametric tests like ANOVA, it still assumes that the dispersion of the data is similar across groups. If this assumption is violated, the results of the test may be less reliable.
  4. Random sampling: The data should be obtained through random sampling from the populations of interest.
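In practice Dunn’s test is available off the shelf (e.g. posthoc_dunn in the scikit-posthocs package); a minimal self-contained version, under the simplifying assumptions of Bonferroni adjustment and no tie correction, looks like this:

```python
import numpy as np
from scipy import stats

def dunn_test(groups):
    """Pairwise Dunn's test after a Kruskal-Wallis test.

    Sketch only: two-sided z-tests on mean ranks, Bonferroni-adjusted,
    without the tie correction a full implementation would apply.
    """
    data = np.concatenate(groups)
    ranks = stats.rankdata(data)
    n, k = len(data), len(groups)

    # Mean rank of each group within the pooled ranking
    sizes, mean_ranks, start = [], [], 0
    for g in groups:
        sizes.append(len(g))
        mean_ranks.append(ranks[start:start + len(g)].mean())
        start += len(g)

    m = k * (k - 1) // 2  # number of pairwise comparisons
    p_adj = {}
    for i in range(k):
        for j in range(i + 1, k):
            se = np.sqrt(n * (n + 1) / 12 * (1 / sizes[i] + 1 / sizes[j]))
            z = (mean_ranks[i] - mean_ranks[j]) / se
            p_adj[(i, j)] = min(2 * stats.norm.sf(abs(z)) * m, 1.0)
    return p_adj

# Demo: group 1 sits far above groups 0 and 2, which overlap heavily
g0, g1, g2 = np.arange(30.0), np.arange(30.0) + 100, np.arange(30.0) + 0.5
p = dunn_test([g0, g1, g2])
```

Library implementations return these adjusted p-values as a k×k matrix, which is the format of the results shown next.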

Dunn’s test tells us whether the median values of the clusters differ significantly. Where the p-value is “NaN”, there was not a significant difference in the medians. For brevity I have only shown the results of Dunn’s test for follower-adjusted favorites.

favorite_count_pf
Significant differences in Dunn's test (adjusted p-values):

              0              1              2              3              4              5
0           NaN   8.478780e-39            NaN   1.738409e-10   6.224105e-06   3.376921e-07
1  8.478780e-39            NaN   1.313445e-46  1.228485e-155   1.879287e-87  8.379384e-119
2           NaN   1.313445e-46            NaN   3.520918e-23   6.119853e-12   9.390205e-16
3  1.738409e-10  1.228485e-155   3.520918e-23            NaN            NaN            NaN
4  6.224105e-06   1.879287e-87   6.119853e-12            NaN            NaN            NaN
5  3.376921e-07  8.379384e-119   9.390205e-16            NaN            NaN            NaN

I have generated heatmaps across all the engagement metrics showing the direction of the statistically significant median differences. The differences are in the units count per follower for ease of interpretation. Differences are expressed as the row minus column.

Image by Author: Difference in medians for follower adjusted favorite and retweets (difference expressed as row (i) minus column (j))
Image by Author: Difference in medians for follower adjusted quote and reply (difference expressed as row (i) minus column (j))

Cluster 0 — political cluster

Emotion & Sentiment: What stands out immediately is the strong standardised residual for anger (27). In my opinion this isn’t too surprising given the prominence of politicians in the cluster — notice that optimism is low here too. Might this change depending on the broader economic and political climate? At the time of writing we have seen the collapse of Silicon Valley Bank, mass layoffs in the tech industry, and an increased cost of living. The anger and low optimism could be a reflection of this.

Engagement Metrics: The median values for follower-adjusted quotes and replies are (almost) consistently higher than in the other clusters, with the exception of cluster 3. If I were to guess, I would imagine this observation is largely driven by political debate. From the box plots we can see outliers are fewer in number for cluster 0; this could be partly due to the smaller size of this cluster relative to the others.

Image by Author: Word cloud for cluster 0

Cluster 1 — News Media:

Emotion & Sentiment: The standardised residuals for negative and positive sentiment are 16 and -17 respectively. This makes sense given the cluster is mainly news media outlets. Strangely, there are high positive residuals for both joy and sadness. This might indicate some misalignment between the sentiment model and the emotion model, although it’s hard to say from this analysis alone. Neutrality is also strongly positive, which might be expected when communicating things like financial news. The presence of controversial public personalities might also be contributing to the residuals observed.

Engagement Metrics: Cluster 1 has consistently lower median engagement relative to the others across the board. This may be to do with the strong neutrality, perhaps driven by the news outlets.

Image by Author: word cloud for cluster 1

Cluster 2 — Mainly Sports Media

Emotion & Sentiment: There is a strong neutral standard residual across this cluster. It also has the lowest negative residual and the second lowest anger residual, counterbalanced by the second highest optimism residual. For sports media, optimism makes sense as I would imagine there is a lot of promotional material.

Engagement Metrics: Similar to cluster 1, and interestingly they both share a strong neutral residual. The fact that these clusters are somewhat dominated by news/media organisations rather than people might be driving the lower engagement.

Image by author: Word cloud for cluster 2

Cluster 3 — Pop Stars & Talk Show Hosts

Emotion & Sentiment: Cluster 3 is heavy on entertainment. The residuals are stronger in the direction of positive sentiment and optimism. Perhaps this makes sense as entertainment is typically supposed to be light hearted and fun.

Engagement Metrics: Cluster 3 has the highest median engagement across the board. Looking at the box plots, we can see that outliers are greater in number and magnitude than in the other clusters. Maybe this shouldn’t come as a surprise looking at the names within the cluster: it appears to have the highest concentration of A-list celebrity accounts of all the clusters. The outliers give us some indication that tweets from this cluster have a higher propensity to go “viral”, comparatively speaking. I would caution that this could simply be a sampling effect given the larger size of cluster 3 relative to the others. However, the larger median differences for engagement are somewhat indicative of virality.

Image by author: Word cloud for cluster 3

Cluster 4 — Athletes

Emotion & Sentiment: Predominantly high positivity, some anger, some optimism, but crucially low sadness. I think this captures the high energy nature of athletes pretty well.

Engagement Metrics: Median follower-adjusted retweets, favourites, and replies were higher than in all other clusters with the exception of clusters 3 and 0. This might be related to the type of content being shared; there might be more video content in this cluster, for example. Median quotes are also lower than in other clusters, which is perhaps more evidence of video and image content over written content.

Image by author: Word cloud for cluster 4

Cluster 5 — Mainly popstars

Emotion & Sentiment: The cluster has the lowest anger and sadness residuals, along with the highest optimism and positive sentiment residuals. It shares similarities with cluster 3.

Engagement Metrics: Cluster 5 is a mixed bag but follows similar patterns to the other entertainment-related clusters, albeit with slightly weaker effects from an engagement metric point of view. This cluster is analogous to cluster 3 in that there are some large outliers across the engagement metrics, comparatively speaking. One could argue that cluster 5 is really just a subset of cluster 3, looking at its engagement and sentiment.

Image by author: Word cloud for cluster 5

Before closing I should address the limitations of this approach, of which there are several.

  1. The clusters are not necessarily temporally stable, since we are only looking at the last 100 tweets from each account. The way an account tweets can change over time depending on complex external circumstances. The analysis could be extended to capture temporal relationships.
  2. The approach relies on pre-trained classification models. Although they have performed well against state-of-the-art benchmarks, they are by no means perfect. That said, it would be interesting to include more impact metrics, such as irony and hate-speech detection.
  3. Clusters can change depending on the hyperparameters the analyst selects. It’s always good to pair this type of unsupervised machine learning with subject matter expertise to ensure the clusters formed make sense. It would be interesting to collaborate with social media experts to get their view on the insights generated.

Thanks for reading.

Follow me on LinkedIn

[1] Barbieri, F., Camacho-Collados, J., Neves, L., & Espinosa-Anke, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Retrieved from https://arxiv.org/pdf/2010.12421.pdf

[2] Ostertagova, E., Ostertag, O., & Kováč, J. (2014). Methodology and Application of the Kruskal-Wallis Test. Applied Mechanics and Materials, 611, 115–120.

[3] Dinno, A. (2015). Nonparametric Pairwise Multiple Comparisons in Independent Groups using Dunn’s Test. Stata Journal, 15, 292–300. https://doi.org/10.1177/1536867X1501500117
