
An AI-Powered Analysis of our Postal Service Through Tweets | by John Adeojo | Mar, 2023



Deciphering Customer Voices with AI

Delving into Machine Learning, Topic Modeling, and Sentiment Analysis to Uncover Valuable Customer Perspectives

Image by Author: AI generated sentiment and topic for #royalmail

My partner and I usually experience an excellent postal service. Most of the time, letters arrive at our home unopened and in a timely fashion. That’s why, when our post didn’t arrive for a few weeks, we thought it was quite strange. After some diligent web searching, we discovered that the most likely cause of this service disruption was strikes. As a data scientist, this whole episode got me thinking…

Is there a way to leverage online data to track these types of incidents?

The answer to this question is yes, and I have already built a prototype which is available for you to play with. I recommend doing so before reading on, as it will give you a feel for things ahead of the technical details.

🌏 Explore the m(app)

I’ll spend the remainder of this write-up walking you through how I went about answering this question. This is pretty much an end-to-end machine learning project, exploring aspects of software engineering, social media data mining, topic modelling, transformers, custom loss functions, transfer learning, and data visualisation. If that sounds at all interesting to you, grab a snack or a drink and get comfortable, because this might be quite a long one, but hopefully it’s worth the read.

Disclaimer: This article is an independent analysis of tweets containing the #royalmail hashtag and is not affiliated with, endorsed, or sponsored by Royal Mail Group Ltd. The opinions and findings expressed within this article are solely those of the author and do not represent the views or official positions of Royal Mail Group Ltd or any of its subsidiaries.

When seeking to understand what people think, Twitter is always a good starting point. Much of what people post on Twitter is public and easily accessible through its API. It’s the kind of no-holds-barred verbal arena where you would expect to find plenty of insights on customer service. I got curious and conducted a quick Twitter search myself, starting simply with ‘#royalmail’. And voilà! A tonne of tweets.

With my data source identified, the next thing I did was figure out how I would “mine” the issues raised in those tweets. Topic modelling came to mind immediately as something to try. I figured that using some kind of clustering on the tweets could reveal some latent topics. I’ll spend the rest of the write-up going into some technical details. This won’t be a step-by-step guide, but rather a peek over my shoulder and a window into my thought process in putting this project together.

Development environment: I do the majority of my ML projects in Python, so my preferred IDE is JupyterLab. I find it useful to be able to quickly toggle between Jupyter notebooks, Python scripts, and the terminal.

File structure: This is a rather complex project, if I do say so myself. There are several processes to consider, so it’s not something that could just be done from the safety of a Jupyter notebook. Listing these out, we have: data extraction, data processing, topic modelling, machine learning, and data visualisation. To help create some order, I usually start by establishing an appropriate file structure. You can, and probably should, leverage bash scripting to do this (there is also a small Python sketch after the tree below if you prefer to stay in one language).

│   README.md
│   setup.py
│   __init__.py
│
├───data
│   ├───01_raw
│   │       tweets_details2023-03-15_20-43-36.csv
│   │
│   ├───02_intermediate
│   ├───03_feature_bank
│   ├───04_model_output
│   └───05_Reports
│
├───data_processing
│       collect_tweets.py
│       preprocess_tweets_lite.py
│       preprocess_tweets_rm.py
│       __init__.py
│
├───machine_learning
│       customer_trainer.py
│       makemodel.py
│       preprocess_ml.py
│       train_models.py
│       __init__.py
│
├───notebooks
│       HDBSCAN_UMAP_notebook.ipynb
│       Twitter Model Analysis Notebook .ipynb
│
└───topic_modeling
        bert_umap_topic.py
        tfidf.py
        twitter_roberta_umap_topic.py
        __init__.py
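If you would rather stay in Python than write bash, a minimal pathlib sketch that scaffolds the same tree might look like this (folder names are copied from the tree above; adjust to taste):

```python
from pathlib import Path

# Sub-folders mirroring the project tree above
FOLDERS = [
    "data/01_raw",
    "data/02_intermediate",
    "data/03_feature_bank",
    "data/04_model_output",
    "data/05_Reports",
    "data_processing",
    "machine_learning",
    "notebooks",
    "topic_modeling",
]

def scaffold(root: str = ".") -> None:
    """Create the project folder structure if it doesn't already exist."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    scaffold()
```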

Modularisation: I broke each process down into modules, making it easy to reuse, adapt, and tweak things for different use cases. Modules also help keep your code ‘clean’. Without the modular approach, I would have ended up with a Jupyter notebook or Python script thousands of lines long, which is very unappealing and difficult to debug.

Version control: With complex projects, you do not want to lose your progress, overwrite something important, or mess things up beyond repair. GitHub is really the perfect solution for this, as it makes it hard to mess up badly. I get started by creating a remote repo and cloning it to my local machine, allowing me to sleep easy knowing all my hard work is backed up. GitHub Desktop allows me to carefully track any changes before committing them back to the remote repository.

Packages: I leveraged a tonne of open-source packages; I’ll list the key ones below.

  • Transformers: API for Hugging Face large language models.
  • PyTorch: Framework for building and customising transformers.
  • Streamlit: For building web applications.
  • Scikit-learn: Framework for machine learning.
  • UMAP: Open-source implementation of the UMAP algorithm.
  • HDBSCAN: Open-source implementation of the HDBSCAN algorithm.
  • Folium: For geographic data visualisation.
  • CUDA: Parallel computing platform for leveraging the power of your GPU.
  • Seaborn: A library for data visualisation in Python.
  • Pandas: A library for handling structured data.
  • NumPy: A library for performing numeric operations in Python.

Environment management: Having access to a wealth of libraries on the internet is fantastic, but your environment can quickly run away with you. To manage this complexity, I like to enforce a clean-environment policy whenever I start a new project: strictly one environment per project. I use Anaconda as my environment manager because of the flexibility it gives me.

Note: for the purposes of this project I did create separate environments and GitHub repositories for the Streamlit web application and the topic modelling.

I used the Twitter API to extract around 30k publicly available tweets matching #royalmail. I want to stress that only data that is publicly available can be extracted with the Twitter API, which alleviates some of the data privacy concerns one may have.
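The collection logic lives in collect_tweets.py; a minimal sketch of that kind of script, assuming Tweepy’s v2 client and a bearer token (the endpoint and access tier you need will depend on how far back you want to search), might look like this:

```python
import csv
import tweepy

# Assumes a Twitter API v2 bearer token with search access
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

# Page through tweets matching the hashtag, excluding retweets
tweets = tweepy.Paginator(
    client.search_recent_tweets,
    query="#royalmail -is:retweet",
    tweet_fields=["created_at", "lang"],
    max_results=100,
).flatten(limit=30_000)

with open("data/01_raw/tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "lang", "text"])
    for tweet in tweets:
        writer.writerow([tweet.id, tweet.created_at, tweet.lang, tweet.text])
```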

Twitter data is incredibly messy and notoriously difficult to work with for any natural language processing (NLP) task. It’s social media data loaded with emojis, grammatical inconsistencies, special characters, expletives, URLs, and every other hurdle that comes with free-form text. I wrote my own custom scripts to clean the data for this particular project, primarily to get rid of URLs and annoying stop words. I have given a snippet for the “lite” version, but I also used a more heavy-duty version during clustering.

Module for cleaning URLs from tweets
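As a rough sketch of what that lite cleaner might look like (the regex patterns and stop-word list here are illustrative rather than the exact ones used):

```python
import re

# Illustrative stop words; the real list was tuned by hand
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "it", "rt"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def clean_tweet_lite(text: str) -> str:
    """Remove URLs, @mentions, and stop words; collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

# Example
print(clean_tweet_lite("RT @royalmail delayed again... https://t.co/xyz #royalmail"))
```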

Please note that this is within Twitter’s terms of service. They allow analysis and aggregation of publicly available data via their API, and the data is permitted for both non-commercial and commercial use.

The topic modelling approach I used draws inspiration from BERTopic¹. I had initially tried Latent Dirichlet Allocation, but struggled to get anything coherent out of it. BERTopic was a great reference point, but it hadn’t explicitly been designed to extract topics from messy Twitter data, so, while following many of the same logical steps, I adapted the approach a little for the task.

At a high level, BERTopic uses a BERT model to generate embeddings, then performs dimensionality reduction and clustering to reveal latent topics in documents.

My approach leveraged the twitter-xlm-roberta-base² model to generate embeddings. This transformer has been pretrained on Twitter data and captures all the messy nuances, emojis and all. Embeddings are simply a way to represent sentences in numeric form such that both syntactic and semantic information is preserved; transformers learn them through self-attention. The amazing thing about all the recent innovation in the large language model space is that one can leverage state-of-the-art models to generate embeddings for one’s own purposes.
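As a sketch of the embedding step, assuming the cardiffnlp/twitter-xlm-roberta-base checkpoint on Hugging Face and simple mean pooling over the final hidden states (the pooling choice here is illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cardiffnlp/twitter-xlm-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(tweets: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden state to get one vector per tweet."""
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed(["#royalmail still waiting on my parcel", "lovely new stamps today"])
print(embeddings.shape)  # e.g. (2, 768)
```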

I used the UMAP algorithm to project the tweet embeddings into a two-dimensional space and HDBSCAN to identify clusters. Treating each cluster as a document, I generated TF-IDF scores to extract a list of keywords that roughly ‘define’ each cluster, forming my initial topics.

TF-IDF is a handy way to measure a word’s significance in a cluster, considering how often it appears in that specific cluster and how rare it is in a larger group of clusters. It helps identify words that are unique and meaningful in each cluster.
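Pulling those steps together, a simplified sketch of the reduce, cluster, and describe pipeline might look like this (hyperparameter values are placeholders, not the ones I settled on, and clustering is done on the reduced embeddings for simplicity):

```python
import numpy as np
import umap
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer

# embeddings: NumPy array of shape (n_tweets, dim), e.g. embed(tweets).numpy()
# tweets: list of cleaned tweet strings
reduced = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(embeddings)

# Cluster in the reduced space (euclidean here is a simplification)
labels = hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(reduced)

# Treat each cluster as one "document" and pull out its top TF-IDF words
docs, cluster_ids = [], []
for cluster in sorted(set(labels) - {-1}):          # -1 is HDBSCAN noise
    cluster_ids.append(cluster)
    docs.append(" ".join(t for t, l in zip(tweets, labels) if l == cluster))

vectorizer = TfidfVectorizer(stop_words="english")
scores = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())

top_n = 4  # number of keywords per topic, discussed below
for i, cluster in enumerate(cluster_ids):
    row = scores[i].toarray().ravel()
    keywords = vocab[row.argsort()[::-1][:top_n]]
    print(cluster, ", ".join(keywords))
```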

Some of these dimensionality reductions can be hard to make sense of at first. I found these resources useful for helping me get to grips with the algorithms.

Understanding UMAP — An excellent resource that helps you visualise and understand the impact of adjusting hyperparameters.

HDBSCAN Documentation — The most coherent explanation of HDBSCAN I could find was provided in the documentation itself.

Lastly, I tested the coherence of the topics generated by scoring the cosine similarity between the topics and the tweets themselves. This sounds rather formulaic on paper, but I can assure you it was no straightforward task. Unsupervised machine learning of this nature is largely trial and error; it took me dozens of iterations and plenty of manual effort to find the right parameters to get coherent topics out of these tweets. So rather than going into the specifics of all the hyperparameters I used, I will just talk about the four critical ones that really made or broke this approach.

Distance metrics: For topic modelling, the distance metric is really the difference between forming coherent topics and just generating a random list of words. For both UMAP and HDBSCAN I chose cosine distance. The choice here was a no-brainer considering my objective: to model topics. Topics are semantically similar groups of text, and the best way to measure semantic similarity is cosine distance.

Number of words: After generating the clusters, I wanted to understand the “contents” of those clusters through TF-IDF. The key choice here is how many words to return for each cluster, which could range from one to the number of unique words in the whole corpus. Too many words and your topics become incoherent; too few and you end up with poor coverage of your cluster. Selecting this was a matter of trial and error; after several iterations I landed on four words per topic.

Scoring: Topic modelling isn’t an exact science, so some manual intervention is required to make sure the topics make sense. I could do this for a few hundred or even a few thousand tweets, but tens of thousands? That’s not practically feasible. So I used a numeric “hack”: scoring the cosine similarity between the TF-IDF topics generated and the tweets themselves. Again, this was a lot of trial and error, but after several iterations I found an appropriate cut-off for cosine similarity to be around 0.9. This left me with around 3k of the original 30k tweets that were fairly well categorised. Most importantly, it was a large enough sample size to do some supervised machine learning.
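As a sketch of that scoring hack, assuming the topic keyword strings and the tweets are embedded with the same transformer as before (that embedding choice is an assumption on my part):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# tweet_embeddings: (n_tweets, dim) array from the transformer
# topic_embeddings: (n_topics, dim) array, each topic embedded as its joined keyword string
# labels: topic index per tweet, -1 for HDBSCAN noise
sims = cosine_similarity(tweet_embeddings, topic_embeddings)

valid = labels >= 0                                      # ignore noise points
sim_to_own_topic = sims[np.arange(len(labels)), np.where(valid, labels, 0)]

CUTOFF = 0.9                                             # found by trial and error
keep = valid & (sim_to_own_topic >= CUTOFF)
print(f"Kept {keep.sum()} of {len(labels)} tweets for supervised training")
```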

Topics in 2d: UMAP provides a convenient way to visualise the topics. What we can see is a mass of topics in the centre that have been clustered together, with some smaller niche topics at the edges. It actually reminds me a bit of a galaxy. After doing some detective work (manually trawling through spreadsheets), I found this to make sense. The mass of topics in the centre is mainly around customer service, often complaints. What I thought was particularly fascinating was the model’s ability to isolate very niche areas. These included politics, economics, employment, and philately (which isn’t some minor celebrity, but the collecting of stamps!). Of course, the topics returned by TF-IDF were nowhere near this coherent, but I was able to identify six well-categorised topics from the analysis. My final six topics were customer service, politics, royal reply, jobs, financial news, and philately.

List of four-word topics generated by TF-IDF on the clusters, keeping those with 0.9+ cosine similarity to the tweets:

  • apprenticeship, jinglejobs, job, label: Jobs
  • biggest, boss, revolt, year: Politics
  • birth, reply, royalletters, royalreply: Royal Reply
  • collecting, pack, philatelist, philately: Philately
  • declares, plc, position, short: Financial News
  • definitive, philatelist, philately, presentation: Philately
  • driving, infoapply, job, office: Jobs
  • driving, job, sm1jobs, suttonjobs: Jobs
  • ftse, rmg, share, stock: Financial News
  • germany, royal, royalletter, royalreply: Royal Reply
  • gradjobs, graduatescheme, jobsearch, listen: Jobs
  • labour, libdems, tory, uk: Politics
  • letter, mail, service, strike: Customer Service
  • luxembourg, royal, royalletter, royalreply: Royal Reply
  • new, profit, shareholder, world: Financial News
  • plc, position, reduced, wace: Financial News
Image by Author: A 2d representation of the embedding after applying UMAP
Image by Author: A view of the topics after applying HDBSCAN. The yellow mass is customer service related

The topic modelling was fiddly and definitely not something you want to rely on continuously for generating insights. As far as I’m concerned it should be an exercise that you conduct once every few months or so (depending on the fidelity of your data), just in case anything new comes up.

Having performed the arduous task of topic modelling, I had some labels and a decent-sized data set of just under 3k observations for training a model. Leveraging a pretrained transformer means not having to train from scratch or build my own architecture, and it lets me harness the power of the model’s existing knowledge.

Data Splitting

I proceeded with the standard train, validation, and test splits, with 80% of the observations allocated to training. See the script below:

Data splitting script
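A minimal sketch of such a split with scikit-learn (splitting the remaining 20% evenly between validation and test is an assumption here, as are the file and column names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/03_feature_bank/labelled_tweets.csv")  # assumed file name

# 80% train; the remaining 20% shared between validation and test,
# stratified on the topic label to preserve class proportions
train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=df["topic"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["topic"], random_state=42
)
print(len(train_df), len(val_df), len(test_df))
```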

Implementing focal loss with a custom trainer

Model training turned out to be less straightforward than I had anticipated, and this wasn’t because of the hardware requirements but rather the data itself. What I was dealing with was a highly imbalanced multiclass classification problem. Customer service observations were at least ten times as prevalent in the data set as the next most prominent class. This caused model performance to be overwhelmed by the customer service class, leading to low recall and precision for the less prominent classes.

I started with something simple, initially applying class weights with cross-entropy loss, but this didn’t do the trick. After a quick Google search I discovered that focal loss has been used successfully to tackle class imbalance. Focal loss reshapes the cross-entropy loss to “down-weight” the loss assigned to well-classified examples³.

The original paper on focal loss focused on dense object detection, a computer vision task with an extreme imbalance between a handful of foreground objects and a vast number of background regions. The image below, with its shallow depth of field, illustrates the idea: the foreground is prominent while the background fades away. This type of extreme imbalance between foreground and background is analogous to the class imbalance I had to deal with to classify the tweets.

Photo by Raphael Wild on Unsplash

Below I have laid out my implementation of focal loss within a custom trainer object.

Note that the class weights (alpha) are hard-coded. You will need to adjust these if you want to use this for your own purposes.

Implementation of the focal loss. Custom trainer is just standard trainer with added focal loss
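A condensed sketch of a custom trainer along those lines, subclassing the Hugging Face Trainer and overriding compute_loss (the alpha values below are placeholders, not the weights used in the project):

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class FocalLossTrainer(Trainer):
    """Standard Trainer with cross-entropy swapped for focal loss:
    FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)
    """

    def __init__(self, *args, alpha=None, gamma=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        # alpha: per-class weights (placeholder values; tune for your own data)
        self.alpha = alpha if alpha is not None else torch.tensor(
            [0.1, 1.0, 1.0, 1.0, 1.0, 1.0]
        )
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weights = self.alpha.to(logits.device)
        # Per-example cross entropy, then down-weight the easy examples
        ce = F.cross_entropy(logits, labels, weight=weights, reduction="none")
        p_t = torch.exp(-ce)
        loss = ((1 - p_t) ** self.gamma * ce).mean()
        return (loss, outputs) if return_outputs else loss
```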

Model Training

After a bit of customisation I was able to fit a model (and in under 7 minutes thanks to my GPU and CUDA). Focal loss vs. time gives us some evidence that the model was close to converging.

Image by Author: Focal loss vs time step
Model training script. Notice the custom trainer is imported and used here.
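A sketch of what such a training script might look like, assuming the custom trainer above, the same Twitter RoBERTa base checkpoint, and already tokenised splits (hyperparameter values are illustrative):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments)

MODEL_NAME = "cardiffnlp/twitter-xlm-roberta-base"   # assumed base checkpoint
NUM_LABELS = 6                                       # the six topics

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="data/04_model_output",
    num_train_epochs=5,                  # illustrative values, not the ones used
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = FocalLossTrainer(              # the custom trainer sketched above
    model=model,
    args=args,
    train_dataset=train_dataset,         # tokenised splits from the previous step
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```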

Model Performance

The model was assessed on the test data set, which included 525 randomly selected labelled examples. The performance appears impressive, with fairly high precision and recall across all classes. I would caveat that test performance is probably optimistic due to the small sample size, and there is likely to be more variance in the nature of these tweets outside of our sample. However, we are dealing with a relatively narrow domain (#royalmail), so variance is likely to be narrower than it would be for something more general purpose.

Image by Author: confusion matrix (test dataset)
Image by Author: model performance metrics on Test
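For reference, a sketch of the scikit-learn calls that produce these kinds of figures, assuming the trained trainer and tokenised test split from the sketches above (the label order is an assumption):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out test set with the trained Trainer
pred_output = trainer.predict(test_dataset)
y_pred = np.argmax(pred_output.predictions, axis=-1)
y_true = pred_output.label_ids

print(classification_report(y_true, y_pred, target_names=[
    "customer service", "politics", "royal reply", "jobs", "financial news", "philately"
]))
print(confusion_matrix(y_true, y_pred))
```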

To effectively visualize the wealth of information I gathered, I decided to create a sentiment map. By utilizing my trained model, I generated topics for tweets posted between January and March 2023. Additionally, I employed the pretrained twitter-roberta-base-sentiment model from Cardiff NLP to assess the sentiment of each tweet. To build the final web application, I used Streamlit.

script for generating the Streamlit web application
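A stripped-down sketch of such an app, assuming the streamlit-folium bridge and a CSV of model-scored tweets with latitude/longitude columns (all file and column names here are illustrative):

```python
import folium
import pandas as pd
import streamlit as st
from streamlit_folium import st_folium          # assumes the streamlit-folium package
from transformers import pipeline

st.title("#royalmail sentiment map")

# Tweets scored offline with the fine-tuned topic model; columns are assumptions
df = pd.read_csv("data/04_model_output/scored_tweets.csv")   # lat, lon, text, topic

# Sentiment from the pretrained Cardiff NLP model
sentiment = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
df["sentiment"] = [s["label"] for s in sentiment(df["text"].tolist())]

topic = st.selectbox("Topic", sorted(df["topic"].unique()))
subset = df[df["topic"] == topic]

colours = {"LABEL_0": "red", "LABEL_1": "gray", "LABEL_2": "green"}  # neg / neutral / pos
m = folium.Map(location=[54.5, -2.5], zoom_start=6)                 # centred on the UK
for _, row in subset.iterrows():
    folium.CircleMarker(
        location=[row["lat"], row["lon"]],
        radius=4,
        color=colours.get(row["sentiment"], "blue"),
        tooltip=row["text"][:140],
    ).add_to(m)

st_folium(m, width=700)
```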

The current app serves as a basic prototype, but it can be expanded to uncover more profound insights. I’ll briefly discuss a few potential extensions below:

  1. Temporal Filtering: Incorporate a date range filter, allowing users to explore tweets within specific time periods. This can help identify trends and changes in sentiment over time.
  2. Interactive Visualizations: Implement interactive charts and visualizations that enable users to explore relationships between sentiment, topics, and other factors in the dataset.
  3. Real-time Data: Connect the app to live Twitter data, enabling real-time analysis and visualization of sentiment and topics as they emerge.
  4. Advanced Filtering: Provide more advanced filtering options, such as filtering by user, hashtag, or keyword, to allow for more targeted analysis of specific conversations and trends.

By extending the app with these features, you can provide users with a more powerful and insightful tool for exploring and understanding sentiment and topics in tweets.

Thanks for reading!

[1] Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. Paperswithcode.com. https://paperswithcode.com/paper/bertopic-neural-topic-modeling-with-a-class

[2] Barbieri, F., Anke, L. E., & Camacho-Collados, J. (2022). XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. arXiv. https://arxiv.org/abs/2104.12250

[3] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018). Focal Loss for Dense Object Detection. Facebook AI Research (FAIR). https://arxiv.org/pdf/1708.02002.pdf [Accessed 21 Mar. 2023].



