The Power of Transformers in Predicting Twitter Account Identities
Leveraging Large Language Models for Advanced NLP
How to Use State-of-the-Art Models for Accurate Text Classification
This project aims to build a model capable of predicting the identity of an account from its tweets. I will walk through the steps I have taken, from data processing to fine tuning and performance evaluation of the models.
Before proceeding, I would caveat that identity here is defined as male, female, or brand. This in no way reflects my views on gender identity; this is simply a toy project demonstrating the power of transformers for sequence classification. In some of the code snippets you may notice that gender is used where we are referring to identity; this is simply how the data arrived.
Due to the complex nature of text data and the non-linear relationships being modelled, I ruled out simpler methods and chose to leverage pretrained transformer models for this project.
Transformers are the current state of the art for natural language processing and understanding tasks. The Transformers library from Hugging Face gives you access to thousands of pre-trained models, along with APIs to perform your own fine tuning. Most of the models have been trained on large text corpora, some across multiple languages. Without any fine tuning, they have been shown to perform very well on similar text classification tasks, including sentiment analysis, emotion detection, and hate speech recognition.
I chose two models to fine tune along with a zero-shot model as a baseline for comparison.
Zero-shot learning gives a baseline estimate of how powerful a transformer can be without fine-tuning on your particular classification task.
Notebooks, Models & Repos
Due to computational cost I can’t make the training scripts interactive. However, I have made the performance analysis notebook and models available to you. You can try the models yourself with live tweets (see the usage sketch after the links below)!
📒Notebook: Model performance analysis Jupyter notebook
🤗Finetuned Distilbert-Base-Multilingual-Cased: Model 1
🤗Finetuned Albert-base-v2 : Model 2
💻Github repository : Training Scripts
💾Data Source: Kaggle
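If you’d like to try one of the fine-tuned checkpoints on a live tweet, inference looks roughly like this. The model identifier is a placeholder for whichever of the two checkpoints linked above you download, and the label names you get back depend on how the checkpoint’s label mapping was saved.

```python
from transformers import pipeline

# Placeholder identifier; substitute the fine-tuned checkpoint linked above
classifier = pipeline("text-classification", model="path/to/finetuned-checkpoint")

# Classify a single tweet; label names depend on the checkpoint's id2label mapping
print(classifier("Excited to announce our new product launch this Friday!"))
```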
The data was provided by the Data For Everyone Library on CrowdFlower. You can download the data yourself on Kaggle⁴.
Note: The data has a public domain license⁴.
In total there are around 20k records containing usernames, tweets, user descriptions, and other Twitter profile information. Although time constraints have not allowed me to check in detail, it is clear from a quick inspection that the tweets are multilingual. However, the tweet text is messy, with URLs, ASCII artefacts, and special characters. This is to be expected from social media data; fortunately, it is trivial to clean with regular expressions.
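As an illustration, a minimal cleaning function along these lines might look as follows; the exact patterns used in the project may differ.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, and special-character noise from tweet text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"[^\w\s#']", " ", text)              # drop remaining special characters
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_tweet("Loving the new update!! 🎉 https://t.co/xyz @someuser #tech"))
# -> "Loving the new update #tech"
```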
Profile image data is supplied in the form of URL links to image files. However, many of these links are corrupted and therefore not useful for this prediction task. Ordinarily one might expect profile images to be a great predictor of the identity of an account holder, but in this case the data quality issues were too vast to overcome. I therefore decided to use the tweet text and user descriptions for modelling.
Missing & Unknown Variables
There is an identity label provided for most accounts. The label is well populated and has the values female, male, brand, and unknown; only 5.6% of all accounts are labelled unknown. Accounts where the identity label was unknown were removed from the analysis, as it is impossible to train or test on them.
Approximately 19% of user descriptions were blank. Having a blank user description might itself signal something about the account holder’s identity, so where the description was blank, I simply imputed some placeholder text indicating this to allow the model to learn from these cases.
Expanding the Data
To create more examples for the model to learn from, I combined the user descriptions and tweet text into a general Twitter text field, effectively doubling the number of text samples.
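A rough sketch of how the imputation and expansion could be done with pandas; the file name and column names (`description`, `text`, `gender`) are assumptions based on the Kaggle dataset.

```python
import pandas as pd

# Hypothetical file name; use the CSV downloaded from Kaggle
df = pd.read_csv("twitter_user_gender.csv", encoding="latin-1")

# Impute placeholder text where the user description is blank
df["description"] = df["description"].fillna("no description provided")

# Treat tweet text and user description as separate samples sharing the same label
tweets = df[["text", "gender"]].rename(columns={"text": "twitter_text"})
descriptions = df[["description", "gender"]].rename(columns={"description": "twitter_text"})
expanded = pd.concat([tweets, descriptions], ignore_index=True)
```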
Train, Validation, Test
I split the data into 70% training, 15% validation, and 15% testing. To ensure no overlap, if an account appeared multiple times in the data, all of its instances were automatically assigned to the training data set. Otherwise, accounts were allocated randomly to each data set according to the proportions stated.
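A minimal sketch of this kind of account-aware split, assuming the raw DataFrame from the previous snippet with a `name` column identifying the account:

```python
import numpy as np

rng = np.random.default_rng(42)

# Accounts appearing more than once always go to training to avoid leakage
counts = df["name"].value_counts()
repeat_accounts = set(counts[counts > 1].index)

# Remaining accounts are shuffled and allocated roughly 70/15/15
single_accounts = counts[counts == 1].index.to_numpy()
rng.shuffle(single_accounts)
n = len(single_accounts)
val_accounts = set(single_accounts[: int(0.15 * n)])
test_accounts = set(single_accounts[int(0.15 * n): int(0.30 * n)])

def assign_split(account: str) -> str:
    if account in repeat_accounts:
        return "train"
    if account in val_accounts:
        return "validation"
    if account in test_accounts:
        return "test"
    return "train"

df["split"] = df["name"].map(assign_split)
```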
Fine tuning was completed on each model separately and required a GPU to be practically achievable. My laptop’s GPU is an Nvidia GeForce RTX 2060.
Although this is considered high spec for a personal laptop, I found performance suffered on some of the larger language models, ultimately limiting the set of models I could experiment with.
To fully utilise my GPU, I had to install the appropriate CUDA toolkit for my GPU and the version of PyTorch I was using.
CUDA is a platform that enables your computer to run massively parallel computations on the GPU. This can drastically speed up the time it takes to fine tune transformers.
It isn’t advisable to run this type of fine tuning without a CUDA-enabled GPU, unless you’re happy to leave your machine running for what could be days.
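Before launching a training run it is worth a quick sanity check that PyTorch can actually see the GPU:

```python
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 2060
    device = torch.device("cuda")
else:
    print("No CUDA device found; fine tuning will be very slow on CPU")
    device = torch.device("cpu")
```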
Python Packages
All steps of the modelling process were scripted in Python. I leveraged the open-source Transformers library from Hugging Face. I find this library to be well maintained, with ample documentation available for guidance on best practices.
For model performance testing, I used the open-source machine learning and data wrangling tools commonly used by data scientists. The key packages are Transformers, scikit-learn, pandas, NumPy, Seaborn, Matplotlib, and PyTorch.
Environment Management
I used Anaconda as my primary environment manager, creating a Conda virtual environment to install all software dependencies. I would strongly advise this approach due to the large number of potentially conflicting dependencies.
The models were fine tuned by training on the training data set and evaluating performance on the validation set. I configured the fine tuning process to return the best model according to performance on the validation data set.
Since this is a multiclass classification problem, the loss being minimised is the cross-entropy loss; better model performance essentially means a lower cross-entropy loss on the validation set. Hyperparameters for the candidate models were set identically to aid comparison.
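A minimal sketch of this setup using the Hugging Face `Trainer`; the batch size, epoch count, and the pre-tokenised `train_ds`/`val_ds` datasets are placeholders rather than the exact values used in the project.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

args = TrainingArguments(
    output_dir="identity-classifier",
    evaluation_strategy="epoch",        # score the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # return the checkpoint with the lowest...
    metric_for_best_model="eval_loss",  # ...validation cross-entropy loss
    greater_is_better=False,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer)
trainer.train()
```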
I begin my analysis by performing a zero-shot classification to give a baseline against which to assess the fine-tuned models. The reference paper for this model suggests it can perform inference in 100+ languages¹, which appears to be excellent coverage for our problem.
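Zero-shot classification with the Transformers pipeline looks roughly like this; the specific multilingual NLI checkpoint shown is an assumption, chosen as an example of the model family described in the reference¹.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

result = classifier("Just dropped our spring collection, link in bio!",
                    candidate_labels=["male", "female", "brand"])
print(result["labels"][0], result["scores"][0])  # top label and its score
```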
Distilbert-base-multilingual-cased has been trained on 104 different languages, also providing great coverage. The model is cased, so it can distinguish capitalised from non-capitalised text.
Model (pre)-training: The model has been pretrained on a concatenation of Wikipedia pages.
Model architecture: Transformer-based language model with 6 layers, 768 hidden dimensions, and 12 heads, totalling 134 million parameters.
Fine tuning: Model fine tuning took approximately 21 minutes on my hardware. Reviewing the evaluation loss vs. training step chart, there is some evidence to suggest the model had converged.
The model has been pretrained on English text and is uncased, meaning it retains no information about capitalisation. Albert was specifically designed to address the memory limitations that occur when training larger models. The model uses a self-supervised loss that focuses on modelling inter-sentence coherence.
Model (pre)-training: Albert was pretrained on the BOOKCORPUS and English Wikipedia to achieve its baseline.
Model architecture: transformer-based language model with 12 repeating layers, an embedding size of 128, a hidden size of 768, 12 heads, and 11 million parameters.
Fine tuning: Model fine tuning took approximately 35 minutes to complete. Convergence appears likely, as indicated by the loss metric reaching a trough.
Given that this is a multiclass learning task, I have assessed model performance using F1, recall, precision, and accuracy at both the individual class and global level. Performance metrics were scored on the test data set.
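Per-class and global metrics of this kind can be computed with scikit-learn; `y_true` and `y_pred` below stand in for the test-set labels and model predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_true, y_pred))
# target_names must match the order of the integer label encoding
print(classification_report(y_true, y_pred,
                            target_names=["female", "male", "brand"],
                            digits=3))
```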
Accuracy scores were 37% for zero-shot, 59% for Albert and 59% for Distilbert overall.
Observations
Overall, both Albert and Distilbert performed better on the test set than the zero-shot classification baseline. This is the result I was expecting, given that the zero-shot model holds no knowledge of the classification task at hand. I believe this is further evidence that there is merit in fine tuning your model.
Although there are notable performance differences, we can’t definitively say which is better between the two fine tuned models until we have a prolonged test period of these models in the wild.
Notable performance differences
Albert appeared to be more confident in its predictions, with a 75th-percentile overall prediction confidence of 82% compared to Distilbert’s 66%.
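Prediction confidence here can be read as the maximum softmax probability assigned to the predicted class; a sketch of how such a percentile could be computed, assuming a detached CPU tensor of raw test-set logits:

```python
import numpy as np
import torch

# logits: raw model outputs on the test set, shape (n_samples, 3)
probs = torch.softmax(logits, dim=-1)
confidence = probs.max(dim=-1).values.numpy()  # confidence of the predicted class
print(np.percentile(confidence, 75))           # 75th percentile of confidence
```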
All models had low precision, recall, and F1 for predicting a male identity. This might be due to wider variation in male tweets compared with female and brand.
All models had high performance scores on predicting brands relative to the other identities. Also, models had notably higher confidence in predicting brands than they did for predicting male or female users. I would imagine this is due to the standardised way brands put out their messaging on social media relative to personal users.
I would recommend the following to improve model performance:
Increased training examples
More data can help the model generalise better, improving overall performance. There was certainly evidence of overfitting: model performance on the evaluation set began to suffer while performance on the training set continued to improve. More data would help to alleviate this somewhat.
Overfitting was more of an issue with the Distilbert model than with Albert due to its larger size. Larger language models are more flexible but can also be more prone to overfitting.
Fine tuning the twitter-xlm-roberta-base model on multiple GPUs to achieve convergence
There is a model by Cardiff NLP that is explicitly pretrained on Twitter text and is multilingual. I did attempt to fine tune this model but was limited by hardware. The model is large at 198M parameters and took almost 4 hours to run while showing no signs of convergence. In theory, Roberta should greatly outperform Distilbert and Albert due to its pre-training on Twitter data. However, more data would be required to prevent likely overfitting on this larger model.
Explore the potential of multi-modal transformer architectures.
If we could improve the quality of the profile picture data, I think a combination of tweet text and image could significantly improve the performance of our classifier.
Thanks for reading
[1] Laurer, M., van Atteveldt, W., Salleras Casas, A., & Welbers, K. (2022). Less Annotating, More Classifying — Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT — NLI [Preprint]. Open Science Framework.
[2] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[3] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. CoRR, abs/1909.11942. http://arxiv.org/abs/1909.11942
[4] Twitter User Gender Classification. Kaggle. Retrieved March 15, 2023, from https://www.kaggle.com/datasets/crowdflower/twitter-user-gender-classification