Categorize Free-Text Bank Transaction Descriptions Using BERT | by Jin Cui | Jan, 2023


Expense by Category. Chart by author

I purchased a property towards the end of calendar year 2022 with a mortgage. Given the increase in financial commitments, I wanted to keep tabs on my expenses. It had never occurred to me before this point that I actually had no idea where I had been spending the most. Figuring this out seemed like a good starting point for my own expense management.

Naturally I turned to the bank transaction data, which I downloaded from the online banking portal in .csv format. A snippet of this for the last few days of 2022 is provided below.

Image 1: Writer’s bank transaction data. Image by author

Based on the snippet above, it seems I spent proportionally more on food (as highlighted in green). More importantly, the transaction descriptions are free text. Is there a way to automatically classify them into a number of pre-defined expense categories (e.g. food, grocery shopping, utilities, etc.)?

There is at least one way, using a pre-trained Large Language Model like BERT, and this article offers a tutorial on how!

Whilst ChatGPT, a state-of-the-art text generation model, is attracting a lot of attention at the moment, it is generally not considered a general-purpose model like BERT, which can be used across multiple Natural Language Understanding tasks. Some examples of these are grammar detection, sentiment classification, text similarity, and question-and-answer inference.

BERT was developed and released by Google in 2018. It is a model pre-trained on text passages from Wikipedia and BookCorpus (which helps ensure the training data are grammatically sound).

The BERT model I’ll be using for the purpose of this tutorial is available on Hugging Face through the sentence_transformers library, which is a Python framework for creating sentence, text and image embeddings.

How do I ultimately convert the free-text transaction descriptions into an expense category? There are a couple of strategies I can think of. In this tutorial, I’ll provide a step-by-step guide for building the Expense Classifier based on the (cosine) similarity of word embeddings. The steps are outlined below, followed by a brief illustration of cosine similarity:

  1. Manually label a credible number of transaction descriptions with an expense category (e.g. food, entertainment). This creates a set of labelled training data.
  2. Parse the individual transaction descriptions in the training data above as word embeddings using BERT (i.e. convert the texts into numerical vectors). Steps 1 and 2 collectively ensure that each training transaction is assigned both an expense category and a word embedding vector.
  3. Repeat Step 2 for new transaction descriptions (i.e. convert unseen texts into numerical vectors).
  4. Pair each word embedding in Step 3 with the most similar word embedding from the training data, and assign the same expense category.
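As a refresher on the matching criterion in Step 4, cosine similarity measures the angle between two embedding vectors and ranges from -1 to 1, with 1 meaning the vectors point in exactly the same direction. A tiny numerical illustration with made-up vectors (not part of the original article):

import numpy as np

# Two made-up embedding vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

# Cosine similarity = dot product divided by the product of the vector norms
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)  # 1.0, since v is a scaled copy of u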

This section sets out the Python code for loading the required packages as well as for implementing the steps outlined above (apart from Step 1, which is a manual labelling step).

Step 0: Import the required libraries

#for dataframe manipulation
import numpy as np
import pandas as pd

#regular expression toolkit
import re

#NLP toolkits
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

#for plotting expense categories later
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import matplotlib
import matplotlib.ticker as ticker # for formatting major units on x-y axis

#for downloading BERT
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

#for finding most similar text vectors
from sklearn.metrics.pairwise import cosine_similarity

Step 1: Label training data

I manually labelled 200 transaction descriptions with an expense category. For instance, the transaction descriptions in Image 1 were assigned the expense categories shown in the image below. I have also assigned categories such as utilities (i.e. for electricity and gas), car and gift to other transactions in the training data.

Image 3: Manual label of training data. Image by author
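For concreteness, below is a sketch of the structure of the labelled training dataframe referenced in the code that follows (df_transaction_description, with its Description and Class columns). The rows here are made up for illustration; the real training set is the 200 manually labelled transactions.

# Illustrative structure only: made-up descriptions, not actual transactions
df_transaction_description = pd.DataFrame({
    'Description': ['PIZZA HUT SYDNEY AUS',
                    'WOOLWORTHS 1234 SYDNEY AUS',
                    'AGL ENERGY DIRECT DEBIT'],
    'Class': ['Food', 'Groceries', 'Utilities']
})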

Step 2: Create word embeddings for training data using BERT

We start by defining a function for cleaning the text data. This includes lower-casing words and removing special characters and numbers, including dates (which are not useful in informing the expense category).

Stemming, lemmatization and stop-word removal, which are common practices in an NLP data-cleaning pipeline, are generally not required when using a BERT model, thanks to its subword tokenization and attention mechanisms.

###############################################
### Define a function for NLP data cleaning ###
###############################################

def clean_text_BERT(text):

    # Convert words to lower case
    text = text.lower()

    # Remove special characters, URLs, non-ASCII characters and numbers.
    # This also removes the dates, which are not important in classifying expenses
    text = re.sub(r'[^\w\s]|https?://\S+|www\.\S+|https?:/\S+|[^\x00-\x7F]+|\d+', '',
                  str(text).strip())

    # Tokenise, then re-join into a single cleaned string
    text_list = word_tokenize(text)
    result = ' '.join(text_list)
    return result

We then apply the function to the transaction descriptions, loaded as text_raw from the dataframe shown in Image 1 (df_transaction_description).

text_raw = df_transaction_description['Description']
text_BERT = text_raw.apply(lambda x: clean_text_BERT(x))

The snippet below shows an example of a particular transaction before and after data cleaning was applied.

Image 2: Data cleaning example. Image by author
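As an additional made-up illustration (not an actual transaction from my data), the cleaning function behaves roughly as follows:

# Hypothetical example of the cleaning step
clean_text_BERT('05/12/2022 PIZZA *HUT #4217 SYDNEY AUS $35.90')
# returns 'pizza hut sydney aus'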

We then run the cleaned texts through BERT. I’ve selected the ‘paraphrase-mpnet-base-v2’ sentence-transformers model, which is designed for modelling sentence similarity. Per its documentation on Hugging Face, it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

######################################
### Download pre-trained BERT model ###
######################################

# This may take some time to download and run
# depending on the size of the input

bert_input = text_BERT.tolist()
model = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings = model.encode(bert_input, show_progress_bar = True)
embedding_BERT = np.array(embeddings)

A snippet of the word embeddings for the first few transactions is provided below:

Image 4: BERT embeddings. Image by author
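To reproduce a view like Image 4, the embedding matrix can be wrapped in a dataframe, with one row per transaction and one column per embedding dimension (768 in total). The dataframe name below is my own, not from the original code.

# Inspect the training embeddings: one row per transaction, 768 columns
df_embedding_bert = pd.DataFrame(embedding_BERT)
print(df_embedding_bert.shape)
df_embedding_bert.head()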

Step 3: Create word embeddings for unseen data

I’ve selected 20 transactions from the data that were not included in the training data (randomly selected for the purpose of this tutorial). These are shown in the image below.

Image 5: Unseen transactions. Image by author

The transaction descriptions above are loaded as text_test_raw. Similar to Step 2, these are run through BERT for embedding.

# Load texts
text_test_raw = df_transaction_description_test['Test']

# Apply data cleaning function as for training data
text_test_BERT = text_test_raw.apply(lambda x: clean_text_BERT(x))

# Apply BERT embedding
bert_input_test = text_test_BERT.tolist()
#model = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings_test = model.encode(bert_input_test, show_progress_bar = True)
embedding_BERT_test = np.array(embeddings_test)

df_embedding_bert_test = pd.DataFrame(embeddings_test)

Step 4: Pair unseen data with most similar training data


# For each unseen transaction, find the most similar word embedding in the training data

similarity_new_data = cosine_similarity(embedding_BERT_test, embedding_BERT)
similarity_df = pd.DataFrame(similarity_new_data)

# Returns index for most similar embedding
# See first column of the output dataframe below
index_similarity = similarity_df.idxmax(axis = 1)

# Return dataframe for most similar embedding/transactions in training dataframe
data_inspect = df_transaction_description.iloc[index_similarity, :].reset_index(drop = True)

unseen_verbatim = text_test_raw
matched_verbatim = data_inspect['Description']
annotation = data_inspect['Class']

d_output = {
    'unseen_transaction': unseen_verbatim,
    'matched_transaction': matched_verbatim,
    'matched_class': annotation
}

df_output = pd.DataFrame(d_output)

The df_output dataframe shows that the unseen data have been assigned fairly reasonable expense categories.

Image 6: Unseen data matched with training data. Image by author

Now whenever new expenses come through, simply feed them to the model!
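To make that repeatable, the cleaning, embedding and matching steps can be wrapped in a small helper. Below is a minimal sketch; the function name, the min_similarity threshold and the 'Unknown' fallback are my own additions rather than part of the original workflow:

def categorise_transactions(new_descriptions, min_similarity = 0.5):
    # Clean and embed the new transaction descriptions
    cleaned = [clean_text_BERT(d) for d in new_descriptions]
    new_embeddings = np.array(model.encode(cleaned))

    # Compare each new embedding against every training embedding
    sims = cosine_similarity(new_embeddings, embedding_BERT)
    best_match = sims.argmax(axis = 1)
    best_score = sims.max(axis = 1)

    # Assign the matched category, falling back to 'Unknown' for weak matches
    matched_classes = df_transaction_description['Class'].iloc[best_match].values
    return pd.DataFrame({
        'transaction': list(new_descriptions),
        'predicted_class': np.where(best_score >= min_similarity, matched_classes, 'Unknown'),
        'similarity': best_score
    })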

Bonus Step: Plotting expenses by category

I have applied the steps above to all of my expenses in calendar year 2022. The plot below shows the resulting expense dollar amounts by assigned category.
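The plotting code isn’t reproduced in the article, but given the matplotlib, seaborn and ticker imports in Step 0, a minimal sketch could look like the following. It assumes the categorised 2022 transactions sit in a dataframe named df_2022 with the assigned Class plus an Amount column; the df_2022 and Amount names are my assumptions.

# Aggregate spend by assigned category and plot (df_2022 and Amount are assumed names)
expense_by_category = (df_2022
                       .groupby('Class')['Amount']
                       .sum()
                       .sort_values(ascending = False))

fig, ax = plt.subplots(figsize = (10, 6))
sns.barplot(x = expense_by_category.values, y = expense_by_category.index, ax = ax)
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('${x:,.0f}'))  # format x-axis as dollars
ax.set_xlabel('Total spend')
ax.set_ylabel('Expense category')
plt.tight_layout()
plt.show()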

Chart 7: Expense plot by category. Chart by author

Key observations were:

  • I spent the most on Food in 2022, followed by Mortgage repayments and Utility bills.
  • Although Credit Card repayments had the highest amount, it is assumed that the credit card spending can be attributed to the other expense categories in roughly the same proportions. This assumption also applies to the PayPal category.
  • Based on the data, I probably want to cut back on Food spending in favour of Groceries (i.e. start cooking at home as opposed to dining out) in 2023.
  • My spending on Beauty products was probably driven by instances where I went shopping with the wife…

In addition, it’s super easy to return the transactions with the highest spending within a particular category. For instance, my highest spending in the Food expense category in 2022 is shown in the screen print below. I’m happy with the results, as some of these restaurants weren’t present in the training data; despite this, BERT was still able to allocate these transactions to the Food category.

Image 7: Top expenses. Image by author
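Producing a view like the one above amounts to a simple filter-and-sort. A minimal sketch, again assuming a categorised dataframe df_2022 with Description, Class and Amount columns (df_2022 and Amount being my assumed names):

# Top 10 Food transactions by dollar amount
top_food = (df_2022[df_2022['Class'] == 'Food']
            .sort_values('Amount', ascending = False)
            .head(10))
print(top_food[['Description', 'Amount']])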

This article has provided a tutorial for building an expense-tracking tool. All I’ve really done is translate the free-text transaction descriptions into a language the machine understands using BERT, and let the machine do the hard yards!

An alternative approach is to replace Step 4 of this tutorial by passing the same word embeddings through a classification model — something for readers to experiment with further.
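For readers who want to try that route, here is a minimal sketch using a scikit-learn logistic regression fitted on the same training embeddings; the choice of classifier is mine, not the article’s:

from sklearn.linear_model import LogisticRegression

# Train a classifier on the labelled training embeddings
clf = LogisticRegression(max_iter = 1000)
clf.fit(embedding_BERT, df_transaction_description['Class'])

# Predict expense categories for the unseen transactions
predicted_classes = clf.predict(embedding_BERT_test)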

If you like this article of mine, feel free to have a read of the others.

