Step by Step Basics: Text Classifier | by Lucy Dickinson | Feb, 2023

Photo by Patrick Tomasso on Unsplash

Let’s cut to the chase. There are a lot of steps involved in building a text classifier and understanding the world of Natural Language Processing (NLP), and these steps have to be implemented in a specific order. There are even more steps required if the target class in the data is imbalanced. Learning all of this from scratch can be a bit of a minefield. There are plenty of learning resources online, yet finding a holistic guide that covers everything at a high level proved tricky. So, I am writing this article to hopefully provide some transparency on this process with an easy 10-step guide.

I’m going to start with providing a flow diagram that I’ve compiled with all the necessary steps and key points to understand, all the way from clarifying the task to deploying a trained text classifier.

First of all, what is a text classifier?

A text classifier is an algorithm that learns the presence or pattern of words to predict some kind of target or outcome, usually a category such as whether an email is spam or not.

It is important to mention here that I will be focussing on building a text classifier using Supervised Machine Learning methods. An alternative approach is to use Deep Learning methods such as Neural Networks.

Let’s take a peek at that flow diagram.

Diagram by author

There’s a lot to digest there. Let’s break it up into bitesize chunks and walk through each section.

1. Clarify the task

This is one of the most important steps of any data science project. Ensure that you have fully grasped the question being asked. Do you have the relevant data available to answer it? Does your methodology align with what the stakeholder is expecting? If you need stakeholder buy-in, don’t go building some super complex model that will be hard to interpret. Start simple and bring everyone along on that journey with you.

2. Data quality checks

Another essential step in any project. Your model will only be as good as the data that goes in, so make sure duplicates are removed and missing values are dealt with appropriately.
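
A minimal sketch of those checks, assuming the raw data sits in a pandas DataFrame df with a ‘text’ column and a ‘target’ label column:

import pandas as pd

df = df.drop_duplicates(subset='text') # remove duplicate narratives
df = df.dropna(subset=['text', 'target']) # drop rows with missing text or labels
df = df.reset_index(drop=True) # tidy the index after dropping rows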

3. Exploratory Data Analysis (EDA)

Now we can move on to some analysis specific to text data. EDA is all about understanding the data and getting a feel for what you can derive from it. One of the key points of this step is understanding the target class distribution. You can use the pandas .value_counts() method or plot a bar chart to visualise the distribution of each class within the dataset, as sketched below. You’ll be able to see which are the majority and minority classes.

Imbalanced class distribution of a binary labelled dataset
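
For example, assuming the same df as above, the class distribution can be checked like so:

import matplotlib.pyplot as plt

print(df['target'].value_counts()) # absolute counts per class
print(df['target'].value_counts(normalize=True)) # proportions per class

df['target'].value_counts().plot(kind='bar') # bar chart of the class distribution
plt.show()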

Models do not perform well with imbalanced data. The model will often ignore the minority class(es), as there simply is not enough data to train the model to detect them. Thankfully, it’s not the end of the world if you find yourself with an imbalanced dataset that is heavily skewed towards one of your target classes; that’s in fact quite normal. It’s just important to know this ahead of your model building process so you can adjust for it later on.

The presence of an imbalanced dataset should also get you thinking about which metrics you should use to assess model performance. In this instance, ‘accuracy’ (proportion of correct predictions) really isn’t your friend. Let’s say you have a dataset with a binary target class where 80% of data is labelled ‘red’ and 20% is labelled ‘blue’. Your model could simply predict ‘red’ for the entire test set and still be 80% accurate. Hence, the accuracy of a model may be misleading, given that your model could simply predict the majority class.

Some better metrics to use are recall (the proportion of actual positives that were correctly predicted), precision (the proportion of positive predictions that were correct), or the harmonic mean of the two, the F1 score. Pay close attention to these scores for your minority classes once you’re in the model building stage, as it’ll be these scores that you’ll want to improve.
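
Once you have a model’s predictions in hand (step 7 onwards), sklearn computes all three metrics in one call; a minimal sketch, assuming arrays y_test of true labels and y_pred of predicted labels:

from sklearn.metrics import classification_report

# prints precision, recall and F1 for every class, plus macro and weighted
# averages, so minority-class scores are easy to pick out
print(classification_report(y_test, y_pred))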

4. Text pre-processing

Now on to some fun stuff! Text data can contain a whole load of content that just isn’t useful to a machine learning model (depending on the nature of the task). This process is really about removing the ‘noise’ in your dataset: homogenising words and stripping the text back to the bare bones so that only the useful words, and ultimately features, remain.

Generally, you’ll want to remove punctuation, special characters, stop-words (words like ‘this’, ‘the’, ‘and’) and reduce each word down to its lemma or stem. You can play around with making your own functions to get an idea of what’s in your data before cleansing it. Take the function below for example:

#  exploring patterns in the text to assess how best to cleanse the data
pat_list = [r'\d', '-', r'\+', ':', '!', r'\?', r'\.', '\\n'] # list of special characters/punctuation to search for in data

def punc_search(df, col, pat):
    """
    counts the number of narratives that contain
    a pre-defined list of special characters
    and punctuation
    """
    for p in pat:
        v = df[col].str.contains(p).sum() # total n_rows that contain the pattern
        print(f'{p} special character is present in {v} entries')

punc_search(df, 'text', pat_list)

# the output will look something like this:

"""
\d special character is present in 12846 entries
- special character is present in 3141 entries
\+ special character is present in 71 entries
: special character is present in 1874 entries
! special character is present in 117 entries
\? special character is present in 53 entries
\. special character is present in 16962 entries
\n special character is present in 7567 entries
"""

Then when you’ve got a better idea of what needs to be removed from your data, have a go at writing a function that does it all for you in one go:

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # initiating lemmatiser object

def text_cleanse(df, col):
    """
    cleanses text by removing special
    characters and lemmatising each
    word
    """
    df[col] = df[col].str.lower() # convert text to lowercase
    df[col] = df[col].str.replace(r'-', '', regex=True) # remove hyphens to join hyphenated words together
    df[col] = df[col].str.replace(r'\d', '', regex=True) # remove numbers
    df[col] = df[col].str.replace(r'\\n', '', regex=True) # remove new line symbols
    df[col] = df[col].str.replace(r'\W', ' ', regex=True) # replace special characters with a space (replacing with '' would merge neighbouring words together)
    df[col] = df[col].str.replace(r'\s+[a-zA-Z]\s+', ' ', regex=True) # remove single characters
    df[col] = df.apply(lambda x: nltk.word_tokenize(x[col]), axis=1) # tokenise text ready for lemmatisation
    df[col] = df[col].apply(lambda x: [lemmatizer.lemmatize(word, 'v') for word in x]) # lemmatise words; the 'v' argument lemmatises verbs (e.g. turns the past participle of a verb into present tense)
    df[col] = df[col].apply(lambda x: " ".join(x)) # de-tokenise text ready for vectorisation

You can then run the first function again on the cleansed data to check that everything you wanted removed has indeed been removed.

For those who noticed that the functions above don’t remove any stop-words, well spotted. You can remove stop-words during the vectorisation process in a few steps’ time.

5. Train-test split

This gets its own sub-heading because it is so important to do this step BEFORE you start fiddling with the features. Split your data using sklearn’s train_test_split() function and then leave the test data alone so there’s no risk of data leakage.

If your data are imbalanced, there are optional arguments (‘shuffle’ and ‘stratify’) that you can specify within the train-test split to ensure an even split across your target classes. This ensures that your minority classes aren’t concentrated exclusively in either your training or test set.

from sklearn.model_selection import train_test_split

# create train and test data split
X_train, X_test, y_train, y_test = train_test_split(df['text'], # features
                                                    df['target'], # target
                                                    test_size=0.3, # 70% train 30% test
                                                    random_state=42, # ensures same split each time to allow repeatability
                                                    shuffle=True, # shuffles data prior to splitting
                                                    stratify=df['target']) # preserves the distribution of classes across train and test

6. Text vectorisation

Models cannot interpret words. Instead, the words have to be converted into numbers using a process known as vectorisation. There are two broad approaches to vectorisation: Bag of Words and Word Embeddings. Bag of Words methods look for exact matches of words between texts, whereas Word Embedding methods take word context into account and so can match similar words between texts. An interesting article comparing the two methods can be found here.

For the Bag of Words method, text is tokenised and each unique word in the dataset becomes a feature. Each feature is assigned either an integer count of how many times that word appears in the text (a Word Count Vector, sklearn’s CountVectorizer()) or a weighted value that indicates the importance of the word in the text (a TF-IDF Vector, sklearn’s TfidfVectorizer()). A useful article explaining TF-IDF vectorisation can be found here.

Be sure to fit the vectoriser on the training data only, then use the fitted vectoriser to transform the test data; fitting on the full dataset would leak information from the test set.
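
A minimal sketch of that fit/transform pattern, which also drops English stop-words as promised in step 4 (assuming X_train and X_test from the split above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english') # removes stop-words during vectorisation
X_train_vector = vectorizer.fit_transform(X_train) # learn the vocabulary and weights on the training data only
X_test_vector = vectorizer.transform(X_test) # apply the fitted vocabulary to the test data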

7. Model selection

It’s a good idea to try out a few classification models to see which performs best with your data. You can then use performance metrics to select the most appropriate model to optimise. I did this by running a for loop which iterated over each model using the cross_validate() function.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from tqdm import tqdm

#  defining models and associated parameters
models = [RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
          LinearSVC(random_state=42),
          MultinomialNB(),
          LogisticRegression(random_state=42)]

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1) # with StratifiedKFold, the folds preserve the percentage of samples for each class

scoring = ['accuracy', 'f1_macro', 'recall_macro', 'precision_macro']

# iterate over the models and print the cross-validated metrics for each
for model in tqdm(models):
    model_name = model.__class__.__name__
    result = cross_validate(model, X_train_vector, y_train, cv=kf, scoring=scoring)
    print("%s: Mean Accuracy = %.2f%%; Mean F1-macro = %.2f%%; Mean recall-macro = %.2f%%; Mean precision-macro = %.2f%%"
          % (model_name,
             result['test_accuracy'].mean()*100,
             result['test_f1_macro'].mean()*100,
             result['test_recall_macro'].mean()*100,
             result['test_precision_macro'].mean()*100))

8. Baseline model

Before you get carried away with tweaking your chosen model’s hyperparameters in a bid to get those performance metrics up, STOP. Make a note of your model’s performance before you start optimising it. You’ll only be able to know (and prove) that your model improved by comparing it to the baseline scores. It also helps with stakeholder buy-in and storytelling if you’re asked to walk through your methodology.

Create an empty DataFrame, then after each model iteration, append your metric(s) of choice along with the number or name of the iteration so you can clearly see how your model progressed through your optimisation attempts.
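
A minimal sketch of that logging pattern (the column names and appended scores here are purely illustrative):

import pandas as pd

results = pd.DataFrame(columns=['iteration', 'f1_macro']) # empty results log

# after each iteration, append the iteration name and your metric of choice
results.loc[len(results)] = ['baseline', result['test_f1_macro'].mean()]
results.loc[len(results)] = ['class_weight=balanced', 0.71] # hypothetical score from a later tuning attempt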

9. Model tuning — rectifying imbalanced data

Generally, fine tuning your model might involve tweaking its hyperparameters and feature engineering with the aim of improving the model’s predictive capability. For this section however, I’m going to focus on the techniques that can be used to reduce the effect of class imbalance.

Short of collecting more data for the minority classes, there are five methods (that I know of) that you can use to address class imbalance. Most are a form of resampling, aiming either to oversample the minority class(es) or undersample the majority class(es) to even out the overall class distribution.

Let’s take a quick look at each method:

1. Adding a minority class penalty

Many classification algorithms have a parameter, usually called ‘class_weight’, that you can specify when training the model. This is essentially a penalty function: misclassifying a minority class incurs a higher penalty, which deters the model from ignoring that class. You can either opt for an automated setting or, in some cases, manually assign a penalty per class. Be sure to read the documentation for the algorithm you’re using.
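
As a sketch in sklearn (other libraries have analogous parameters), the ‘balanced’ option weights classes inversely to their frequencies, while a dict assigns penalties manually:

from sklearn.linear_model import LogisticRegression

# automated: weights inversely proportional to class frequencies
lr_balanced = LogisticRegression(class_weight='balanced', random_state=42)

# manual: misclassifying class 1 costs five times more than class 0
lr_manual = LogisticRegression(class_weight={0: 1, 1: 5}, random_state=42)

lr_balanced.fit(X_train_vector, y_train)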

2. Oversample minority class

Random oversampling involves randomly duplicating examples from the minority class(es) and adding them to the training dataset to create a uniform class distribution. This method can lead to overfitting as no new data points are being generated, so be sure to check for this.

The Python library imblearn contains functions for oversampling and undersampling data. It is important to know that any oversampling or undersampling techniques must be applied to the training data only.

If you are using cross-validation to fit the model, you will need to use a pipeline to ensure that only the training folds are oversampled. The Pipeline() class can be imported from the imblearn library.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV

over_pipe = Pipeline([('RandomOverSample', RandomOverSampler(random_state=42)),
                      ('LinearSVC', LinearSVC(random_state=42))])

params = {"LinearSVC__C": [0.001, 0.01, 0.1, 1, 10, 100]}

svc_oversample_cv = GridSearchCV(over_pipe,
                                 param_grid=params,
                                 cv=kf,
                                 scoring='f1_macro',
                                 return_train_score=True).fit(X_train_vector, y_train)
svc_oversample_cv.best_score_ # best mean cross-validated f1_macro score

3. Undersample majority class

An alternative to the above is to undersample the majority class rather than oversample the minority class. Some might argue that it’s never worth discarding data you have, but it could be an option worth trying for yourself. Again, the imblearn library has undersampling functions to use.
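
Mirroring the oversampling pipeline above, a minimal sketch using imblearn’s RandomUnderSampler:

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

under_pipe = Pipeline([('RandomUnderSampler', RandomUnderSampler(random_state=42)),
                       ('LinearSVC', LinearSVC(random_state=42))])

under_result = cross_validate(under_pipe, X_train_vector, y_train, cv=kf, scoring='f1_macro')
under_result['test_score'].mean() # mean f1_macro across the folds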

4. Synthesise new instances of minority class

New instances of the minority classes can be generated using a process called SMOTE (Synthetic Minority Oversampling Technique), which again can be implemented using the imblearn library. There is a great article here that provides some examples of implementing SMOTE.
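
A minimal sketch, simply swapping RandomOverSampler for SMOTE in the earlier pipeline (SMOTE interpolates between existing minority-class vectors rather than duplicating them):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

smote_pipe = Pipeline([('SMOTE', SMOTE(random_state=42)),
                       ('LinearSVC', LinearSVC(random_state=42))])

smote_result = cross_validate(smote_pipe, X_train_vector, y_train, cv=kf, scoring='f1_macro')
smote_result['test_score'].mean() # mean f1_macro across the folds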

5. Text augmentation

New data can be generated using synonyms of existing data to increase the number of data points in the minority classes. Methods include synonym replacement and back translation (translating text into another language and back into the original language). The nlpaug library is handy for exploring these options.
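
A minimal sketch of synonym replacement with nlpaug (this assumes the WordNet corpus has been downloaded via nltk, and the example sentence is purely illustrative; recent versions of the library return a list):

import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src='wordnet') # replaces random words with WordNet synonyms
augmented = aug.augment('the pump failed to start during the morning inspection')
print(augmented) # a paraphrased copy that could be appended to the minority class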


