Primer on Cleaning Text Data

by Seungjun (Josh) Kim | Sep 2022


Cleaning text is an important part of NLP pre-processing

Free for Use Photo from Pexels

In the field of Natural Language Processing (NLP), pre-processing is an important stage where things like text cleaning, stemming, lemmatization, and Part of Speech (POS) Tagging take place. Among these various facets of NLP pre-processing, I will be covering a comprehensive list of text cleaning methods we can apply. Text cleaning here refers to the process of removing or transforming certain parts of the text so that the text becomes more easily understandable for NLP models that are learning the text. This often enables NLP models to perform better by reducing noise in text data.

Python strings come with various useful built-in methods. The lower method is one of them; it turns all characters in a string into lowercase.

def make_lowercase(token_list):
    # Assuming word tokenization already happened
    # Loop through every word/token, make it lowercase and add it to a new list
    words = [word.lower() for word in token_list]
    # Join the lowercase tokens back into one string
    cleaned_string = " ".join(words)
    return cleaned_string
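
A quick usage example of the function above, with an assumed toy token list:

make_lowercase(["Cleaning", "Text", "Is", "Important"])
>> 'cleaning text is important'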

string.punctuation, which lives in Python's built-in string module, contains the following punctuation characters.

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

import string

text = "It was a great night! Shout out to @Amy Lee for organizing wonderful event (a.k.a. on fire)."
PUNCT_TO_REMOVE = string.punctuation
ans = text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))
ans
>> 'It was a great night Shout out to Amy Lee for organizing wonderful event aka on fire'

The translate function is a string method that uses an input dictionary (a translation table) to perform the mapping. The maketrans function is its companion method that creates the table to be used as input for translate. Note that maketrans accepts up to three arguments, and when all three are passed, each character in the third argument is mapped to None. This behavior can be used to remove characters from strings.

In the code snippet above, we pass empty strings as the first and second arguments of the maketrans function (since we don't need them) and pass the punctuation characters from string.punctuation as the third argument. As a result, those punctuation characters are removed from the string stored in the variable text.
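
To make the mapping concrete, here is a minimal illustration of the translation table str.maketrans builds when all three arguments are passed (the toy string below is just an assumed example):

table = str.maketrans('', '', '!?')  # characters in the third argument are mapped to None
table
>> {33: None, 63: None}
"Really?!".translate(table)
>> 'Really'

The keys of the table are the Unicode code points of the characters to be removed.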

text = "My cell phone number is 123456. Please take note."text_cleaned = ''.join([i for i in text if not i.isdigit()])text_cleaned
>> "My cell phone number is. Please take note."

You can also do the same thing using regular expressions, one of your best friends for string operations.

import re

text_cleaned = re.sub(r'\w*\d\w*', '', text)
text_cleaned
>> 'My cell phone number is . Please take note.'

As the volume of unstructured text generated on social media platforms grows, more text data contain non-typical characters like emojis. Emojis can be difficult for machines to interpret and may add unnecessary noise to your NLP model, which is an argument for removing them from your text data. However, if you are doing sentiment analysis, transforming emojis into some text format instead of removing them outright may be beneficial, since emojis can carry useful information about the sentiment of the text at hand. One way to do this is to create your own custom dictionary that maps each emoji to text denoting the same sentiment (e.g. {🔥: fire}), as sketched below.
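
Here is a minimal sketch of that custom-dictionary idea; the mapping and helper function below are hypothetical examples, not an exhaustive emoji lexicon.

emoji_to_text = {
    "🔥": "fire",
    "😂": "laughing",
    "😢": "sad",
}

def replace_emojis(text, mapping=emoji_to_text):
    # Replace each known emoji with its word, padded with spaces so tokens stay separated
    for emoji, word in mapping.items():
        text = text.replace(emoji, f" {word} ")
    # Collapse any extra whitespace introduced by the replacement
    return " ".join(text.split())

replace_emojis("game is on 🔥🔥")
>> 'game is on fire fire'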

Check out this post that illustrates how to remove emojis from your text.

import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

remove_emoji("game is on 🔥🔥")
>> 'game is on '

The contractions package in Python (which you need to install with !pip install contractions) allows us to spell out contractions. Spelling out contractions can add more information to your text data by letting more tokens be created when tokenization is performed. For instance, in the code snippet below, the token "would" is not a separate token when whitespace-based word tokenization is performed; it lives inside the token "She'd". Once we fix the contractions, however, "would" becomes a standalone token, giving the NLP model more tokens to make use of. This may help the model better understand what the text means and thereby improve accuracy on various NLP tasks.

import contractions

text = "She'd like to hang out with you sometime!"
contractions.fix(text)
>> 'She would like to hang out with you sometime!'

But since this package may not be 100% comprehensive (i.e. it does not cover every single contraction that exists), you can also build your own custom dictionary that maps contractions not covered by the package to their spelled-out versions, as sketched below. This post shows an example of how to do that!
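
A minimal sketch of such a custom dictionary; the entries and the helper function below are hypothetical examples, not part of the contractions package.

custom_contractions = {
    "y'all'd've": "you all would have",
    "'twas": "it was",
}

def fix_custom_contractions(text, mapping=custom_contractions):
    # Replace each custom contraction with its spelled-out form
    for contraction, expansion in mapping.items():
        text = text.replace(contraction, expansion)
    return text

fix_custom_contractions("'twas a long night, y'all'd've loved it")
>> 'it was a long night, you all would have loved it'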

We can use Python's BeautifulSoup package to strip HTML tags. The package is built for web scraping, but its HTML parser can be taken advantage of for stripping HTML tags like the following.

from bs4 import BeautifulSoup

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

# Below is another variation that also drops the contents of certain tags
def clean_html(html):
    # Parse the HTML content
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script', 'code', 'a']):
        # Remove these tags and everything inside them
        data.decompose()
    # Return the remaining text content
    return ' '.join(soup.stripped_strings)
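
A quick usage example of strip_html_tags with an assumed HTML snippet:

strip_html_tags("<p>It was a <b>great</b> night!</p>")
>> 'It was a great night!'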
We can also normalize accented characters to their closest ASCII equivalents using Python's built-in unicodedata module.

import unicodedata

def remove_accent_chars(text):
    # Decompose accented characters, drop the non-ASCII combining marks, and decode back to str
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
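
A quick usage example of the function above, with an assumed input string:

remove_accent_chars("résumé café")
>> 'resume cafe'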

We can use regular expressions to remove URLs, mentions, hashtags and special characters, since they follow recognizable structures and patterns. The following is just one example of how to match and remove URLs, mentions and hashtags in strings; keep in mind there are multiple ways to write regular expressions that produce the same output.

import re

## Remove URLs
def remove_url(text):
    return re.sub(r'https?:\S*', '', text)

print(remove_url('The website https://www.spotify.com/ crashed last night due to high traffic.'))
>> 'The website  crashed last night due to high traffic.'

## Remove mention (@) and hashtag (#) symbols
def remove_mentions_and_tags(text):
    # Remove only the symbols so the handle and tag text are kept
    text = re.sub(r'@', '', text)
    return re.sub(r'#', '', text)

print(remove_mentions_and_tags('Thank you @Jay for your contribution to this project! #projectover'))
>> 'Thank you Jay for your contribution to this project! projectover'

## Remove special characters
def remove_spec_chars(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
Reference: https://medium.com/mlearning-ai/nlp-a-comprehensive-guide-to-text-cleaning-and-preprocessing-63f364febfc5
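
And a quick usage example of remove_spec_chars above, again with an assumed input string:

print(remove_spec_chars('Great product!!! 10/10, would recommend :)'))
>> 'Great product 1010 would recommend '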

Stop words are very common words that carry little value in helping select documents or in modeling for NLP. These words are often dropped from text data during pre-processing because their excessive frequency adds little to the accuracy of NLP models. Just as features with low variance are less valuable in typical machine learning models, because they do not help the model distinguish between data points, stop words can be thought of as low-variance features in NLP. Along the same lines, keeping stop words can contribute to overfitting, where the model performs poorly on unseen data and fails to generalize to new data points.

import nltk
from nltk.tokenize.toktok import ToktokTokenizer

# Retrieve the stop word list from NLTK
stopword_list = nltk.corpus.stopwords.words('english')
# Keep negations, since they can carry useful (e.g. sentiment) information
stopword_list.remove('no')
stopword_list.remove('not')

tokenizer = ToktokTokenizer()

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    # Strip whitespace from every token
    tokens = [token.strip() for token in tokens]
    # Keep only the non stop word tokens
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # Join the remaining tokens using a space as a delimiter
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
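
A quick example of the function above, assuming the NLTK stopwords corpus has already been downloaded via nltk.download('stopwords'):

remove_stopwords("the weather is not so good today")
>> 'weather not good today'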

Please note that there is another way to retrieve stop words, using spaCy, another package frequently used for NLP tasks. We can do so like the following:

import spacy

en = spacy.load('en_core_web_sm')  # load spaCy's small English model
stopword_list = en.Defaults.stop_words

Just like any other Data Science task, pre-processing for NLP should not be done blindly. Consider what your objectives are. What are you trying to get out of removing hashtag and mention symbols from social media text data you scraped, for instance? Is it because those symbols do not add much value to the NLP model you are building to predict sentiment of some corpus? Unless you ask these questions and are able to answer them clearly, you should not be cleaning text on an ad-hoc basis. Please keep in mind that questioning the “why” is important in the field of Data Science.

In this article, we looked at a comprehensive list of ways to clean text, along with code snippets showing how to implement them, before moving on to later stages of the NLP pipeline such as lemmatization.

If you found this post helpful, consider supporting me by signing up on Medium via the following link :)

joshnjuny.medium.com

You will have access to so many useful and interesting articles and posts from not only me but also other authors!

Data Scientist. 1st Year PhD student in Informatics at UC Irvine.

Former research area specialist at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan, working on statistical report generation, automated data quality review, building data pipelines, and data standardization & harmonization. Former Data Science Intern at Spotify Inc. (NYC).

He loves sports, working out, cooking good Asian food, watching K-dramas, making and performing music, and most importantly worshiping Jesus Christ, our Lord. Check out his website!

