
Raw text correction with Fuzzy Matching for NLP tasks
by Aviad Klinger, Jun 2022



Learn how to fix misspelled words to better identify essential text expressions

Photo by Diomari Madulara on Unsplash

Natural Language Processing (NLP) is used today in many ML tasks and projects in healthcare, finance, marketing, and more. Data scientists often struggle to clean and analyze text data in order to extract insights. For most NLP tasks it is common to use techniques such as tokenization, stemming, and lemmatization.

However, in some cases there is a need to keep the raw text intact rather than split it into tokens. For instance, data de-identification is a special case of Named Entity Recognition (NER), a method for recognizing different entities in a document, and its output is the original text with the desired entities replaced by labels.

In these cases, correcting misspelled or erroneous terms can be challenging. This post will explain how to achieve this task using a combination of RegEx and Fuzzy String Matching.

Fuzzy String Matching

Fuzzy string matching is a technique that finds strings which approximately match a given string pattern. The algorithm behind fuzzy string matching uses a distance metric, such as the Levenshtein distance, which measures the difference between two strings as the minimum number of edits needed to convert one string into the other. We will use the Python library fuzzywuzzy to perform this task.
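To make the idea of a "minimum number of edits" concrete, here is a small sketch of the Levenshtein distance itself, written as a plain dynamic-programming function purely for illustration (fuzzywuzzy does not expose this function; its scores are normalized ratios rather than raw distances):

def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if characters match)
        prev = curr
    return prev[-1]

print(levenshtein("appls", "apples"))
# 1 -- a single inserted "e" turns "appls" into "apples"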

Installation and example:

pip install fuzzywuzzy

from fuzzywuzzy import fuzz
fuzz.ratio("appls", "apples")
# 91

For this example we got a similarity score of 91 out of 100, so the words are very similar. Now we can consider what threshold to use in order to decide whether or not to “correct” the original word.
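One way to get a feel for a sensible threshold is to score a few known misspellings against their intended forms and see where the cut-off should sit. A minimal sketch, using the word pairs that appear later in this post (the exact scores will vary from pair to pair):

from fuzzywuzzy import fuzz

# (misspelling, intended word) pairs for illustration
pairs = [("appls", "apples"), ("moxiperil", "moexipril"), ("vasotek", "vasotec")]

threshold = 85  # candidate cut-off to evaluate
for wrong, right in pairs:
    score = fuzz.ratio(wrong, right)
    decision = "replace" if score >= threshold else "keep"
    print(f"{wrong} -> {right}: {score} ({decision})")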

RegEx

RegEx is short for Regular Expression: a special text string that specifies a search pattern. The pattern is essentially a mini-language that defines exactly what to look for in a text string. For example, if we want to extract all characters other than digits, the regex pattern is:

[^0-9]+
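A quick sketch of this pattern in action (the sample string here is just an illustrative assumption):

import re

print(re.findall(r'[^0-9]+', "room 42, bed 7"))
# ['room ', ', bed ']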

If we want to extract all e-mail addresses the regex pattern will be:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}

We will use Python's built-in re module to perform this task.

Example (re is part of the Python standard library, so no installation is needed):

import re

string = "my e-mail is [email protected]"
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'
print(re.search(pattern, string).group())
# [email protected]

Raw text correction

Getting back to the original problem: what do we do when we need to fix erroneous words or phrases while keeping the raw text intact, without splitting it into tokens? Keeping the text intact also means preserving exactly the same spacing, tabs, line breaks, punctuation, etc.

Let’s say that during a NER task we want to tag medications given to patients in a hospital. This information is available in the doctors’ notes section of the Electronic Health Record (EHR) system. Here is one example entry:

The patient XXX was hospitalized last week.

He was given moxiperil.

He is male, 65 years old, with a history of heart disease.

The correct name of the medication is “moexipril”. In order to correct this we need to:

  1. Prepare a list of keywords we want to search for, in this case only one keyword
  2. Decide on a similarity threshold (default is 85)
  3. Split the text into tokens
  4. Run Fuzzy matching between the keyword and each token
  5. If the similarity score meets or exceeds the predetermined threshold, replace the token with the keyword
  6. Put it all back together

We can do this using a function as follows:

from fuzzywuzzy import fuzz
import re

def fuzzy_replace(keyword_str, text_str, threshold=85):
    l = len(keyword_str.split())  # number of words in the keyword
    splitted = re.split(r'(\W+)', text_str)  # split into tokens, keeping separators (spaces, line breaks, punctuation)
    span = 2 * l - 1  # a keyword of l words covers l word tokens plus the l-1 separators between them
    for i in range(len(splitted) - span + 1):
        temp = "".join(splitted[i:i + span])  # candidate slice of the text, same word-length as the keyword
        if fuzz.ratio(keyword_str, temp) >= threshold:
            before = "".join(splitted[:i])
            after = "".join(splitted[i + span:])
            text_str = before + keyword_str + after  # swap the fuzzy match for the correct keyword
            splitted = re.split(r'(\W+)', text_str)  # re-split the corrected text before continuing
    return text_str
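For example, applying the function to the note above could look like this (a minimal sketch; the note variable simply holds the example entry shown earlier):

note = """The patient XXX was hospitalized last week.
He was given moxiperil.
He is male, 65 years old, with a history of heart disease."""

print(fuzzy_replace("moexipril", note))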

After running this function the output text is corrected and the original structure of the text is preserved:

The patient XXX was hospitalized last week.

He was given moexipril.

He is male, 65 years old, with a history of heart disease.

Let’s now look at a more complex example. This time the medical notes contain a few different medications that we’d like to correct, and some of them occur in the text more than once. To address this, we define a list holding the correct names of the medications and simply loop over it, correcting the text for each one. The following code snippet shows how to achieve this:

meds = ["moexipril", "vasotec", "candesartan"]
text = """The patient XXX was hospitalized last week.
He was given moxiperil and vasotek.
He is male, 65 years old, with a history of heart disease.
Patient has been taking vasotek for several years.
In the past was given candasarta."""

for med in meds:
    text = fuzzy_replace(med, text)

The result is the same text with all medication names corrected.

The patient XXX was hospitalized last week.

He was given moexipril and vasotec.

He is male, 65 years old, with a history of heart disease.

Patient has been taking vasotec for several years.

In the past was given candesartan.

I’ve decided not to include a compulsory conversion of the text to lower case, as there are times when one would like to keep the original case, for example to identify acronyms. However, this can be done easily by passing the lower-case form of the arguments into the function, like so: fuzzy_replace(med.lower(), text.lower()).
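In the context of the loop above, that case-insensitive variant is a one-line change (note that, used this way, the returned text is lower-cased as well):

for med in meds:
    text = fuzzy_replace(med.lower(), text.lower())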

