
Contextual Text Correction Using NLP | by Arun Jagota | Jan, 2023



Image by Lorenzo Cafaro from Pixabay

In the previous article, we discussed the problem of detecting and correcting common errors in text using methods from statistical NLP:

There we took an inventory of several issues, with accompanying real examples and discussion. Below are the ones we had not fully resolved in that post. (The last two were not even touched.) These are the ones that require handling of context.

  • Missing commas.
  • Missing or incorrect articles.
  • Using singular instead of plural, or vice-versa.
  • Using the wrong preposition or other connective.

In this post, we start with issues involving articles. We look at elaborate examples of this scenario and delve into what we mean by “issues” in each.

We then describe a method that addresses them. It uses a key idea of self-supervision.

We then move on to the various other scenarios and discuss how this same method addresses them as well, albeit with slightly different specifications of the outcomes for the self-supervision and slightly different preprocessing.

Issues Involving Articles

Consider these examples.

… within matter of minutes …
… capitalize first letter in word …

In the first sentence, there should be an a between within and matter. In the second sentence, there should be a the right after capitalize and another one right after in.

Consider this rule.

If the string is 'within matter of'
Insert 'a' immediately after 'within'.

Would you agree this rule makes sense? Never mind its narrow scope of applicability. As will become clear later, this will hardly matter.

If you agree, then within and matter of are, respectively, the left and the right contexts for where the a should be inserted.

We can succinctly represent such a rule as LaR, which should be read as follows: if the left context is L and the right context is R, then there should be an a between the two. In our setting, L and R are both sequences of tokens, perhaps bounded in length.

As will become clear in the paragraphs that follow, it is in fact better to express this rule in a somewhat generalized form as LMR.

Here M denotes a fixed set of possibilities defining exactly the problem we are trying to solve. In our case, we might choose M to be the set { a, an, the, _none_ }.

We would read this rule as “if the left context is L and the right context is R, then there are four possibilities we want to model: _none_, meaning there is no article in between L and R, and the other three for the three specific articles.”

What we are really doing here is formalizing the problem we want to solve as a supervised learning problem with specific outcomes, in our case M. This will not require any human labeling of the data, just defining M.

In other words, this is self-supervision. We can define as many problems as we want, for different choices of M. (In fact, in this post we will address a few more.) Then we can apply the power of supervised learning without having to incur the cost of labeling data. Very powerful.

Let’s see an example. Consider M = {_none_, _a_, _the_, _an_ }. Say our training set has exactly one sentence in it. The underscores are there just for readability — to distinguish between the outcomes in M and the other words in the text.

John is a man.

We’ll further assume that our rule does not cross sentence boundaries. This is a reasonable assumption. Nothing in the modeling depends on this assumption so it can always be relaxed as needed.

From this one-sentence corpus we will derive the following labeled data:

John _none_ is a man
John is _none_ a man
John is a _none_ man

John is _a_ man

On each line, the word flanked by the underscores is an outcome in M, the words to the left its left context, and the words to the right its right context.

For instance,

John is _a_ man

says that if the left context is [John, is] and the right context is [man], then there is an a between the left and the right contexts. So this labeled instance captures where the article should be and what its identity should be.

The remaining instances capture the negatives, i.e. where the article should not be.
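To make the derivation concrete, here is a minimal Python sketch that reproduces the four labeled instances above from the one-sentence corpus. The function name, the list-of-triples representation, and the bound on context length are illustrative choices, not something this post prescribes.

ARTICLES = {"a", "an", "the"}

def derive_instances(tokens, max_len=3):
    """Derive (L, label, R) triples from one tokenized sentence."""
    instances = []
    # A negative instance for every gap between adjacent tokens:
    # no article is to be inserted there.
    for i in range(1, len(tokens)):
        instances.append((tokens[:i], "_none_", tokens[i:]))
    # A positive instance for every article occurrence, with the
    # article itself removed from the window.
    for i, tok in enumerate(tokens):
        if tok in ARTICLES:
            instances.append((tokens[:i], tok, tokens[i + 1:]))
    # Bound the context lengths (the bound itself is a free choice).
    return [(L[-max_len:], m, R[:max_len]) for L, m, R in instances]

for L, m, R in derive_instances("John is a man".split()):
    print(L, m, R)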

Once we have such labeled data sets we can in principle use any suitable supervised learning method to learn to predict the label from the input (L, R).

In this post, we will focus on a particular supervised learning method that we think is a strong fit for this problem. It is a purpose-built method that models L and R as sequences of tokens.

The reader might ask: why not use the latest and greatest NLP techniques for this problem, since they handle very elaborate scenarios? Recurrent neural networks, transformers, and most recently large language models such as ChatGPT, or perhaps even Hidden Markov Models or Conditional Random Fields. (For more on elaborate language models, see [6] and [7].) Some if not all of them should work very well.

There are tradeoffs. If one is trying to solve these problems for the long term, perhaps to build a product around it, e.g., Grammarly [3], the latest and greatest methods should of course be considered.

If, on the other hand, one wishes to build, or at least understand, simpler yet effective methods from scratch, then the method of this post should be considered.

The aforementioned method is also easy to implement incrementally. For readers who want to give this a try, check out the section Mini Project. The project described there could be executed by a programmer well-versed in Python or some other scripting language in a few hours, a day at most.

The Method

First, let’s describe this method for the particular problem of missing or incorrect articles. Following that we will apply it to several of the other issues mentioned earlier in this post.

Consider LMR. We will work with a probability distribution P(M|L, R) attached to this rule. P(M|L, R) will tell us which of the outcomes in M is more likely than others in the context of (L, R).

For instance, we would expect P(a|L=John is, R=man) to be close to, if not exactly, 1.

P(M|L, R) can be learned from our training data in an obvious way.

P(m | L, R) = #(L, m, R) / Σ_m’ #(L, m’, R)

Here #(L, m’, R) is the number of instances in our training set in which the label on input (L, R) is m’. Note that if m’ is _none_ then R begins right after L ends.

Let’s say our training corpus now has exactly two sentences.

John is a man.
John is the man.

P(a|L=John is, R=man) would be ½, since there are two instances of this (L, R), of which one is labeled a and the other the.
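Continuing the sketch, here is one way to accumulate these counts and compute P(m|L, R). It assumes the derive_instances helper from the earlier sketch, or any equivalent producer of (L, m, R) triples.

from collections import Counter, defaultdict

# counts[(L, R)] is a Counter over the outcomes m observed with that context pair.
counts = defaultdict(Counter)

def train(sentences):
    for sentence in sentences:
        for L, m, R in derive_instances(sentence.split()):
            counts[(tuple(L), tuple(R))][m] += 1

def p(m, L, R):
    """P(m | L, R) = #(L, m, R) / sum over m' of #(L, m', R)."""
    dist = counts[(tuple(L), tuple(R))]
    total = sum(dist.values())
    return dist[m] / total if total else 0.0

train(["John is a man", "John is the man"])
print(p("a", ["John", "is"], ["man"]))   # 0.5, as in the two-sentence example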

Generalizing, in the ML Sense

Consider the labeled instances

John is _a_ man.
Jack is _a_ man.
Jeff is _a_ man.

If our corpus had enough of these, we’d want our ML to be able to learn the rule

is _a_ man

i.e., that P(a|L=is, R=man) is also close to 1. Such a rule would generalize better as it is applicable to any scenario in which the left context is is and the right context is man.

In our approach, we will address this as follows.

Say LmR is an instance in the training set. Below we’ll assume L and R are sequences of tokens. In our setting, the tokenization may be based on white space, for instance. That said, our method will work with any tokenization.

From LmR we will derive new training instances L’mR’ where L’ is a suffix of L and R’ a prefix of R. L’ or R’ or both may have zero tokens.

The derived instances will cover all combinations of L’ and R’.

Sure, the size of the training set could explode if this is applied to a large corpus and the lengths of L and R are not bounded. Okay, bound them.
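Here is a minimal sketch of this expansion step; the bound on the context lengths is the illustrative part.

def expand(L, m, R, max_len=3):
    """From one instance LmR, derive every (L', m, R') where L' is a suffix
    of L and R' is a prefix of R, each possibly empty and at most max_len
    tokens long."""
    L, R = L[-max_len:], R[:max_len]
    return [(L[i:], m, R[:j])
            for i in range(len(L) + 1)
            for j in range(len(R) + 1)]

# The instance "John is _a_ man" yields, among others, the generalized
# rule with L' = ['is'] and R' = ['man'].
for rule in expand(["John", "is"], "a", ["man"]):
    print(rule)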

Recap

Okay, let’s see where we are. Consider the examples earlier in this post.

… within matter of minutes …
… capitalize first letter in word …

Assuming our training corpus is rich enough, for example, all of Wikipedia pre-segmented into sentences, we should have no difficulty whatsoever in detecting where the articles are missing in these two sentences, and recommending specific fixes. The sentences that would result from applying these fixes are

… within a matter of minutes …
… capitalize the first letter in the word …

Now consider

… within the matter of minutes …

Using our trained model we can detect that the here should probably be a.

Prediction Algorithm

To this point, we’ve only discussed informally, not in fine detail, how we might use the learned rules to identify issues. We now close this gap.

Consider a window LmR on which we want to compare m with the predictions from the rules that apply in this situation. For example, were LmR to be

… within the matter of minutes …

we’d want to predict off the rules L’ _the_ R’, where L’ is [within] or [], and R’ is [matter, of, minutes], [matter, of], [matter], or [], and from these predictions somehow come up with a final prediction.

The approach we will take is the following. We assume that we are given some cutoff, call it c, on the minimum value that P(m’|L, R) needs to be for us to surface that our method predicts m’ in the context (L, R).

We will examine our rules in order of nondecreasing |L’| + |R’|. Here |T| denotes the number of tokens in a list T. We will stop as soon as we find some m’ for some L’, R’ such that P(m’|L’, R’) is at least c.

In plain English, we are doing this. Among all the rules that apply to a particular situation, we are finding one that is sufficiently predictive of some outcome in M and is also the most general among those that do.
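Below is a sketch of this prediction step for a single window. It reuses the counts table from the earlier sketch, now assumed to also hold counts for the expanded (L’, R’) combinations; the cutoff c and the traversal order follow the description above, the rest is illustrative.

def predict(L, R, c=0.9, max_len=3):
    """Return (m', rule) for the most general applicable rule whose top
    outcome m' has P(m'|L', R') >= c, or None if there is no such rule.
    The caller compares m' with the outcome actually observed in the text."""
    L, R = L[-max_len:], R[:max_len]
    candidates = [(L[i:], R[:j])
                  for i in range(len(L) + 1)
                  for j in range(len(R) + 1)]
    # Examine rules in order of nondecreasing |L'| + |R'|, i.e. most general first.
    candidates.sort(key=lambda lr: len(lr[0]) + len(lr[1]))
    for Lp, Rp in candidates:
        dist = counts.get((tuple(Lp), tuple(Rp)))
        if not dist:
            continue
        m_best, n = dist.most_common(1)[0]
        if n / sum(dist.values()) >= c:
            return m_best, (Lp, Rp)
    return None

# On "... within the matter of minutes ...", the window around "the" might
# return ("a", (["within"], ["matter", "of"])) if that rule is predictive enough.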

Try These

Consider these examples, also from https://en.wikipedia.org/wiki/Shannon_Lucid

I have removed the articles. I would like the reader to guess where an article should go and what it should be: the, a, or an.

… included trip to …
… had different payload …
… on wide variety of …
… was teaching assistant …
… and bought preowned Piper PA-16 Clipper …
… as graduate assistant in Department of Biochemistry and
Molecular Biology …
… transferred to University of Oklahoma …

Look below only after you have made all your predictions.

The instances from which these were derived were

… included a trip to …
… had a different payload …
… on a wide variety of …
… was a teaching assistant …
… and bought a preowned Piper PA-16 Clipper …
… as a graduate assistant in the Department of Biochemistry and
Molecular Biology …
… transferred to the University of Oklahoma …

How good were your predictions? If you did well, the method described so far in this post would also have worked well.

Mini Project

If you are interested in a mini project that could be implemented in hours, consider this. Write a script, probably just a few lines, to input a text document and output a labeled training set. Then inspect the labeled training set to get a sense of whether it contains useful instances for the prediction of the locations and the identities of the articles.

If your assessment shows potential and you have the time, you can then consider taking it further. Perhaps use an existing ML implementation, such as from scikit-learn, on the training set. Or implement the method from scratch.

Now for some more detail that will help with your script. Consider limiting the contexts L and R to exactly one word each. Scan the words in the document in sequence, and construct the negative and positive instances on the fly. Ignore sentence boundaries unless you have access to an NLP tool such as NLTK and can use its tokenizer to segment the text into sentences.

Deposit the constructed instances incrementally into a pandas data frame with three columns: L, R, and M, where M holds the outcome from the set we chose in this section. Output this data frame to a CSV file.
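Below is one possible version of such a script, under the simplifications suggested above: one-word contexts, whitespace tokenization, and no sentence segmentation. The file names are placeholders.

import pandas as pd

ARTICLES = {"a", "an", "the"}

def build_training_set(text):
    """One-word L and R; the label M is the article between them, or _none_."""
    tokens = text.split()
    rows = []
    # A negative instance for every gap between adjacent tokens.
    for i in range(len(tokens) - 1):
        rows.append({"L": tokens[i], "R": tokens[i + 1], "M": "_none_"})
    # A positive instance for every article, with the article itself
    # removed from the window.
    for i, tok in enumerate(tokens):
        if tok in ARTICLES and 0 < i < len(tokens) - 1:
            rows.append({"L": tokens[i - 1], "R": tokens[i + 1], "M": tok})
    return pd.DataFrame(rows, columns=["L", "R", "M"])

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:   # e.g. a pasted Wikipedia page
        df = build_training_set(f.read())
    df.to_csv("article_training_set.csv", index=False)
    print(df.head())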

How to get a reasonable training set for your script? Download a Wikipedia page or two by copying and pasting.

Issues Involving Commas

Next, let’s turn our attention to issues involving commas. In [1] we covered some simple scenarios. The ones below are more nuanced.

Consider this from https://en.wikipedia.org/wiki/Zork

In Zork, the player explores …

First off, let’s observe that to apply our method, we should retain the comma as a separate token. Then the problem looks like the one we addressed earlier, the one on articles. It would make sense to choose M = {_comma_, _none_}. That is, the supervised learning problem is to predict whether there is a comma or not in the context (L, R).

From what we have seen so far, while the rules we learn might be effective, they may not generalize adequately. This is because the last token of the left context would be Zork. We are not really learning the general pattern

In <token of a certain type> _comma_ the

Is there a straightforward way to generalize our method so it can learn more general rules?

The answer is yes. Here is how.

We’ll introduce the concept of an abstracted token. We’ll start with a single abstraction that is relevant to our example. Later in this post, we’ll introduce other abstractions as needed.

We’ll assume the word to which this abstraction is applied contains only letters (a through z, in either case). That is, no digits; no special characters.

This abstraction will produce one of three strings: /capitalized/ denoting that the word begins with a letter in upper case followed by zero or more letters in lower case, /all_lower/ denoting that all the letters in the word are in lower case, and /all_caps/ denoting that all the letters in the word are in upper case.

Next, we will derive new sequences of tokens from the existing ones by selectively applying this abstraction operator.

Let’s elaborate on “selectively”. If for every token in the sequence, we considered two possibilities, the original token or the abstracted one, we would get a combinatorial explosion of generated sequences.

To mitigate this issue, we will only abstract out the tokens that occur sufficiently infrequently, if at all, in our training set. Or abstract out only those that yield /capitalized/ or /all_caps/.

Below is the sequence we may derive from In Zork, the

In /capitalized/, the

We only abstracted Zork as it is both capitalized and an uncommon word.
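Here is a minimal sketch of this abstraction and its selective application. The common_words parameter is a stand-in for the frequency test; in the example, In and the count as common and Zork does not.

def abstract_token(tok):
    """Map a purely alphabetic token to its shape string; leave others alone."""
    if not tok.isalpha():
        return tok
    if tok.isupper() and len(tok) > 1:
        return "/all_caps/"
    if tok[0].isupper() and (len(tok) == 1 or tok[1:].islower()):
        return "/capitalized/"
    if tok.islower():
        return "/all_lower/"
    return tok   # mixed case such as "PhD": leave as is

def abstract_selectively(tokens, common_words=frozenset()):
    """Abstract only uncommon tokens whose shape is /capitalized/ or /all_caps/."""
    out = []
    for tok in tokens:
        shape = abstract_token(tok)
        if shape in ("/capitalized/", "/all_caps/") and tok not in common_words:
            out.append(shape)
        else:
            out.append(tok)
    return out

print(abstract_selectively(["In", "Zork", ",", "the"], common_words={"In", "the"}))
# ['In', '/capitalized/', ',', 'the']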

Now imagine that we add, to the training set, new labeled instances derived from the abstracted sequences. The label is the one associated with the original sequence.

In our example, the derived labeled instance would be

In /capitalized/ _comma_ the

Now we train our algorithm exactly as before. It will learn the generalized rules as well.

Note that when we say “add new labeled instances to the training set” we are not implying that this needs to be done offline. We can simply add these labeled instances on the fly. This is analogous to what is often done in ML practice.

Extract features from the input on-the-fly

Also, note that we described our method as “adding new labeled instances” only because we felt it was useful to explain it this way. We can view this alternately as if we did not add new labeled instances but merely extracted additional features.

This is because all the newly-added instances have the same label — the original one. So we can collapse them all into the original instance, just with additional features extracted.

More Nuanced Examples

Now consider these examples from https://en.wikipedia.org/wiki/Shannon_Lucid

Due to America’s ongoing war with Japan, when she was six weeks old, 
the family was detained by the Japanese.

They moved to Lubbock, Texas, and then settled in Bethany, Oklahoma, the
family's original hometown, where Wells graduated from Bethany High School
in 1960.

She concluded that she had been born too late for this, but discovered
the works of Robert Goddard, the American rocket scientist, and decided
that she could become a space explorer.

These are more intricate.

Nonetheless, we will continue with our method for the reasons we mentioned earlier in the post. One is that a basic yet meaningful version can be implemented in days if not hours from scratch. (No ML libraries needed.)

To these, we’ll add one more. This method’s predictions are explainable. Specifically, if it detects an issue and makes a recommendation, then the specific rule that was involved can be attached as an explanation. As we’ve seen, rules are generally transparent.

Okay, back to the examples.

Let’s examine the scenarios involving commas in the above examples one by one. We won’t examine all.

With those that we do examine, we will also weigh in on whether we think our current method has a good chance of working as is. These inspections will also generate ideas for further enhancement.

Consider

Due to America’s ongoing war with Japan, when she was six weeks old

The sequence we derive from this is

Due to /capitalized/’s ongoing war with /capitalized/, when she was six 
weeks old

The labeled instances derived from these two sequences also include all combinations of suffixes of the left context paired with prefixes of the right context. In the terminology of machine learning, this means that we are enumerating lots of hypotheses in the space of hypotheses (in our setting, the hypotheses are the rules).

The point we are trying to make in the previous paragraph is that by generating lots of hypotheses, we increase the likelihood of finding some rules that are sufficiently predictive.

Of course, there is no free lunch. This impacts the training time and model complexity as well.

This also assumes that we are somehow able to discard the rules we found during this process that turned out to be noisy or ineffective. Specifically, those that were either insufficiently predictive or could be covered by more general rules that are equally predictive.

In a section downstream in this post, we will address all these issues. That said, only an empirical evaluation over a wide range of scenarios would ultimately reveal how effective our approach is.

Back to this specific example. First, let’s see it again.

Due to /capitalized/’s ongoing war with /capitalized/, when she was six 
weeks old

There is a fair chance our method will work adequately as-is. If not on this particular one, then at least on similar examples. Furthermore, nothing specific comes to mind in terms of enhancements. So let’s move on to other examples.

Next, consider

when she was six weeks old, the family was detained by the Japanese.

We think the current method, as is, is likely to work for this. Why? Consider

… when she was six weeks old the family was detained by …

Would you not consider inserting a comma between old and the based on this information alone? (I do mean “consider”.)

If you would, the algorithm could also work well. It sees the same information.

Next, consider

They moved to Lubbock, Texas
then settled in Bethany, Oklahoma

The abstraction we presented earlier, which abstracts certain words out into /capitalized/, /all_lower/, or /all_caps/, should help here.

If it doesn’t help adequately, we can tack on a second, finer abstraction. Specifically, one involving detecting the named entities city and state. This would let us derive two new sequences.

They moved to /city/, /state/
then settled in /city/, /state/

Even More Nuanced Cases

Below are even more nuanced examples of issues involving commas. These are also from https://en.wikipedia.org/wiki/Shannon_Lucid

Originally scheduled as one mission, the number of Spacelab Life Sciences 
objectives and experiments had grown until it was split into two
missions,[57] the first of which, STS-40/SLS-1, was flown in June 1991.

To study this, on the second day of the mission Lucid and Fettman wore
headsets, known as accelerometer recording units, which recorded their
head movements during the day. Along with Seddon, Wolf and Fettman, Lucid
collected blood and urine samples from the crew for metabolic experiments.

These suggest that we probably need to allow for quite long left and right contexts, possibly up to 20 words each. And maybe add more abstractions.

Setting abstractions aside, how will this impact our model training? First of all, since we are learning a complex model, we’ll need our training set to be sufficiently large, rich, and diverse. Fortunately, such a data set can be assembled without much effort. Download and use all of Wikipedia. See [5].

Okay, now on to training time. This can be large, as we have a huge training set combined with a complex model we are trying to learn, one involving lots and lots of rules. Of course, the learned model itself is potentially huge, with perhaps the vast majority of the learned rules turning out to be noisy.

Later on in this post, we will discuss these challenges in detail and how to mitigate them. In particular, we will posit specific ways to weed out rules that are insufficiently predictive or those that can be covered with more general rules that remain sufficiently predictive.

For now, let’s move on to the next use case, which is

Issues Involving Prepositions Or Other Connectives

Now consider these examples, also from https://en.wikipedia.org/wiki/Shannon_Lucid which I have mutated slightly. Specifically, I replaced certain connectives with others that are somewhat plausible though not as good.

… participated on biomedical experiments …
… satellites were launched in successive days …
… initiated its deployment with pressing a button …

Can you spot the errors and fix them?

Below are the original, i.e. correct, versions.

… participated /in/ biomedical experiments …
… satellites were launched /on/ successive days …
… initiated its deployment /by/ pressing a button …

If you did well, so will the method.

Now to the modeling. We will let M denote the set of connectives we wish to model. M could be defined, for example, by the words tagged as prepositions by a certain part-of-speech tagger. Or some other way.

Regardless, we will need to ensure that we can determine with certainty and reasonably efficiently whether or not a particular token is in M.

This is because during training, while scanning a particular text, we will need to know, for every word, whether it is an instance of M or not.

To keep things simple, we will leave _none_ out of M. This means that we will only be able to model replacement errors, i.e., using the wrong connective. It is easy to add _none_ in but it clutters up the description a bit.
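One way to build such a membership test is to collect the words a part-of-speech tagger marks as prepositions in a sample of text. Below is a sketch using NLTK, whose Penn Treebank tag IN covers prepositions and subordinating conjunctions; defining M this way is just one option.

import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def connective_vocabulary(sample_text):
    """Collect words the tagger marks IN (preposition / subordinating
    conjunction) in a sample of text; one possible way to define M."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sample_text))
    return {word.lower() for word, tag in tagged if tag == "IN"}

M = connective_vocabulary("The satellites were launched on successive days by the crew.")
print(M)            # e.g. {'on', 'by'}
print("on" in M)    # the fast membership test we need during training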

Singular Versus Plural

Consider these examples, with the words we want to examine for the so-called grammatical number marked with slashes (/…/).

As we’ve seen, for some of the /problems/ we are trying to solve, we may 
need long left and right /contexts/. Up to 20 /words/ each. Perhaps longer.

We've also discussed that we'd preferably want a very rich data /set/ for
training.

First off, let’s ask how we would even detect the words in /…/ in an automated fashion. Here is a start. We could run a part-of-speech tagger and pick up only nouns.

Let’s try this out on our examples, using the part-of-speech tagger at https://parts-of-speech.info/. (The tagger’s color-coded output is not reproduced here.)

This, while not great, seems good enough to start with. It got problems, contexts, and words correct. It had a false positive, “and”, and a false negative, “set”. It also picked up training, which perhaps we don’t care about.

As we will discuss in more detail later, while the false positives may yield additional irrelevant rules, these will tend not to be harmful, only useless. Furthermore, we’ll catch them during the pruning phase.

That said, if we are concerned about accuracy upfront, we might consider a more advanced part-of-speech tagger. Or some other way to refine our detection approach. We won’t pursue either in this post.

Next, we’ll do a type of preprocessing we haven’t yet had to do in any of the use cases discussed thus far. Say the procedure we described in the previous paragraph detects a particular word that is the object of our study, i.e., a word whose grammatical number, singular or plural, we want to examine.

Right after we have detected such a word, we will run a grammatical number classifier, possibly one using a very simple heuristic such as: if the word ends with s or ies, deem it plural; else deem it singular. Next, we will add _singular_ or _plural_ to a copy of our text, depending on this classifier’s prediction. Importantly, we will also singularize the word that precedes the label.
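A sketch of this preprocessing step for one detected noun. The suffix test is the simple heuristic classifier described above; the singularizer is an equally crude stand-in (a real implementation might use a library such as inflect).

def grammatical_number(word):
    """Crude heuristic classifier: s/ies endings are deemed plural."""
    return "_plural_" if word.endswith(("s", "ies")) else "_singular_"

def singularize(word):
    """Crude singularizer matching the heuristic above."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess_noun(tokens, i):
    """Replace the noun at position i by its singular form and insert
    the _singular_/_plural_ label right after it."""
    label = grammatical_number(tokens[i])
    return tokens[:i] + [singularize(tokens[i]), label] + tokens[i + 1:]

print(preprocess_noun("for some of the problems we are trying to solve".split(), 4))
# ['for', 'some', 'of', 'the', 'problem', '_plural_', 'we', 'are', ...]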

In our examples, after all of this has been done, and using the part-of-speech tagger we used earlier, we will get

As we’ve seen, for some of the *problem* _plural_ we are trying to solve, 
we may need long left and _singular_ right *context* _plural_. Up to 20
*word* _plural_ each. Perhaps longer.

So our M will be the set { _singular_, _plural_ }.

Note that the left context includes the word whose grammatical number we are trying to predict. This is by design. This is why we added the labels explicitly to the text.

Also, note that the words flanked by asterisks are the ones we singularized. We did so because these words are in the left context of the label to be predicted. We want to strip off any information in the word itself that can be used to predict its label. Other than any information inherently in the singularized version of the word.

If we didn’t singularize these words we would have label leakage. This would have bad consequences. We might learn rules that seem to be good but don’t work well at prediction time.

Next, let’s do a quick review of the text as a sanity check, to assess whether the contexts seem to have enough signal to at least predict better than random. How accurately we can predict the labels will have to await an empirical evaluation.

It does seem that “for some of the problem” predicts _plural_. “left and right context” would also seem to predict _plural_ better than random. How much better is hard to say without seeing more examples. Similarly, “Up to 20 word” would seem to predict _plural_. The prediction might possibly improve, and certainly generalize better, were we to use the abstraction that 20 is _integer_greater_than_1_.

Model Complexity, Training Time, And Lookup Time

As we’ve seen, for some of the problems we are trying to solve, we may need long left and right contexts. Up to 20 words each. Perhaps longer.

We’ve also discussed that we’d preferably want a very rich data set for training. Such as all of Wikipedia. Our mechanism also relies on abstractions, which amplify the size of the training set possibly by another order of magnitude.

So is this a show-stopper for our method? Well, no. We can substantially prune the size of the model and substantially speed up training. We’ll discuss these below individually. We’ll also discuss how to be fast at what we are calling lookup time, as it will impact both training time and prediction time.

Reducing The Model Size

Let’s start with the model size. First off, keep in mind that in modern times large-scale real models do use billions of parameters. So we may be okay even without any pruning. That said, we’ll cover it anyhow.

When considering whether a particular rule should be deleted or not, we will distinguish between two cases.

  • Is the rule insufficiently predictive?
  • Is a more general rule sufficiently predictive compared to this one?

Our main reason for distinguishing between these two cases is that we will not explicitly prune for the first case. Instead, we will rely on either the second case addressing the first one as well or on the prediction algorithm doing on-the-fly pruning sufficiently well. With regards to the latter, also note that the prediction algorithm takes the cutoff c as a parameter, which allows us to get more conservative or more sensitive at prediction time.

Okay, with that out of the way, let’s address the second case.

To explain this method, let’s start with a learned rule LMR that is general enough. Here is an example.

from M learned

We deem it general because the left and the right contexts are a single word each.

Imagine that in the training corpus, the expression from a learned model appears at least once somewhere. So we would also have learned the rule

from M learned model

This rule is more specific. So we will deem it a child of the rule

from M learned

Now that we have defined child-parent relationships we can arrange the rules into a tree.

Now we are ready to describe the pruning criterion. For a particular node v in the tree, if all its descendants predict the same outcome as v does, we will prune away all the nodes below v in its subtree.

Let’s apply this to our example. In the setting M = {_a_, _an_, _the_, _none_}, the rule

from M learned model

predicts the same outcome, _a_, as does

from M learned

Furthermore, imagine that the latter rule has only one rule, the former, in its subtree. So we prune away the former.

Okay, we’ve defined the pruning criterion. Next, we discuss how to do the actual pruning, i.e. apply the criterion efficiently. The short answer is bottom-up.

We start with the leaves of the tree and find their parents. We then consider each of these parents one by one. We prune away a parent’s children if they all predict the same outcome as the parent.

We now have a new tree. We repeat this same process on it.

We stop when we can’t prune further or when we have pruned enough.
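Here is a sketch of this bottom-up pass over a rule tree, with rules represented by a small node class; the outcome field stands for the rule’s majority prediction.

class RuleNode:
    """A rule, its predicted (majority) outcome, and its more specific children."""
    def __init__(self, outcome, children=None):
        self.outcome = outcome
        self.children = children or []

def prune(node):
    """Return True if every node in this subtree predicts node's outcome;
    in that case the nodes below node are pruned away."""
    uniform = True
    for child in node.children:
        child_uniform = prune(child)
        if not child_uniform or child.outcome != node.outcome:
            uniform = False
    if uniform:
        node.children = []   # all descendants agree with this node: drop them
    return uniform

# 'from M learned' predicts _a_ and its only child 'from M learned model'
# also predicts _a_, so the child is pruned away.
root = RuleNode("_a_", [RuleNode("_a_")])
prune(root)
print(len(root.children))   # 0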

Speeding Up Training

On the one hand, we only need one pass over the sentences in the training set. Moreover, we only need to stop at the tokens that are instances of M, to pause and update various counters as described earlier. That’s good.

On the other hand, at a particular stopping point m, we may need to enumerate all admissible windows LmR so we can increment their counters involving m. For each of these, we also need to derive additional windows based on the abstractions we are modeling.

We’ve already discussed how to constrain the abstractions, so we won’t repeat that discussion here.

The key point we’d like to bring out is that pruning the model in the manner we described earlier not only reduces the model’s size, it also speeds up subsequent training. This is because, at any particular stopping point m, there will generally be far fewer rules that trigger in the pruned model compared to the unpruned one.

Lookup Time

By lookup, we mean that we want to efficiently look up rules that apply in a particular situation. Let’s start with an example. Say we have learned the rule

is M man

for the issue involving articles. Recall that we chose M to be { a, an, the, _none_ }.

Now consider the text Jeremy is a man. We want to scan it for issues. We will stop at a, since a is in M. We want to check the following, in order. For this M, is there a rule with L = [is] and R = []? Is there a rule with L = [] and R = [man]? Is there a rule with L = [is] and R = [man]? And so on. Let’s call each such “Is there a rule” question a look-up. A look-up takes M, L, and R as inputs.

We obviously want the lookups to be fast. We can make this happen by indexing the set of rules in a hashmap, let’s call it H. H is keyed on the triple (M, L, R). Think of H as a 3d hashmap, expressed as H[M][L][R].
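Here is a sketch of such an index using nested dictionaries. The outer key identifies the problem, i.e. the choice of M, and L and R are stored as tuples so they can be hashed; the string “articles” is just an illustrative key.

from collections import defaultdict

# H[problem][L][R] -> the rule's outcome distribution.
H = defaultdict(lambda: defaultdict(dict))

def add_rule(problem, L, R, distribution):
    H[problem][tuple(L)][tuple(R)] = distribution

def lookup(problem, L, R):
    """Return the rule's outcome distribution, or None if no such rule exists."""
    return H[problem].get(tuple(L), {}).get(tuple(R))

# Scanning "Jeremy is a man" and stopping at "a":
add_rule("articles", ["is"], ["man"], {"a": 0.9, "the": 0.1})
print(lookup("articles", ["is"], ["man"]))   # {'a': 0.9, 'the': 0.1}
print(lookup("articles", ["is"], []))        # None: no such rule (yet)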

Summary

In this post, we covered elaborate versions of scenarios involving detecting and correcting errors in text. By “elaborate” we mean those in which context seems important. We covered issues involving missing or incorrect articles, missing commas, using singular when it should be plural or the other way around, and using the wrong connective such as a wrong preposition.

We modeled each as a self-supervised learning problem. We described an approach that works on all these problems. It is based on a probability distribution on the space of outcomes conditioned jointly over a left context and a right context. The definition of the outcomes and some preprocessing do depend on the particular problem.

We discussed enumerating left-context and right-context pairs of increasing length, and also abstraction mechanisms to learn more general rules.

The method we described is straightforward to implement in its basic form.

We also described how to prune the set of learned rules, how to speed up training, and how to efficiently look up which of the rules apply to a particular situation.

References

  1. Text Correction Using NLP. Detecting and correcting common errors… | by Arun Jagota | Jan, 2023 | Towards Data Science
  2. Association rule learning — Wikipedia
  3. Grammarly. I used it extensively. Very useful.
  4. ChatGPT: Optimizing Language Models for Dialogue
  5. Wikipedia:Database download
  6. Statistical Language Models | by Arun Jagota | Towards Data Science | Medium
  7. Neural Language Models, Arun Jagota, Towards Data Science, Medium


Image by Lorenzo Cafaro from Pixabay

In the previous article, we discussed the problem of detecting and correcting common errors in text using methods from statistical NLP:

There we took an inventory of several issues, with accompanying real examples and discussion. Below are the ones that we had not fully resolved in that post. (The last two were not even touched.) These are the ones needing handling of context.

  • Missing commas.
  • Missing or incorrect articles.
  • Using singular instead of plural, or vice-versa.
  • Using the wrong preposition or other connective.

In this post, we start with issues involving articles. We look at elaborate examples of this scenario and delve into what we mean by “issues” on each.

We then describe a method that addresses them. It uses a key idea of self-supervision.

We then move on to the various other scenarios and discuss how this same method addresses them as well. Albeit with some slightly different specifications of the outcomes for the self-supervision, and slightly different preprocessing.

Issues Involving Articles

Consider these examples.

… within matter of minutes …
… capitalize first letter in word …

In the first sentence, there should be an a between within and matter. In the second sentence, there should be a the right after capitalize and another one right after in.

Consider this rule.

If the string is 'within matter of'
Insert 'a' immediately after 'within'.

Would you agree this rule makes sense? Never mind its narrow scope of applicability. As will become clear later, this will hardly matter.

If you agree, then within and matter of are the left and the right contexts respectively, for where the a should be inserted.

We can succinctly represent such a rule as LaR which should be read as follows. If the left context is L and the right context is R, then there should be an a between the two. In our setting, L and R are both sequences of tokens, perhaps bounded in length.

As will become clear in the paragraph that follows, in fact, it is better to express this rule in a somewhat generalized form as LMR.

Here M denotes a fixed set of possibilities defining exactly the problem we are trying to solve. In our case, we might choose M to be the set { a, an, the, _none_ }.

We would read this rule as “if the left context is L and the right context is R, then there are four possibilities we want to model. _None_, meaning there is no article in between L and R, and the other three for the three specific articles.

What we are really doing here is formalizing the problem we want to be solved as a supervised learning problem with specific outcomes, in our case M. This will not require any human labeling of the data. Just defining M.

What we are really doing is self-supervision. We can define as many problems as we want, for different choices of M. (In fact, in this post we will address a few more.) Then we can apply the power of supervised learning without having to incur the cost of labeling data. Very powerful.

Let’s see an example. Consider M = {_none_, _a_, _the_, _an_ }. Say our training set has exactly one sentence in it. The underscores are there just for readability — to distinguish between the outcomes in M and the other words in the text.

John is a man.

We’ll further assume that our rule does not cross sentence boundaries. This is a reasonable assumption. Nothing in the modeling depends on this assumption so it can always be relaxed as needed.

From this one-sentence corpus we will derive the following labeled data:

John _none_ is a man
John is _none_ a man
John is a _none_ male

John is _a_ man

On each line, the word flanked by the underscores is an outcome in M, the words to the left its left context, and the words to the right its right context.

For instance,

John is _a_ man

says that if the left context is [John, is] and the right context is [man], then there is an a between the left and the right contexts. So this labeled instance captures where the article should be and what its identity should be.

The remaining instances capture the negatives, i.e. where the article should not be.

Once we have such labeled data sets we can in principle use any suitable supervised learning method to learn to predict the label from the input (L, R).

In this post, we will focus on a particular supervised learning method that we think is a strong fit for our particular supervised learning problem. It is a for-purpose method that models L and R as sequences of tokens.

The reader might ask, why not use the latest and greatest NLP techniques for this problem as they handle very elaborate scenarios? Such as recurrent neural networks, transformers, and most recently large language models such as ChatGPT. Perhaps even Hidden Markov Models or Conditional Random Fields. (For more on elaborate language models, see [6] and [7].) Some if not all of them should work very well.

There are tradeoffs. If one is trying to solve these problems for the long term, perhaps to build a product around it, e.g., Grammarly [3], the latest and greatest methods should of course be considered.

If on the other hand, one wishes to build or at least understand simpler yet effective methods from scratch, then the method of this post should be considered.

The aforementioned method is also easy to implement incrementally. For readers who want to give this a try, check out the section Mini Project. The project described there could be executed in a few hours, tops a day. By a programmer well-versed in Python or some other scripting language.

The Method

First, let’s describe this method for the particular problem of missing or incorrect articles. Following that we will apply it to several of the other issues mentioned earlier in this post.

Consider LMR. We will work with a probability distribution P(M|L, R) attached to this rule. P(M|L, R) will tell us which of the outcomes in M is more likely than others in the context of (L, R).

For instance, we would expect P(a|L=John is, R=man) to be close to 1 if not 1.

P(M|L, R) can be learned from our training data in an obvious way.

P(m|L, R) = #(L,m,R)/sum_m’ #(L,m’,R)

Here #(L, m’, R) is the number of instances in our training set in which the label on input (L, R) is m’. Note that if m’ is _none_ then R begins right after L ends.

Let’s say our training corpus now has exactly two sentences.

John is a man.
John is the man.

P(a|L=John is, R=man) would be ½ since there are two instances of this (L, R) of which one is labeled a, the other the.

Generalizing, in the ML Sense

Consider the labeled instances

John is _a_ man.
Jack is _a_ man.
Jeff is _a_ man.

If our corpus had enough of these, we’d want our ML to be able to learn the rule

is _a_ man

i.e., that P(a|L=is, R=man) is also close to 1. Such a rule would generalize better as it is applicable to any scenario in which the left context is is and the right context is man.

In our approach, we will address this as follows.

Say LmR is an instance in the training set. Below we’ll assume L and R are sequences of tokens. In our setting, the tokenization may be based on white space, for instance. That said, our method will work with any tokenization.

From LmR we will derive new training instances L’mR’ where L’ is a suffix of L and R’ a prefix of R. L’ or R’ or both may have zero tokens.

The derived instances will cover all combinations of L’ and R’.

Sure the size of the training set could explode if applied to a large corpus and the lengths of L and R are not bounded. Okay, bound them.

Recap

Okay, let’s see where we are. Consider the examples earlier in this post.

… within matter of minutes …
… capitalize first letter in word …

Assuming our training corpus is rich enough, for example, all of Wikipedia pre-segmented into sentences, we should have no difficulty whatsoever in detecting where the articles are missing in these two sentences, and recommending specific fixes. The sentences that would result from applying these fixes are

… within a matter of minutes …
… capitalize the first letter in the word …

Now consider

… within the matter of minutes …

Using our trained model we can detect that the here should probably be a.

Prediction Algorithm

To this point, we’ve only discussed informally how we might use the learned rules to identify issues, not in fine detail. We now close this gap.

Consider a window LmR on which we want to compare m with the predictions from the rules that apply in this situation. For example, were LmR to be

… within the matter of minutes …

we’d want to predict off the rules L’ _the_ R’, where L’ is [within] or [], and R’ is [matter, of, minutes], [matter, of], [matter], or [] and from these predictions somehow come up with a final prediction.

The approach we will take is the following. We assume that we are given some cutoff, call it c, on the minimum value that P(m’|L, R) needs to be for us to surface that our method predicts m’ in the context (L, R).

We will examine our rules in order of nonincreasing|L’|+|R’|. Here |T| denotes the number of tokens in a list T. We will stop as soon as we find some m’ for some L’, R’ such that P(m’|L’, R’) is at least c.

In plain English, we are doing this. Among all the rules that apply to a particular situation, we are finding one that is sufficiently predictive of some outcome in M and is also the most general among those that do.

Try These

Consider these examples, also from https://en.wikipedia.org/wiki/Shannon_Lucid

I have removed the articles. I would like the reader to guess where an article should go and what it should be: the, a, or an.

… included trip to …
… had different payload …
… on wide variety of …
… was teaching assistant …
… and bought preowned Piper PA-16 Clipper …
… as graduate assistant in Department of Biochemistry and
Molecular Biology …
… transferred to University of Oklahoma …

Look below only after you have made all your predictions.

The instances from which these were derived were

… included a trip to …
… had a different payload …
… on a wide variety of …
… was a teaching assistant …
… and bought a preowned Piper PA-16 Clipper …
… as a graduate assistant in the Department of Biochemistry and
Molecular Biology …
… transferred to the University of Oklahoma …

How good were your predictions? If you did well, the method described so far in this post would also have worked well.

Mini Project

If you are interested in a mini project that could be implemented in hours, consider this. Write a script, probably just a few lines, to input a text document and output a labeled training set. Then inspect the labeled training set to get a sense of whether it contains useful instances for the prediction of the locations and the identities of the articles.

If your assessment shows potential and you have the time, you can then consider taking it further. Perhaps use an existing ML implementation, such as from scikit-learn, on the training set. Or implement the method from scratch.

Now some more detail will help with your script. Consider limiting the context to L and R to be exactly one word each. Scan the words in the document in sequence, and construct the negative and positive instances on the fly. Ignore sentence boundaries unless you have access to an NLP tool such as NLTK and can use its tokenizer to segment the text into sentences.

Deposit the constructed instances incrementally into a pandas data frame of three columns L, R, M. M is the set we chose in this section. Output this data frame to a CSV file.

How to get a reasonable training set for your script? Download a Wikipedia page or two by copying and pasting.

Issues Involving Commas

Next, let’s turn our attention to issues involving commas. In [1] we covered some simple scenarios. The ones below are more nuanced.

Consider this from https://en.wikipedia.org/wiki/Zork

In Zork, the player explores …

First off, let’s observe that to apply our method, we should retain the comma as a separate token. Then the problem looks like the one we addressed earlier, the one on articles. It would make sense to choose M = {_comma_, _none_}. That is, the supervised learning problem is to predict whether there is a comma or not in the context (L, R).

From what we have seen so far, while the rules we learn might be effective, they may not generalize adequately. This is because the last token of the left context would be Zork. We are not really learning the general pattern

In <token of a certain type> _comma_ the

Is there a straightforward way to generalize our method so it can learn more general rules?

The answer is yes. Here is how.

We’ll introduce the concept of an abstracted token. We’ll start with a single abstraction that is relevant to our example. Later in this post, we’ll introduce other abstractions as needed.

We’ll assume the word on which this abstraction is applied contains only characters from a through z. That is, no digits; no special characters.

This abstraction will produce one of three strings: /capitalized/ denoting that the word begins with a letter in upper case followed by zero or more letters in lower case, /all_lower/ denoting that all the letters in the word are in lower case, and /all_caps/ denoting that all the letters in the word are in upper case.

Next, we will derive new sequences of tokens from the existing ones by selectively applying this abstraction operator.

Let’s elaborate on “selectively”. If for every token in the sequence, we considered two possibilities, the original token or the abstracted one, we would get a combinatorial explosion of generated sequences.

To mitigate this issue, we will only abstract out the tokens that occur sufficiently infrequently if at all in our training set. Or abstract out only those that yield /capitalized/ or /all-caps/.

Below is the sequence we may derive from In Zork, the

In /capitalized/, the

We only abstracted Zork as it is both capitalized and an uncommon word.

Now imagine that we add, to the training set, new labeled instances derived from the abstracted sequences. The label is the one associated with the original sequence.

In our example, the derived labeled instance would be

In /capitalized/ _comma_ the

Now we train our algorithm exactly as before. It will learn the generalized rules as well.

Note that when we say “add new labeled instances to the training set” we are not implying that this needs to be done offline. We can simply add these labeled instances on the fly. This is analogous to what is often done in ML practice.

Extract features from the input on-the-fly

Also, note that we described our method as “adding new labeled instances” only because we felt it was useful to explain it this way. We can view this alternately as if we did not add new labeled instances but merely extracted additional features.

This is because all the newly-added instances have the same label — the original one. So we can collapse them all into the original instance, just with additional features extracted.

More Nuanced Examples

Now consider these examples from https://en.wikipedia.org/wiki/Shannon_Lucid

Due to America’s ongoing war with Japan, when she was six weeks old, 
the family was detained by the Japanese.

They moved to Lubbock, Texas, and then settled in Bethany, Oklahoma, the
family's original hometown, where Wells graduated from Bethany High School
in 1960.

She concluded that she had been born too late for this, but discovered
the works of Robert Goddard, the American rocket scientist, and decided
that she could become a space explorer.

These ones are more intricate.

Nonetheless, we will continue with our method for the reasons we mentioned earlier in the post. One is that a basic yet meaningful version can be implemented in days if not hours from scratch. (No ML libraries needed.)

To these, we’ll add one more. This method’s predictions are explainable. Specifically, if it detects an issue and makes a recommendation, then the specific rule that was involved can be attached as an explanation. As we’ve seen, rules are generally transparent.

Okay, back to the examples.

Let’s examine the scenarios involving commas in the above examples one by one. We won’t examine all.

With those that we do examine, we will also weigh in on whether we think our current method has a good chance of working as is. These inspections will also generate ideas for further enhancement.

Consider

Due to America’s ongoing war with Japan, when she was six weeks old

The sequence we derive from this is

Due to /capitalized/’s ongoing war with /capitalized/, when she was six 
weeks old

The labeled instances derived from these two sequences also include all combinations of suffices of the left context paired with prefixes of the right context. In the terminology of machine learning, this means that we are enumerating lots of hypotheses in the space of hypotheses (in our setting, the hypotheses are the rules).

The point we are trying to make in the previous paragraph is that by generating lots of hypotheses, we increase the likelihood of finding some rules that are sufficiently predictive.

Of course, there is no free lunch. This impacts the training time and model complexity as well.

This also assumes that we are somehow able to discard the rules we found during this process that turned out to be noisy or ineffective. Specifically, those that were either insufficiently predictive or could be covered by more general rules that are equally predictive.

In a section downstream in this post, we will address all these issues. That said, only an empirical evaluation over a wide range of scenarios would ultimately reveal how effective our approach is.

Back to this specific example. First, let’s see it again.

Due to /capitalized/’s ongoing war with /capitalized/, when she was six 
weeks old

There is a fair chance our method will work adequately as-is. If not on this particular one, then at least on similar examples. Furthermore, nothing specific comes to mind in terms of enhancements. So let’s move on to other examples.

Next, consider

when she was six weeks old, the family was detained by the Japanese.

We think the current method, as is, is likely to work for this. Why? Consider

… when she was six weeks old the family was detained by …

Would you not consider inserting a comma between old and the based on this information alone? (I do mean “consider”.)

If you would, the algorithm could also work well. It sees the same information.

Next, consider

They moved to Lubbock, Texas
then settled in Bethany, Oklahoma

The abstraction we presented earlier, which abstracts certain words out into /capitalized/, /all_lower/, or /all_caps/ should help here.

If it doesn’t help adequately, we can tack on a second, finer abstraction. Specifically, involving detecting the named entities city and state. These would let us derive two new sequences.

They moved to /city/, /state/
then settled in /city/, /state/

Even More Nuanced Cases

Below are even more nuanced examples of issues involving commas. These are also from https://en.wikipedia.org/wiki/Shannon_Lucid

Originally scheduled as one mission, the number of Spacelab Life Sciences 
objectives and experiments had grown until it was split into two
missions,[57] the first of which, STS-40/SLS-1, was flown in June 1991.

To study this, on the second day of the mission Lucid and Fettman wore
headsets, known as accelerometer recording units, which recorded their
head movements during the day. Along with Seddon, Wolf and Fettman, Lucid
collected blood and urine samples from the crew for metabolic experiments.

These suggest that we probably need to allow for quite long left and right contexts, possibly up to 20 words each. And maybe add more abstractions.

Keeping abstractions aside, how will this impact our model training? First of all, since we are learning a complex model, we’ll need our training set to be sufficiently large, rich, and diverse. Fortunately, such a data set can be assembled without much effort. Download and use all of Wikipedia. See [9].

Okay, now onto training time. This can be large as we have a huge training set combined with a complex model we are trying to learn, one involving lots and lots of rules. Of course, the learned model itself is potentially huge, with perhaps the vast majority of the learned rules turning out to be noisy.

Later on in this post, we will discuss these challenges in detail and how to mitigate them. In particular, we will posit specific ways to weed out rules that are insufficiently predictive or those that can be covered with more general rules that remain sufficiently predictive.

For now, let’s move on to the next use case, which is

Issues Involving Prepositions Or Other Connectives

Now consider these examples, also from https://en.wikipedia.org/wiki/Shannon_Lucid which I have mutated slightly. Specifically, I replaced certain connectives with others that are somewhat plausible though not as good.

… participated on biomedical experiments …
… satellites were launched in successive days …
… initiated its deployment with pressing a button …

Can you spot the errors and fix them?

Below are the original, i.e. correct, versions.

… participated /in/ biomedical experiments …
… satellites were launched /on/ successive days …
… initiated its deployment /by/ pressing a button …

If you did well, so will the method.

Now to the modeling. We will let M denote the set of connectives we wish to model. M could be defined, for example, by the words tagged as prepositions by a certain part-of-speech tagger. Or some other way.

Regardless, we will need to ensure that we can determine with certainty and reasonably efficiently whether or not a particular token is in M.

This is because during training, while scanning a particular text, we will need to know, for every word, whether it is an instance of M or not.

To keep things simple, we will leave _none_ out of M. This means that we will only be able to model replacement errors, i.e., using the wrong connective. It is easy to add _none_ in but it clutters up the description a bit.

Singular Versus Plural

Consider these examples, with the words we want to examine for the so-called grammatical number highlighted in bold.

As we’ve seen, for some of the /problems/ we are trying to solve, we may 
need long left and right /contexts/. Up to 20 /words/ each. Perhaps longer.

We've also discussed that we'd preferably want a very rich data /set/ for
training.

First off, let’s ask how we would even detect the words in /…/ in an automated fashion. Here is a start. We could run a part-of-speech tagger and pick up only nouns.

Let’s try this out on our examples. Using the part-of-speech tagger at https://parts-of-speech.info/ we get

The color codes of the various parts of speech are below.

This, while not great, seems good enough to start with. It got problems, contexts, and words correctly. It had a false positive, and, and a false negative, set. It also picked up training which perhaps we don’t care about.

As we will discuss in more detail later, while the false positives may yield additional irrelevant rules, these will tend not to be harmful, only useless. Furthermore, we’ll catch them during the pruning phase.

That said, if we are concerned about accuracy upfront, we might consider a more advanced part-of-speech tagger. Or some other way to refine our detection approach. We won’t pursue either in this post.

Next, we’ll do a type of preprocessing we haven’t yet had to do in any of our use cases discussed thus far. Say the procedure we described in the previous paragraph detects a particular word that is the object of our study. By “object of our study” we mean whether it should be in singular or in the plural.

Right after we have detected such a word, we will run a grammatical number classifier, possibly one using a very simple heuristic such as if the word ends with s or ies deem it plural else deem it singular. Next, we will add _singular_ or _plural_ to a copy of our text, depending on this classifier’s prediction. Importantly, we will also singularize the word which precedes the label.

In our examples, after all of this has been done, and using the part-of-speech tagger we used earlier, we will get

As we’ve seen, for some of the *problem* _plural_ we are trying to solve, 
we may need long left and _singular_ right *context* _plural_. Up to 20
*word* _plural_ each. Perhaps longer.

So our M will be the set { _singular_, _plural_ }.

Note that the left context includes the word whose grammatical number we are trying to predict. This is by design. This is why we added the labels explicitly to the text.

Also, note that the words flanked by asterisks are the ones we singularized. We did so because these words are in the left context of the label to be predicted, and we want to strip off any information in the word itself that could be used to predict its label, other than whatever is inherent in its singularized form.

If we didn’t singularize these words we would have label leakage. This would have bad consequences. We might learn rules that seem to be good but don’t work well at prediction time.

Next, let’s do a quick review of the text as a sanity check, to assess whether the contexts seem to have enough signal to predict at least better than random. How accurately we can predict the labels will have to await an empirical evaluation.

It does seem that “for some of the problem” predicts _plural_. “left and right context” would also seem to predict _plural_ better than random, though how much better is hard to say without seeing more examples. Similarly, “Up to 20 word” would seem to predict _plural_. The prediction might improve, and would certainly generalize better, were we to use the abstraction that 20 is an _integer_greater_than_1_.

Model Complexity, Training Time, And Lookup Time

As we’ve seen, for some of the problems we are trying to solve, we may need long left and right contexts. Up to 20 words each. Perhaps longer.

We’ve also discussed that we’d preferably want a very rich data set for training. Such as all of Wikipedia. Our mechanism also relies on abstractions, which amplify the size of the training set possibly by another order of magnitude.

So is this a show-stopper for our method? Well, no. We can substantially prune the size of the model and substantially speed up training. We’ll discuss these below individually. We’ll also discuss how to be fast at what we are calling lookup time, as it will impact both training time and prediction time.

Reducing The Model Size

Let’s start with the model size. First off, keep in mind that modern large-scale models routinely use billions of parameters, so we may be okay even without any pruning. That said, we’ll cover pruning anyhow.

When considering whether a particular rule should be deleted or not, we will distinguish between two cases.

  • Is the rule insufficiently predictive?
  • Is a more general rule sufficiently predictive compared to this one?

Our main reason for distinguishing between these two cases is that we will not explicitly prune for the first case. Instead, we will rely either on the second case addressing the first one as well, or on the prediction algorithm doing on-the-fly pruning sufficiently well. With regard to the latter, note also that the prediction algorithm takes the cutoff c as a parameter, which lets us be more conservative or more sensitive at prediction time.

Okay, with that out of the way, let’s address the second case.

To explain this method, let’s start with a learned rule LMR that is general enough. Here is an example.

from M learned

We deem it general because the left and the right contexts are a single word each.

Imagine that in the training corpus, the expression from a learned model appears at least once somewhere. So we would also have learned the rule

from M learned model

This rule is more specific. So we will deem it a child of the rule

from M learned

Now that we have defined child-parent relationships we can arrange the rules into a tree.
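To make the child-parent relationship concrete, here is one way it could be expressed in code, under our reading that the more general rule’s left context must be a suffix of the more specific rule’s left context, and its right context a prefix of the more specific rule’s right context (the function name and representation are ours):

def is_more_general(general, specific):
    # Each rule's contexts are tuples of tokens: general = (L1, R1), specific = (L2, R2).
    gl, gr = general
    sl, sr = specific
    return (len(gl) + len(gr) < len(sl) + len(sr)
            and sl[len(sl) - len(gl):] == gl
            and sr[:len(gr)] == gr)

# ('from' M 'learned') is more general than ('from' M 'learned model'):
print(is_more_general((("from",), ("learned",)),
                      (("from",), ("learned", "model"))))   # True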

Now we are ready to describe the pruning criterion. For a particular node v in the tree, if all of v’s descendants predict the same outcome as v does, we will prune away all the nodes below v in its subtree.

Let’s apply this to our example. In the setting M = {_a_, _an_, _the_, _none_}, the rule

from M learned model

predicts the same outcome, _a_, as does

from M learned

Furthermore, imagine that the latter rule has only one rule, the former, in its subtree. So we prune away the former.

Okay, we’ve defined the pruning criterion. Next, we discuss how to do the actual pruning, i.e. apply the criterion efficiently. The short answer is bottom-up.

We start with the leaves of the tree and find their parents. We then consider each of these parents one by one. We prune away a parent’s children if they all predict the same outcome as the parent.

We now have a new tree. We repeat this same process on it.

We stop when we can’t prune further or when we have pruned enough.
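Here is a minimal sketch of that bottom-up pass, assuming each rule is stored as a node with its predicted outcome and a list of children (this representation is ours, chosen for illustration):

class RuleNode:
    def __init__(self, outcome):
        self.outcome = outcome   # the outcome this rule predicts
        self.children = []       # more specific rules

def prune(node):
    # Post-order: prune each child's subtree first, then drop the children
    # if they are all leaves predicting the same outcome as this node.
    for child in node.children:
        prune(child)
    if node.children and all(
        not c.children and c.outcome == node.outcome for c in node.children
    ):
        node.children = []

The post-order recursion has the same effect as repeatedly sweeping the tree from the leaves upward until nothing more can be pruned.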

Speeding Up Training

On the one hand, we only need one pass over the sentences in the training set. Moreover, we only need to stop at the tokens that are instances of M, pausing to update the various counters as described earlier. That’s good.

On the other hand, at a particular stopping point m, we may need to enumerate all admissible windows LmR so we can increment their counters involving m. For each of these, we also need to derive additional windows based on the abstractions we are modeling.
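As a sketch, enumerating the candidate (L, R) windows around a stopping point might look like this; max_left and max_right are illustrative bounds, not values prescribed by the method:

def admissible_windows(tokens, m_index, max_left=3, max_right=3):
    # All (L, R) context pairs around tokens[m_index], up to the given lengths.
    windows = set()
    for l_len in range(max_left + 1):
        for r_len in range(max_right + 1):
            if l_len == 0 and r_len == 0:
                continue   # skip the empty context pair
            L = tuple(tokens[max(0, m_index - l_len):m_index])
            R = tuple(tokens[m_index + 1:m_index + 1 + r_len])
            windows.add((L, R))
    return windows

Each such window, plus any abstracted variants of it, would then have its counter for the outcome observed at m incremented.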

We’ve already discussed how to constrain the abstractions, so we won’t repeat that discussion here.

The key point we’d like to bring out is that pruning the model in the manner we described earlier not only reduces the model’s size, it also speeds up subsequent training. This is because, at any particular stopping point m, there will generally be far fewer rules that trigger in the pruned model compared to the unpruned one.

Lookup Time

By lookup, we mean that we want to efficiently look up rules that apply in a particular situation. Let’s start with an example. Say we have learned the rule

is M man

for the issue involving articles. Recall that we chose M to be { a, an, the, _none_ }.

Now consider the text Jeremy is a man. We want to scan it for issues. We will stop at a, since a is in M. We then want to check the following, in order. For this M, is there a rule with L = [is] and R = []? Is there a rule with L = [] and R = [man]? Is there a rule with L = [is] and R = [man]? And so on. Let’s call each such “is there a rule?” check a lookup. A lookup takes M, L, and R as inputs.

We obviously want the lookups to be fast. We can make this happen by indexing the set of rules in a hashmap, call it H, keyed on the triple (M, L, R). Think of H as a three-level nested map, accessed as H[M][L][R].
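A minimal sketch of such an index in Python, with contexts stored as tuples so they can serve as dictionary keys; here we key the outer level by a name for the problem (e.g. “articles”) rather than the set M itself, and the helper names are ours:

from collections import defaultdict

H = defaultdict(lambda: defaultdict(dict))   # H[problem][L][R] -> predicted outcome

def add_rule(problem, L, R, outcome):
    H[problem][tuple(L)][tuple(R)] = outcome

def lookup(problem, L, R):
    # Returns the rule's outcome if a rule with exactly these contexts exists, else None.
    return H[problem].get(tuple(L), {}).get(tuple(R))

add_rule("articles", ["is"], ["man"], "a")
print(lookup("articles", ["is"], ["man"]))   # -> 'a'
print(lookup("articles", ["is"], []))        # -> None, no such rule learned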

Summary

In this post, we covered elaborate versions of scenarios involving detecting and correcting errors in text. By “elaborate” we mean those in which context seems important. We covered issues involving missing or incorrect articles, missing commas, using singular when it should be plural or the other way around, and using the wrong connective, such as the wrong preposition.

We modeled each as a self-supervised learning problem. We described an approach that works on all these problems. It is based on a probability distribution on the space of outcomes conditioned jointly over a left context and a right context. The definition of the outcomes and some preprocessing do depend on the particular problem.

We discussed enumerating left-context and right-context pairs of increasing length, as well as abstraction mechanisms for learning more general rules.

The method we described is straightforward to implement in its basic form.

We also described how to prune the set of learned rules, how to speed up training, and how to efficiently look up which of the rules apply to a particular situation.

