
Improving the NER model with patent texts | SpaCy, Prodigy and a bit of magic 🪄 | by Nikita Kiselov | Jun, 2022



Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities … into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

… NER is also known simply as entity identification, entity chunking and entity extraction.

In a nutshell, we recognise meaningful entities in the text and classify them into pre-defined categories.

Why is it useful?

It can be helpful in many areas:

  • HR (Speed up the hiring process by summarizing CVs)
  • Healthcare (extracting essential information from lab reports)
  • Recommendation engines (check this story from Booking.com!)

How can it be improved?

Datasets with the most popular entity categories, like Organisations or Brands, already exist. But what about a custom dataset? Usually, people annotate the dataset manually, which is expensive and time-consuming, and the quality still depends on the initial data.

In this article, I propose another approach: using text that is already categorised, and highly saturated with relevant terms, to train a NER pipeline for a specific domain.

Photo by Markus Winkler on Unsplash

A patent, or the text of a patent to be precise, is a document that provides highly specific information about the patented invention. In other words, it describes the invention in terms of its functionality within its domain. This means the patent will contain terms specific to that field, and the brevity of the text contributes to the ease of data retrieval.

One source, for example, is Google Patents, which provides access to the International Patent Classification database:

But more importantly, patent text is copyright-free. According to the USPTO:

Subject to limited exceptions reflected in 37 CFR 1.71(d) & (e) and 1.84(s), the text and drawings of a patent are typically not subject to copyright restrictions.

That means we can freely use this text to train our model without being afraid of copyright claims, which is crucial for building commercial systems.

For the experiments, I chose the G06K (Recognition of data / Presentation of data) subsection of patents, where G06 is the computing/calculating section. Such text should be helpful for training the recognition of technical entities specific to the data analysis field, such as computer vision, signal processing, etc.

Since the patent text is specific to its domain, we need to extract every frequent named entity from it. To do that, we can use, for example, the Wikicorpus. In the project’s repository, you will find an already curated list called manyterms.lower.txt .

To extract relevant terms from the text, we can use CountVectorizer from scikit-learn. This way, we can remove terms that appear less often than some threshold and keep the terms with more mentions, and hence more sentences to train on.

Extracting entities from patent texts using CountVectorizer
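A minimal sketch of that filtering step, using the curated list from the repository; the frequency threshold and sample texts here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Candidate terms from the curated Wiki list
with open("manyterms.lower.txt") as f:
    candidate_terms = [line.strip() for line in f if line.strip()]

# Toy corpus for illustration; in practice, load the parsed patent texts
patent_texts = [
    "A convolutional neural network performs optical character recognition ...",
    "The signal processing unit applies a fourier transform to the input ...",
]

# Count only the candidate terms; ngram_range must cover multi-word phrases
vectorizer = CountVectorizer(vocabulary=candidate_terms, ngram_range=(1, 4))
counts = vectorizer.fit_transform(patent_texts)

# Keep terms mentioned at least `min_count` times across the whole corpus
min_count = 5  # assumed threshold
frequencies = counts.sum(axis=0).A1
frequent_terms = [
    term
    for term, freq in zip(vectorizer.get_feature_names_out(), frequencies)
    if freq >= min_count
]
```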

After extracting potential terms, we can move on to training the model. Here I used the spaCy library, since it makes preparing a NER pipeline simple and has comprehensive documentation. But first, we need to convert the dataset into the appropriate format.

Here I used the library’s PhraseMatcher class to find the entities from the pre-defined Wiki list.

Then, label each entity with the custom tag TECH and save it in the spaCy format.
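A sketch of that preparation step, reusing the frequent_terms and patent_texts from the previous snippet:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin, Span
from spacy.util import filter_spans

nlp = spacy.blank("en")

# Match the curated terms case-insensitively
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TECH", [nlp.make_doc(term) for term in frequent_terms])

doc_bin = DocBin()
for doc in nlp.pipe(patent_texts):
    spans = [Span(doc, start, end, label="TECH") for _, start, end in matcher(doc)]
    # Resolve overlapping matches by keeping the longest span
    doc.ents = filter_spans(spans)
    doc_bin.add(doc)

# Serialised training data in spaCy's binary format;
# the same routine produces the validation split (e.g. dev.spacy)
doc_bin.to_disk("./train.spacy")
```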

When the training and validation sets are ready, we can start training the model. Since spaCy is tailored for production, the configuration is very extensive. However, for a simple example, the basic one will suffice. To get it, just go to the spaCy website, select the model type, and download base_config.cfg .

Config options that we used | Screenshot from spacy.io

After that, initialize the full config with the following command:

Config initialization command
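For reference, the standard spaCy CLI call looks like this; the file names follow the quickstart convention:

```
python -m spacy init fill-config base_config.cfg config.cfg
```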
Image by the Author | Output after config initialisation

After this command, spaCy fills in all the other config parameters and prepares everything for training. Now we can run the training command:

Training command with parameters
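A sketch of that call; the output directory and data paths are assumptions:

```
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```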
Image by the Author | Output during training

Voilà, our NER model is trained! Now we can see the results.

Important to note: the trained NER model does not just label the entities seen in the pre-labelled training data. It also learns to find and recognise new entities based on the surrounding context.

To run a test, just load model-best and apply it to the desired text input. spaCy provides a nice visualisation of NER labelling, supported inside Jupyter notebooks.

Code to run model inference on the text sample
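A minimal sketch of such an inference snippet; the model path and sample sentence are assumptions:

```python
import spacy
from spacy import displacy

# Load the best checkpoint produced by training
nlp = spacy.load("./output/model-best")

doc = nlp("Convolutional neural networks are widely used in computer vision.")

# Highlight the recognised TECH entities inline in a Jupyter notebook
displacy.render(doc, style="ent", jupyter=True)
```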
Image by the Author | Output of the code above

⚠️ Prodigy is a paid, scriptable annotation tool widely used in the industry, developed by the spaCy team and integrated into their NLP library. It’s not a crucial part of creating the NER model, but it can still be helpful for some people. ⚠️

There is no limit to perfection, so we can improve our NER model further with Prodigy. One of its cool features is active learning. In a nutshell, this means we can tune the model on manually labelled data. However, since our model is already trained, we just need a handful of annotated examples, not the entire dataset.

How do we choose the examples? This is where Prodigy is helpful: it automatically serves the examples from the test set with the lowest model confidence scores.

Image by the Author | Command to run Active learning with Prodigy
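The call uses Prodigy’s ner.teach recipe; the dataset name and source file below are hypothetical placeholders:

```
prodigy ner.teach tech_ner_annotations ./output/model-best ./patents.jsonl --label TECH
```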

This command will open the annotation window, where we can review the suggested labels and fix them.

Image by the Author | Prodigy annotation window

These annotations are automatically saved to a Prodigy dataset. After annotating a few examples, we can fine-tune the model. Thanks to the integration with spaCy, we can launch the training of our model straight from Prodigy.

Command to run model tuning with prodigy annotated data
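A sketch of that call, assuming the annotations were saved to the hypothetical tech_ner_annotations dataset from the previous step:

```
prodigy train ./tuned-model --ner tech_ner_annotations --base-model ./output/model-best
```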

That’s it! We trained a NER model that can now recognise and label entities from a specific domain. Utilising a dataset of patents makes it possible to prepare the training data quickly, without manually curating it.

The full code and parsed data can be found here:

P.S.

NER is a crucial step towards entity linking, which allows building hierarchies and connections between recognised entities. Model validation is no less complicated in this case. But that will be described in my next post 😉


