How to Create a Custom NER in Spacy 3.5 | by Angelica Lo Duca | Apr, 2023

By Jessie Hobb On Apr 25, 2023


A quick tutorial on extracting custom entities from a text

Are you tired of using generic named entity recognition (NER) models that don’t quite fit your specific needs? Look no further! This article will guide you through creating a custom NER model in spaCy 3.5.

With a few tweaks and training data, you can have a model that accurately identifies entities specific to your domain or use case. Say goodbye to one-size-fits-all NER models and hello to customized precision. Let’s dive in!

We’ll cover:

A very quick introduction to spaCy and its competitors
Problem setting
Generating a training set
Generating and training the model
Testing your model.

If it’s the first time you’ve heard of spaCy, know it’s a popular open-source library for natural language processing (NLP) in Python. It provides efficient and fast NLP capabilities, such as tokenization, part-of-speech tagging, entity recognition, dependency parsing, and more. spaCy’s main strength lies in its speed and memory efficiency, making it an ideal choice for large-scale text processing tasks.

Some alternatives to spaCy include:

NLTK (Natural Language Toolkit), one of the oldest and most comprehensive NLP libraries, offers a wide range of tools for text analysis, including sentiment analysis, stemming, and lemmatization.
Stanford CoreNLP supports multiple languages, including English, German, and French, with robust features such as named entity recognition and co-reference resolution.
Spark NLP provides production-grade, scalable, and trainable versions of the latest NLP research for Python, Java, and Scala.

Let’s imagine we have a text from which we want to extract entities. If the entities are common ones, such as people, places, and dates, we can simply use a pre-trained NER model made available by spaCy.
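As a minimal sketch of the pre-trained route, the snippet below lists the entities that a stock spaCy pipeline finds in a sentence. It assumes the small English pipeline `en_core_web_sm` has already been installed (for example with `python -m spacy download en_core_web_sm`); the sample sentence and the helper names `entity_pairs` and `demo` are made up for illustration.

```python
def entity_pairs(doc):
    """Return (text, label) pairs for each entity found in a processed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo():
    # spaCy is imported here so the helper above can be reused on its own.
    import spacy

    # Assumption: the small English pipeline has been downloaded beforehand.
    nlp = spacy.load("en_core_web_sm")

    # Illustrative sentence containing a person, a place, and a date.
    doc = nlp("Angela Merkel visited Paris in July 2019.")
    for text, label in entity_pairs(doc):
        print(text, label)
```

Calling `demo()` prints one line per detected entity. The labels you see depend on the pipeline you load.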

However, a generic pre-trained model cannot extract domain-specific entities from our text. Examples of such entities are dog breeds, the names of bacteria, etc. To recognize these entity types, we need a model adapted to our domain.

The following figure shows the workflow to build a new custom NER model:

We start with a generic, already pre-trained NER model and then adapt it to our domain, providing the model with additional training data.

Therefore, the first thing to do is to build the training set: texts annotated with exactly the entities we want to extract. We then build the model and train it on our annotated texts.

Finally, we use the new model to extract entities from new texts.
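To make the idea of an annotated text concrete, here is a small sketch of one common way to represent it: the raw text paired with a list of (start, end, label) character offsets. The sentence, the offsets, and the ANIMAL label are illustrative assumptions, not data from this tutorial.

```python
# One annotated example: the text plus (start, end, label) character offsets.
# The end offset is exclusive, as in Python slicing.
TEXT = "My dog chased a cat."
ANNOTATION = (TEXT, {"entities": [(3, 6, "ANIMAL"), (16, 19, "ANIMAL")]})


def spans(example):
    """Return the (substring, label) pairs that an annotated example marks."""
    text, meta = example
    return [(text[start:end], label) for start, end, label in meta["entities"]]


print(spans(ANNOTATION))
# → [('dog', 'ANIMAL'), ('cat', 'ANIMAL')]
```

Slicing the text with the stored offsets is a quick sanity check: if the substrings are not the entities you meant to mark, the offsets are wrong.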

Now let’s see how to implement the described workflow in Python and spaCy practically.

Start by defining the entity types you want to extract. For example, you could extract the animal type: dog, cat, horse, etc. Then, split your dataset into training and test sets. Annotate only the training set.
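The split itself can be done with the standard library alone. The sketch below is one possible approach under stated assumptions: the function name `split_dataset`, the 80/20 ratio, and the fixed seed are illustrative choices, not something the workflow prescribes.

```python
import random


def split_dataset(texts, test_fraction=0.2, seed=42):
    """Shuffle the texts reproducibly and return (train, test) lists."""
    shuffled = list(texts)                  # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical corpus of ten short texts.
corpus = [f"text number {i}" for i in range(10)]
train, test = split_dataset(corpus)
print(len(train), len(test))
# → 8 2
```

Only the `train` list would then go through the annotation step described next.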

Follow the steps described below to generate a training set you can use as input to spaCy:

First, annotate your text. Use https://tecoholic.github.io/ner-annotator/ to perform the annotation.
Export the annotated file, for example as annotations.json.
Open the annotations.json file and remove the first part, which lists the classes. Make sure the file is still valid JSON afterwards (remove the {} braces if needed). Save the file. In the example below, remove the classes:
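If you prefer not to edit the file by hand, the same cleanup can be scripted. The sketch below assumes the export is a JSON object with a top-level "classes" list next to an "annotations" list; that layout, the miniature sample, and the helper names `strip_classes` and `clean_file` are assumptions for illustration, so check them against your own export and against the shape the conversion step expects.

```python
import json


def strip_classes(data):
    """Return a copy of the exported data without the "classes" entry."""
    if isinstance(data, dict):
        return {key: value for key, value in data.items() if key != "classes"}
    return data  # already a plain list of annotations: nothing to remove


def clean_file(src_path, dst_path):
    """Read an export, drop the classes, and write valid JSON back out."""
    with open(src_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(strip_classes(data), f, ensure_ascii=False, indent=2)


# Hypothetical miniature export, used only to illustrate the assumed layout.
sample = json.loads(
    '{"classes": ["ANIMAL"], '
    '"annotations": [["My dog chased a cat.", '
    '{"entities": [[3, 6, "ANIMAL"], [16, 19, "ANIMAL"]]}]]}'
)
cleaned = strip_classes(sample)
print(json.dumps(cleaned))
```

Because the file is parsed and re-serialized with the `json` module, the result stays valid JSON without any manual brace fixing.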

Convert the JSON file to the spaCy format. Use the following code, originally implemented by Zachary Lim in his article.