How to Create a Custom NER in Spacy 3.5 | by Angelica Lo Duca | Apr, 2023

Natural Language Processing

A quick tutorial on extracting custom entities from a text

Photo by Max Chen on Unsplash

Are you tired of using generic named entity recognition (NER) models that don’t quite fit your specific needs? Look no further! This article will guide you through creating a custom NER model in spaCy 3.5.

With a few tweaks and training data, you can have a model that accurately identifies entities specific to your domain or use case. Say goodbye to one-size-fits-all NER models and hello to customized precision. Let’s dive in!

We’ll cover:

  • A very quick introduction to spaCy and its competitors
  • Problem setting
  • Generating a training set
  • Generating and training the model
  • Testing your model.

If it’s the first time you’ve heard of spaCy, know it’s a popular open-source library for natural language processing (NLP) in Python. It provides efficient and fast NLP capabilities, such as tokenization, part-of-speech tagging, entity recognition, dependency parsing, and more. SpaCy’s main strength lies in its speed and memory efficiency, making it an ideal choice for large-scale text processing tasks.

Some alternatives to spaCy include:

  • NLTK (Natural Language Toolkit), one of the oldest and most comprehensive NLP libraries, offers a wide range of tools for text analysis, including sentiment analysis, stemming, and lemmatization.
  • Stanford CoreNLP supports multiple languages, including English, German, and French, with robust features such as named entity recognition and co-reference resolution.
  • Spark NLP provides production-grade, scalable, and trainable versions of the latest NLP research for Python, Java, and Scala.

Let’s imagine we have a text from which we want to extract entities (people, places, etc.). If the entities are classic, such as people, places, dates, etc., we can easily use a pre-trained NER made available by spaCy.

However, a generic pre-trained model cannot extract domain-specific entities from our text. Examples of such entities are dog breeds, the names of bacteria, etc. To recognize entity types like these, we need a model adapted to our domain.

The following figure shows the workflow to build a new custom NER model:

Image by Author

We start with a generic, already pre-trained NER model and then adapt it to our domain, providing the model with additional training data.

Therefore, the first thing to do is to build the training set with the texts annotated exactly with the entities to be extracted. We then build the model and train it with our annotated texts.

Finally, we use the new model to predict the entities in new texts.

Now let’s see how to implement this workflow in practice with Python and spaCy.

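Start by defining the entity types you want to extract. For example, you could extract the animal type: dog, cat, horse, etc. Then, split your dataset into training and test sets. Annotate the training set; if you also want to measure the model’s accuracy, annotate the test set too, because evaluation needs gold labels to compare against.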
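As a minimal sketch of the split, assuming your dataset is simply a list of raw text strings: the function name split_dataset, the 80/20 ratio, and the fixed seed are illustrative choices, not something the workflow prescribes.

```python
import random


def split_dataset(texts, test_fraction=0.2, seed=42):
    """Shuffle a copy of texts and split it into (train, test) lists."""
    shuffled = list(texts)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


texts = [f"Example sentence number {i}" for i in range(10)]
train_texts, test_texts = split_dataset(texts)
print(len(train_texts), "training texts,", len(test_texts), "test texts")
```

Splitting before annotating makes it harder to accidentally tune the annotations to the texts you will later test on.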

Follow the steps described below to generate a training set you can use as input to spaCy:

  • First, annotate your text. Use https://tecoholic.github.io/ner-annotator/ to perform the annotation.
  • Export the annotated file and name it, for example, annotations.json.
  • Open the annotations.json file and remove the first part, which lists the classes. Make sure the remaining JSON is still valid (remove the leftover {} braces if needed), then save the file. In the example below, remove the classes:
Image by Author
  • Convert the JSON file to spaCy’s binary format (a .spacy file). The code for this step was originally implemented by Zachary Lim in his article.
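
As a minimal sketch of this conversion (not necessarily identical to Zachary Lim’s original code), assuming each annotation is a pair of the form [text, {"entities": [[start, end, label], ...]}] and that the pairs sit either at the top level of the JSON or under an "annotations" key. The helper names annotations_to_docbin and convert_file are hypothetical, and the default language "it" only mirrors the Italian model used later in this tutorial.

```python
import json

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def annotations_to_docbin(annotations, nlp):
    """Build a DocBin from (text, {"entities": [[start, end, label], ...]}) pairs."""
    doc_bin = DocBin()
    for text, annot in annotations:
        doc = nlp.make_doc(text)  # tokenize only; no trained components needed
        spans = []
        for start, end, label in annot["entities"]:
            # char_span returns None when the offsets do not line up with tokens
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        # doc.ents cannot hold overlapping spans, so keep only non-overlapping ones
        doc.ents = filter_spans(spans)
        doc_bin.add(doc)
    return doc_bin


def convert_file(json_path, spacy_path, lang="it"):
    """Read the annotation JSON and write it in spaCy's binary format."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the pairs are at the top level or under an "annotations" key
    annotations = data["annotations"] if isinstance(data, dict) else data
    nlp = spacy.blank(lang)
    annotations_to_docbin(annotations, nlp).to_disk(spacy_path)
```

Calling convert_file("annotations.json", "train.spacy") would then write the training file. The same function could be applied to an annotated test set to produce a dev.spacy file.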

Now your training set is stored in a file named train.spacy.

To generate and train the model, follow the steps described below:

Image extracted from the spaCy website
  • Download the configuration file (base_config.cfg) generated by the quickstart widget on the spaCy training page by clicking on the bottom-right download button. Save it in the same folder as annotations.json.
  • Download the base model you will use to train your data. Open base_config.cfg to see which pre-trained model you are using. The following example downloads the it_core_news_lg model:
python -m spacy download it_core_news_lg
  • Run the following command to fill in the remaining default settings and produce the final config.cfg:
python -m spacy init fill-config base_config.cfg config.cfg
  • Run the following command to train the model:
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

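The training command requires a dev.spacy file containing the annotated test set, which spaCy uses to score the model during training. If you don’t have one, you can pass your training set (train.spacy) instead, but the reported scores will then be overly optimistic.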

The training process may require some time. At the end of the training process, you should see an output similar to the following one:

Image by Author

Now your model is saved in the output/model-best directory. Load it as follows in a Python script:

import spacy

nlp = spacy.load('output/model-best')

Now use your just-trained model to extract some entities:

doc = nlp('My simple text')

from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)  # display in Jupyter
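
Outside a Jupyter notebook, the extracted entities can also be read directly from doc.ents. The following is a minimal sketch: the helper name list_entities is hypothetical, and a blank English pipeline with an entity ruler stands in for the trained model so that the example runs on its own.

```python
import spacy


def list_entities(doc):
    """Return (text, label, start_char, end_char) for every entity in doc."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


# Stand-in pipeline so the sketch runs without a trained model (an assumption);
# with a real model, use: nlp = spacy.load("output/model-best")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ANIMAL", "pattern": "dog"}])

for text, label, start, end in list_entities(nlp("My dog is asleep")):
    print(f"{label}: {text!r} at characters {start}-{end}")
```

Returning plain tuples keeps the result easy to store in a file or a data frame for later inspection.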

Congratulations! You have just learned how to train your custom model for NER in spaCy!

Creating a custom NER model in spaCy 3.5 is a straightforward process once you have annotated training data and a training configuration. Now that you know how to create a custom NER model with Python and spaCy, you can start developing models for whatever application you need.


