
7 spaCy Features To Boost Your NLP Pipelines And Save Time | by Ahmed Besbes | Aug, 2022



I’ve never used spaCy beyond simple named entity recognition tasks. Boy, was I wrong.

Photo by Lucas Kapla on Unsplash

While working on an NLP project recently, I revisited the spaCy library and tried out many of its core functionalities for low-level linguistic tasks.

If you’ve never heard of it, spaCy is a modern natural language processing library that supports more than 66 languages, provides state-of-the-art speed, and has components for named entity recognition, part-of-speech tagging, entity linking, and much more.
Besides, it’s easily extensible with custom components and attributes.

Screenshot by the author

In this post, I’m going to share with you 7 tips to make the most out of spaCy and leverage it to industrialize and scale your NLP pipelines.

If you’re a data scientist working on NLP projects who wants to improve their toolbox and learn powerful new tools, you should definitely check out this post.

Without further ado, let’s have a look 🔍

PS: this is not a beginner introduction to spaCy. I’ll try my best to explain each line of code and concept, but some elementary notions may be skipped.
→ Here’s a link to get started with spaCy.

I’ve rarely seen a library with such a simple and easy-to-use interface.

Let’s say you want to ingest a bunch of text documents, apply some processing such as tokenization and lemmatization, then predict named entities, part-of-speech (POS) tags, and syntactic dependencies.

If you were to use traditional libraries, this would mean a lot of successive functions to call and pipe one after the other.

Using spaCy, however, all of this can be done in 3 lines of code.

  • import spaCy (duh!)
  • load a pretrained model
  • pass a list of texts to the model and extract Doc objects
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ...  # texts loaded in a list
docs = [nlp(text) for text in texts]

What’s inside the Doc object? — I hear you ask.

When you pass a text to the nlp object, spaCy turns it into a Doc object.

This Doc object is the result of a processing pipeline that includes:

  • Tokenization
  • Part-of-speech (POS) tagging
  • Dependency parsing
  • Named Entity Recognition (NER)
  • and potentially more custom components that you can design yourself
Screenshot by the author

More specifically, the Doc object is a list of Token objects resulting from the tokenization algorithm. Each token stores the output of each step as an attribute.

Here are some of them:

  • text: the text of the token
  • dep_: the syntactic dependency relation
  • tag_: the part-of-speech tag
  • lemma_: the base form of the token, with no inflectional suffixes

To learn more about the Token attributes and what you can get out of the Doc object, check out this page.
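
As a quick illustration, here’s a minimal sketch that loads a pretrained model and prints these attributes for each token (the sample sentence is my own):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sitting on the mat.")

for token in doc:
    # text, dependency relation, POS tag, and lemma of each token
    print(token.text, token.dep_, token.tag_, token.lemma_)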

spaCy is designed to handle a large volume of documents and is intended to be used in industrial contexts. As such, it provides built-in support for multiprocessing.

If you simply throw a large list of texts at the nlp object as shown previously, you won’t get any speed-up.

To speed up inference, you’ll have to use the nlp.pipe function, set the n_process argument to leverage multiprocessing, and set the batch_size argument to enable batching.
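
Here’s a minimal sketch (the argument values are illustrative — tune them to your dataset and hardware):

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["First document.", "Second document."] * 1000

# stream the texts through the pipeline with 2 worker processes,
# processing them in batches of 100 documents
docs = list(nlp.pipe(texts, n_process=2, batch_size=100))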

Be careful: batching doesn’t always pay off. It depends on your dataset and the model you’re using. Check out this page to learn more.

Only enable the components you need

There’s another trick you can use to speed up the spaCy processing pipeline: when you instantiate the spaCy object, only enable the components you’re interested in.

For example, if you want to use a named entity recognizer component, you don’t necessarily need a POS tagger or a dependency parser because these components are independent of the NER component.

You can disable these components using the disable argument when loading a model.

import spacy

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

There’s one spaCy functionality I overlooked for a long time and never really used despite its incredible efficiency. It’s the Matcher class.

This class is a rule-matching engine that allows you to match sequences of tokens based on patterns. Conceptually, this is similar to regular expressions but this class handles more complex patterns that rely on token annotations (i.e. attributes).

What does a spaCy pattern look like?

A token pattern is a dictionary that combines token attributes, operators, and properties.

If you’re not sure what operators and properties are, stick with me, and let’s look at the following examples.

→ Let’s say we want to match the word “Hello” inside your documents. (pretty straightforward, but fair enough)

To do this, we’d write a pattern that uses the token text attribute.

pattern = [
    {"TEXT": "Hello"}
]

Then, this pattern is added to a matcher.

from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloPattern", [pattern])

Once a spaCy document is passed to the matcher, we’ll get the matches in the form of tuples (match_id, token_start, token_end).

doc = nlp("Hello my friend!")
matcher(doc)
# [(10496072603676489703, 0, 1)]

Here’s the full code:
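
Putting the previous snippets together:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": "Hello"}]
matcher.add("HelloPattern", [pattern])

doc = nlp("Hello my friend!")
matches = matcher(doc)  # [(10496072603676489703, 0, 1)]

for match_id, start, end in matches:
    print(doc[start:end].text)  # Hello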

→ If we want to make the pattern case-insensitive, we’d use the LOWER key instead:

{"LOWER": "hello"}

→ Now, let’s say I want to match all the nouns in a document. Simple! Just use the POS key. This is a situation where relying on spaCy tokens’ attributes is useful.

{"POS": "NOUN"}

→ What if you want to match all tokens that have “love” or “hate” as a base form (i.e. lemma)? Use the LEMMA key coupled with the IN property.

{"LEMMA": {"IN": ["love", "hate"]}}

→ How would you detect two proper nouns that occur consecutively? Use the POS attribute and combine it with the {n} operator.

{"POS": "PROPN", "OP": "{2}"}

So far we’ve seen patterns for single tokens only. It actually gets more interesting when you stack multiple patterns together to match sequences of tokens (i.e. phrases).

Here are some use-cases where the Matcher class comes in handy:

  • Match a combination of the two consecutive tokens “buy/sell” and “bitcoin/dogecoin”.
  • Match sequences of tokens based on their POS attributes. Why is this useful? Imagine that, for whatever reason, you want to match expressions composed of an adjective followed by a noun (happy movie, bad restaurant)
pattern = [
{"POS": "ADJ"},
{"POS": "NOUN"},
]
  • Match dates, emails, URLs, numbers
pattern_email = [{"LIKE_EMAIL": True}]
pattern_url = [{"LIKE_URL": True}]
pattern_num = [{"LIKE_NUM": True}]
  • Match tokens based on their lemma and POS tags
# Matches "love cats" or "likes flowers"
pattern_lemma_pos = [
    {"LEMMA": {"IN": ["like", "love"]}},
    {"POS": "NOUN"}
]
  • Match tokens based on their length
pattern_length = [
    {"LENGTH": {">=": 10}}
]
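
To make this concrete, here’s a sketch that wires the adjective-followed-by-a-noun pattern into a matcher (the sample sentence is my own):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

doc = nlp("We watched a happy movie after eating at a bad restaurant.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# happy movie
# bad restaurant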

To learn more about the power of the Matcher engine, have a look at this page. This will give you a detailed overview of the syntax of the attributes, operators, properties, and how you compose them.

You can also play with the rule-based Matcher Explorer which allows you to create token patterns interactively and display the detection on your input texts.

Screenshot by the author

By using the EntityRuler class that spaCy provides, you can build your own named entity recognizer based on a dictionary of named entities.

→ This means that if you need to detect these named entities inside a bunch of documents, you don’t have to implement search logic or regular expressions to match them. All you need to do is pass the patterns to the EntityRuler, add the entity ruler as a component to the pipeline, and let spaCy handle the detection for you.

When using the EntityRuler, you can either inject patterns into a blank model or into a model that already has a NER component. In the latter case, spaCy does the work of combining the predictions of the statistical model with the rule-based patterns.

Let’s illustrate these two cases.

→ Example #1: adding an entity ruler to a blank model
Start with a blank English model and add patterns to it. In this example, we take a spaCy model that doesn’t have a NER module in it and pass it a list of patterns we would like to detect as named entities.
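
Here’s a minimal sketch of this first case (the patterns and labels are illustrative):

import spacy

# a blank English pipeline: tokenizer only, no statistical NER component
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {"label": "ORG", "pattern": "spaCy"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("I use spaCy in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('spaCy', 'ORG'), ('San Francisco', 'GPE')]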

This is a very simple example that doesn’t seem very useful. In practice, entity rulers are helpful when you add a large list of patterns with different labels (e.g. lists of diseases, chemical compounds, technology terms, very specific organization names, etc.) that pretrained models are not trained to detect.

→ Example #2: adding an entity ruler to a model with a NER component
Start with a spaCy English model that has a trained (i.e. statistical) NER component and add to it a list of patterns of additional entities not previously detected. The underlying goal is to increase the performance of the model.
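
Here’s a sketch of this second case (the pattern is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")

# add the ruler before the statistical NER component (see the note on ordering below)
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "spaCy"}])

doc = nlp("I use spaCy at Google in London.")
print([(ent.text, ent.label_) for ent in doc.ents])
# 'spaCy' comes from the ruler; 'Google' and 'London' from the statistical NER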

Here’s a use case where adding an entity ruler improves the predictions of a statistical model: detecting biomedical entities 🧪.

Patterns are constructed and curated from knowledge bases and ontologies, then injected into the spaCy model.

Why not retrain a new NER model?

Although this seems like a valid approach, retraining a model is certainly less efficient because it requires a large volume of data. In fact, the new patterns you’re interested in detecting must occur in different contexts (i.e. sentences) so that the model becomes able to capture them.
Adding these patterns on top of a trained model is usually a quick win.

Where to place the entity ruler, before or after the NER component?

The entity ruler can be added before or after the NER component.

→ If it’s added before the NER component, the NER component will respect the existing entity spans and adjust its predictions around them. This is the approach spaCy recommends in most cases.

→ If it’s added after the NER component, the entity ruler will only add spans to doc.ents if they don’t overlap with existing entities.

You can learn more about combining statistical models and entity rulers here.

spaCy provides built-in visualizers (displaCy and displaCy-ent) that are able to display interesting token attributes such as Part-Of-Speech (POS) tags, syntactic dependency graphs, named entities, or spans.

This visualization works directly in the browser or in a Jupyter notebook.

I personally use displaCy a lot to visualize named entities in order to debug my models.

→ Here’s a demo of the displaCy Named Entity Visualizer

Screenshot by the author

→ Here’s a demo of the displaCy Visualizer for dependency graphs (scroll horizontally to see the full graph)

Screenshot by the author

Here is a code example that visualizes entities.
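
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")  # sample sentence of my own

# in a Jupyter notebook, this renders the visualization inline
displacy.render(doc, style="ent")

# from a script, serve it in the browser instead:
# displacy.serve(doc, style="ent")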

If you run it from a notebook, you’ll have the following output:

Screenshot by the author

If you set the style attribute to dep, you’ll end up with the following output:

Screenshot by the author

Bonus for Streamlit users 📍
spaCy visualizers can also be embedded into Streamlit applications. If you’re a fan of Streamlit like me, you can use spacy-streamlit, a package that helps you integrate spaCy visualizations into your web apps.
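
For instance, a minimal app might look like this (a sketch, assuming spacy-streamlit is installed; run it with streamlit run app.py):

import spacy
import spacy_streamlit

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai is the CEO of Google.")

# render spaCy's NER visualization inside the Streamlit app
spacy_streamlit.visualize_ner(doc, labels=nlp.get_pipe("ner").labels)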

spaCy provides pretrained models for more than 66 languages. For each language, there are models of different sizes depending on their architecture and training corpus.

For example, if you’re interested in the English language, you’ll find these three different models:

  • en_core_web_sm (12MB)
  • en_core_web_md (40MB)
  • en_core_web_lg (560MB)

These models vary in memory consumption and (slightly) in accuracy. They all support the same tasks. Interestingly, only the medium and the large models have word vectors (i.e. embeddings).
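
A quick way to check this yourself (a sketch, assuming the medium model is installed):

import spacy

nlp = spacy.load("en_core_web_md")  # medium and large models ship with word vectors
token = nlp("dog")[0]
print(token.has_vector)    # True
print(token.vector.shape)  # (300,) for this model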

You can learn more about these models here.

You can also view what the community has produced using spaCy here.

spaCy has a vibrant community that does amazing work building open-source projects, training language models, and developing plugins for visualization or advanced linguistic tasks.

You can find these projects referenced on this page.

This is helpful for getting started or getting inspired and, in many cases, for solving an issue the official library doesn’t solve yet.

If you made it this far, I’d like to thank you for your time and I hope that you’ve enjoyed these few tips about spaCy as much as I did.

spaCy is extremely powerful and this post is by no means an exhaustive overview of all its features. There’s obviously more to cover between the training pipeline and the integration with the Transformers library.

But this may be the subject of another post.

Anyway, that’ll be all for me today. Until next time! 👋

