Elegant prompt versioning and LLM model configuration with spacy-llm
Managing prompts and handling OpenAI request failures can be a challenging task. Fortunately, spaCy released spacy-llm, a powerful tool that simplifies prompt management and eliminates the need to create a custom solution from scratch.
In this article, you will learn how to use spacy-llm to create a task that extracts data from text using a prompt. We will cover the basics of spaCy and explore some of the features of spacy-llm.
spaCy and spacy-llm 101
spaCy is a library for advanced NLP in Python and Cython. When dealing with text data, several processing steps are typically required, such as tokenization and POS tagging. To execute these steps, spaCy provides the nlp object, which runs a processing pipeline when called on a text.
spaCy v3.0 introduced config.cfg, a file where we can include detailed settings for these pipelines.
config.cfg uses confection, a config system which allows the creation of arbitrary object trees. For instance, confection parses the following config.cfg:
[training]
patience = 10
dropout = 0.2
use_vectors = false
[training.logging]
level = "INFO"
[nlp]
# This uses the value of training.use_vectors
use_vectors = ${training.use_vectors}
lang = "en"
into:
{
  "training": {
    "patience": 10,
    "dropout": 0.2,
    "use_vectors": false,
    "logging": {
      "level": "INFO"
    }
  },
  "nlp": {
    "use_vectors": false,
    "lang": "en"
  }
}
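To make this mapping concrete, here is a simplified sketch that uses only the Python standard library. It is not how confection is implemented; parse_nested and CONFIG_TEXT are hypothetical names introduced for illustration, and the sketch only handles the two ideas shown above: dotted section names becoming nested dictionaries, and ${section.key} references being replaced by the referenced value.

```python
import configparser
import json
import re

# The same config as above, kept as a plain string for the sketch.
CONFIG_TEXT = """
[training]
patience = 10
dropout = 0.2
use_vectors = false

[training.logging]
level = "INFO"

[nlp]
# This uses the value of training.use_vectors
use_vectors = ${training.use_vectors}
lang = "en"
"""


def parse_nested(text: str) -> dict:
    """Toy parser: nest dotted sections and resolve ${section.key} references."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    # Flat lookup table: "section.key" -> raw string value.
    flat = {
        f"{section}.{key}": raw
        for section in parser.sections()
        for key, raw in parser.items(section)
    }

    def resolve(raw: str):
        match = re.fullmatch(r"\$\{(.+)\}", raw)
        if match:
            # Follow the reference to the value it points at.
            return resolve(flat[match.group(1)])
        # Values in this example are JSON-compatible literals.
        return json.loads(raw)

    tree: dict = {}
    for section in parser.sections():
        node = tree
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser.items(section):
            node[key] = resolve(raw)
    return tree


print(json.dumps(parse_nested(CONFIG_TEXT), indent=2))
```

In a real project you would let confection (through spaCy) do this work; the sketch is only meant to show why the [training.logging] section ends up inside "training" and why nlp.use_vectors takes the value false.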
Each pipeline uses components, and spacy-llm stores the pipeline components in registries using catalogue. This library, also from Explosion, provides function registries that make the components easy to manage. An llm component is defined by two main settings:
- A task, defining the prompt to send to the LLM as well as the functionality to parse the resulting response
- A model, defining the model and how to connect to it
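The idea of a function registry can be sketched in a few lines of plain Python. This is a toy stand-in written for this article, not catalogue's actual API: ToyRegistry, make_echo_task and the name my_namespace.EchoTask.v1 are all hypothetical. It shows the core mechanism, a string name mapped to a factory function that a config file can later refer to.

```python
from typing import Callable, Dict


class ToyRegistry:
    """Maps string names to factory functions (simplified stand-in for a registry)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        """Return a decorator that stores the decorated function under `name`."""

        def decorator(func: Callable) -> Callable:
            if name in self._factories:
                raise ValueError(f"{name} is already registered.")
            self._factories[name] = func
            return func

        return decorator

    def get(self, name: str) -> Callable:
        """Look up a previously registered factory by name."""
        if name not in self._factories:
            raise KeyError(f"Unknown registry entry: {name}")
        return self._factories[name]


llm_tasks = ToyRegistry()


@llm_tasks.register("my_namespace.EchoTask.v1")
def make_echo_task(prefix: str = "Echo: ") -> Callable[[str], str]:
    """Factory: builds a trivial 'task' that prefixes its input."""
    return lambda text: prefix + text


# A config system only needs the string name to build the object.
factory = llm_tasks.get("my_namespace.EchoTask.v1")
task = factory(prefix=">> ")
print(task("hello"))
```

Because lookups go through a versioned string name, a new variant can be registered under a new name without touching the code that consumes it.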
To include a component that uses an LLM in our pipeline, we need to follow a few steps. First, we need to create a task and register it in the registry. Next, we can use a model to execute the prompt and retrieve the responses. Now it’s time to do all that so we can run the pipeline.
Creating a task to extract data from text
We will use quotes from https://dummyjson.com/ and create a task to extract the context from every quote. We will create the prompt, register the task and finally create the config file.
1. The prompt
spacy-llm uses Jinja templates to define the instructions and examples. The {{ text }} will be replaced by the quote we will provide. This is our prompt:
You are an expert at extracting context from text.
Your task is to accept a quote as input and provide the context of the quote.
This context will be used to group the quotes together.
Do not put any other text in your answer and provide the context in 3 words max.
{# whitespace #}
{# whitespace #}
Here is the quote that needs classification
{# whitespace #}
{# whitespace #}
Quote:
'''
{{ text }}
'''
Context
2. The task class
Now let’s create the class for the task. The class should implement two functions:
- generate_prompts(docs: Iterable[Doc]) -> Iterable[str]: a function that takes in a list of spaCy Doc objects and transforms them into a list of prompts
- parse_responses(docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]: a function for parsing the LLM's outputs into spaCy Doc objects
generate_prompts will use our Jinja template and parse_responses will add the attribute context to our Doc. This is the QuoteContextExtractTask class:
from pathlib import Path
from typing import Iterable

import jinja2
from spacy.tokens import Doc
from spacy_llm.registry import registry

TEMPLATE_DIR = Path("templates")


def read_template(name: str) -> str:
    """Read a template"""
    # The ".jinja" suffix is appended here, so pass the bare template name.
    path = TEMPLATE_DIR / f"{name}.jinja"
    if not path.exists():
        raise ValueError(f"{name} is not a valid template.")
    return path.read_text()


class QuoteContextExtractTask:
    def __init__(self, template: str = "quotecontextextract", field: str = "context"):
        self._template = read_template(template)
        self._field = field

    def _check_doc_extension(self):
        """Add extension if need be."""
        if not Doc.has_extension(self._field):
            Doc.set_extension(self._field, default=None)

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # Compile the template once, then render one prompt per doc.
        environment = jinja2.Environment()
        _template = environment.from_string(self._template)
        for doc in docs:
            prompt = _template.render(
                text=doc.text,
            )
            yield prompt

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        self._check_doc_extension()
        for doc, prompt_response in zip(docs, responses):
            try:
                # Drop an echoed "Context:" label and store the rest on doc._.<field>.
                setattr(
                    doc._,
                    self._field,
                    prompt_response.replace("Context:", "").strip(),
                )
            except ValueError:
                setattr(doc._, self._field, None)
            yield doc
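The parsing step in parse_responses boils down to a small string cleanup, which is easy to try in isolation. The standalone sketch below mirrors that logic; clean_context is a hypothetical helper name, not part of spacy-llm, and the sample responses are made up for illustration.

```python
def clean_context(response: str, label: str = "Context:") -> str:
    """Strip an echoed label and surrounding whitespace from an LLM reply."""
    return response.replace(label, "").strip()


# The model may or may not repeat the label from the prompt.
print(clean_context("Context: Business ethics."))
print(clean_context("  Leadership\n"))
```

Keeping this step tolerant of an echoed label matters because the model does not always follow the "no other text" instruction in the prompt.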
Now we just need to add the task to the spacy-llm llm_tasks registry:
@registry.llm_tasks("my_namespace.QuoteContextExtractTask.v1")
def make_quote_extraction() -> "QuoteContextExtractTask":
    return QuoteContextExtractTask()
3. The config.cfg file
We’ll use the GPT-3.5 model from OpenAI. spacy-llm already has a model for that, so we just need to make sure the API key is available as an environment variable:
export OPENAI_API_KEY="sk-..."
export OPENAI_API_ORG="org-..."
To build the nlp object that runs the pipeline, we’ll use the assemble function from spacy-llm. This function reads from a .cfg file. The file should reference the GPT-3.5 model (it’s already in the registry) and the task we’ve created:
[nlp]
lang = "en"
pipeline = ["llm"]
batch_size = 128
[components]
[components.llm]
factory = "llm"
[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.1}
[components.llm.task]
@llm_tasks = "my_namespace.QuoteContextExtractTask.v1"
4. Running the pipeline
Now we just need to put everything together and run the code:
import os
from pathlib import Path

import typer
from wasabi import msg
from spacy_llm.util import assemble

# Imported for its side effect: loading the module registers the task.
from quotecontextextract import QuoteContextExtractTask

Arg = typer.Argument
Opt = typer.Option


def run_pipeline(
    # fmt: off
    text: str = Arg("", help="Text to perform text categorization on."),
    config_path: Path = Arg(..., help="Path to the configuration file to use."),
    verbose: bool = Opt(False, "--verbose", "-v", help="Show extra information."),
    # fmt: on
):
    if not os.getenv("OPENAI_API_KEY", None):
        msg.fail(
            "OPENAI_API_KEY env variable was not found. "
            "Set it by running 'export OPENAI_API_KEY=...' and try again.",
            exits=1,
        )

    msg.text(f"Loading config from {config_path}", show=verbose)
    nlp = assemble(
        config_path
    )
    doc = nlp(text)

    msg.text(f"Quote: {doc.text}")
    msg.text(f"Context: {doc._.context}")


if __name__ == "__main__":
    typer.run(run_pipeline)
And run:
python3 run_pipeline.py "We must balance conspicuous consumption with conscious capitalism." ./config.cfg
>>>
Quote: We must balance conspicuous consumption with conscious capitalism.
Context: Business ethics.
If you want to change the prompt, just create another Jinja file and create a my_namespace.QuoteContextExtractTask.v2 task the same way we’ve created the first one. If you want to change the temperature, just change the parameter on the config.cfg file. Nice, right?
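As an illustration, switching to a second prompt version and a different temperature could look like the fragment below. It assumes a task has already been registered under the name my_namespace.QuoteContextExtractTask.v2, and the temperature value is an arbitrary example, not a recommendation.

```ini
[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
# Example value only; adjust to your use case.
config = {"temperature": 0.3}

[components.llm.task]
# Assumes the v2 task was registered the same way as v1.
@llm_tasks = "my_namespace.QuoteContextExtractTask.v2"
```

The Python code that runs the pipeline stays the same; only the config file changes, so the v1 setup remains available for comparison.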
Final thoughts
The ability to handle OpenAI REST requests and its straightforward approach to storing and versioning prompts are my favorite things about spacy-llm. Additionally, the library offers a Cache for caching prompts and responses per document, a method for providing examples for few-shot prompts, and a logging feature, among other things.
You can take a look at the entire code from today here: https://github.com/dmesquita/spacy-llm-elegant-prompt-versioning.
As always, thank you for reading!
Elegant prompt versioning and LLM model configuration with spacy-llm was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.