
End Python Dependency Hell with pip-compile-multi

by Jake Schmidt | Dec 2022



Photo by John Barkiple on Unsplash

Most Python projects of consequence have complex dependency management requirements that are inadequately addressed by common open-source solutions. Some tools try to tackle the entire packaging experience, while others aim to solve one or two narrow subproblems. Despite the myriad solutions, developers still face the same dependency management challenges:

  1. How can new users and contributors easily and correctly install dependencies?
  2. How do I know my dependencies are all compatible?
  3. How do I make builds deterministic and reproducible?
  4. How do I ensure my deployment artifacts use coherent and compatible dependencies?
  5. How do I avoid dependency bloat?

This post will focus on answering these questions using pip-compile-multi, an open-source command line tool that extends the capabilities of the popular pip-tools to address the needs of projects with complex dependencies.

A partial solution is to maintain a dependency lockfile, and tools such as poetry and pip-tools enable this. We can think of a lockfile almost like a “dependency interface”: an abstraction that tells the project what external dependencies it needs to function properly. The problem with having a single, monolithic lockfile for your entire project is that, as an interface, it is not well-segregated: to ensure compatibility, determinism, and reproducibility, every consumer of the code (user, developer, packaging system, build artifact, deployment target) will need to install every single dependency the lockfile enumerates—whether they actually use it or not. You’ve encountered this issue if you’ve ever struggled to separate your linting and testing libraries out of your production build, for example.

The resulting dependency bloat can be a real issue. Aside from unnecessarily ballooning build times and package/artifact size, it increases the surface area of security vulnerabilities in your project or application.

Vulnerabilities I found in one of my projects using safety.

Ideally, we could restructure our dependency interface into multiple, narrower ones—multiple lockfiles that:

  • group dependencies by function
  • can be composed with each other
  • can be consumed independently
  • are mutually compatible

If we can do that, things get easier:

  • understanding what dependencies are used where
  • packaging variants (e.g. defining pip extras)
  • multi-stage builds (e.g. Docker multi-stage)

Fortunately, pip-compile-multi does all of the above! It’s a lightweight, pip-installable CLI built on top of the excellent pip-tools project. You simply split your requirements.txt file into one or more pip requirements files (typically suffixed .in). Each file may contain one or more -r / --requirement options, which link the files together as a Directed Acyclic Graph (DAG). This DAG representation of dependencies is central to pip-compile-multi.

Example

Let’s say your requirements.txt looks like this:

# requirements.txt

flake8
mypy
numpy
pandas
torch>1.12

The first step is to split out these dependencies into functional groups. We’ll write one group to main.in and another to dev.in. We should now delete our requirements.txt. Our two new .in files might look something like this, forming a simple two-node dependency DAG:

A simple two-node dependency DAG. Main project dependencies go in `main.in`, and our code linters and related dev tooling go into `dev.in`. This keeps our dependencies logically grouped.
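Reconstructing from the caption, the two files might look like this (a sketch; the split simply regroups the original requirements.txt above):

# main.in

numpy
pandas
torch>1.12

# dev.in

-r main.in

flake8
mypy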

Each node is a .in file defining a dependency group. Each directed edge represents the requirement of one group by another. Each node defines its own in-edges with one or more -r / --requirement options.

Once we have this dependency DAG defined, running pip-compile-multi will generate an equivalent lockfile DAG. The tool will output a .txt pip requirements file for each .in in the DAG.

The lockfile DAG compiled by pip-compile-multi. I’ve removed the autogenerated inline comments in these lockfiles, but in practice you should never need to manually edit them.
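For illustration, the compiled pair might look roughly like this (the version pins below are hypothetical; your compile will resolve its own):

# main.txt

numpy==1.23.5
pandas==1.5.2
torch==1.13.1

# dev.txt

-r main.txt

flake8==6.0.0
mypy==0.991

Note how the compiled dev.txt references the compiled main.txt, mirroring the -r edge between the .in files.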

By default, the produced lockfiles will be created in the same directory as the .in files and mirror their names.

Autoresolution of cross-file conflicts

The killer feature that separates pip-compile-multi from other lockfile tools such as pip-tools is autoresolution of cross-file conflicts, enabled with the --autoresolve flag. In autoresolve mode, pip-compile-multi first pre-solves for all dependencies together, then uses that solution to constrain each node’s individual solution. This keeps the lockfiles mutually compatible by preventing conflicts among their transitive dependencies. To use autoresolution, your DAG must have exactly one source node; in the two-node example above, dev.in is that source and main.in is the sink. (Note that the pip-compile-multi documentation inverts the directionality of DAG edges, so it refers to sink nodes where I say source, and vice versa.)

Lockfile verification

Another useful command is pip-compile-multi verify, which checks that your lockfiles match what is specified in your .in files. This simple yet valuable check is easy to incorporate into your CI/CD pipeline to protect against errant dependency updates. It’s even available as a pre-commit hook!
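For example, a minimal .pre-commit-config.yaml entry might look like this (the hook id and rev are assumptions on my part; check the pip-compile-multi repository for the exact values):

# .pre-commit-config.yaml

repos:
  - repo: https://github.com/peterdemin/pip-compile-multi
    rev: v2.6.1  # hypothetical pin; use the latest release
    hooks:
      - id: pip-compile-multi-verify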

Organize dependencies appropriately

If you group your dependencies poorly, you’re setting yourself up for failure. Try to define groups based on the intended function of the dependencies in your code: don’t put flake8 (a code linter) in a group with torch (a deep learning framework).

Have a single source node and a single sink node

I’ve found that things work best when you organize your most ubiquitous dependencies into a single “core” group that all other nodes require (a sink node), and all of your development dependencies into a single node that, directly or indirectly, requires all others (a source node). This pattern keeps your DAG relatively simple and ensures you can use pip-compile-multi’s great autoresolve feature.

Enable the pip cache

Setting the --use-cache flag can drastically speed up pip-compile-multi because it enables caching in the underlying calls to pip-compile.

To make things clearer, let’s work through an example from the realm of machine learning.

A typical machine learning system will have at least two components: a training workload that creates a model on some data, and an inference server to serve model predictions.

Both components will have some common dependencies, such as libraries for data processing and modeling. We can list these in a text file called main.in, which is just a pip requirements file:

# requirements/main.in

pandas
torch>1.12

The training component might have some idiosyncratic dependencies for distributed communications, experiment tracking, and metric computation. We’ll put these in training.in:

# requirements/training.in

-r main.in

horovod
mlflow==1.29
torchmetrics

Notice we add the -r flag, which tells pip-compile-multi that training.in requires the dependencies from main.in.

The inference component will have some exclusive dependencies for serving and monitoring, which we add to inference.in:

# requirements/inference.in

-r main.in

prometheus
torchserve

Finally, the entire codebase shares the same development toolchain. These development tools, such as linters, unit testing modules, and even pip-compile-multi itself, go in dev.in:

# requirements/dev.in

-r inference.in
-r training.in

flake8
pip-compile-multi
pytest

Again, notice the -r flags indicating dev.in depends on training.in and inference.in. We don’t need a -r main.in because training.in and inference.in already have it.

Together, our dependency DAG looks like this:

A four-node dependency DAG.
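As a rough sketch of that DAG (edges point from each group down to the groups it requires):

             dev.in
            /      \
   training.in    inference.in
            \      /
            main.in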

Assuming our .in files are inside a directory called requirements/, we can use the following command to solve our DAG and generate lockfiles:

pip-compile-multi --autoresolve --use-cache --directory=requirements

After the command succeeds, you will see four new files inside requirements/: main.txt, training.txt, inference.txt, and dev.txt. These are our lockfiles. We can use them the same way we’d use a valid requirements.txt file. Perhaps we could use them to build efficient Docker multi-stage image targets:
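Here is a minimal sketch of such a Dockerfile (the base image, paths, and stage names are my assumptions, not prescribed by pip-compile-multi):

# Dockerfile

FROM python:3.10-slim AS base
WORKDIR /app
# Install the shared core dependencies once, in a common base stage.
COPY requirements/main.txt requirements/
RUN pip install --no-cache-dir -r requirements/main.txt

FROM base AS training
# Layer only the training-specific dependencies on top of the base.
COPY requirements/training.txt requirements/
RUN pip install --no-cache-dir -r requirements/training.txt

FROM base AS inference
# Layer only the inference-specific dependencies on top of the base.
COPY requirements/inference.txt requirements/
RUN pip install --no-cache-dir -r requirements/inference.txt

Because each stage installs only its own lockfile, the inference image never ships training-only dependencies like horovod.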

Or perhaps we are a new project contributor installing the environment. We could simply run pip install -r requirements/dev.txt (or, better yet, pip-sync requirements/dev.txt, which also uninstalls anything not in the lockfile) to set up a complete development environment, with all the dev dependencies.

The number of tooling options for managing Python dependencies is overwhelming. Few tools have great support for segmenting dependencies by function, which I argue is becoming a common project requirement. While pip-compile-multi is not a silver bullet, it enables elegant dependency segregation, and adding it to your project is straightforward!

