
How PyTorch 2.0 Accelerates Deep Learning with Operator Fusion and CPU/GPU Code-Generation | by Shashank Prasanna | Apr, 2023



illustration by author

Computer programming is magical. We write code in human readable languages, and as though by magic, it gets translated into electric currents through silicon transistors making them behave like switches and allowing them to implement complex logic — just so we can enjoy cat videos on the internet. Between the programming language and hardware processors that run it, is an important piece of technology — the compiler. A compiler’s job is to translate and simplify our human readable language code into instructions that a processor understands.

Compilers play a very important role in deep learning: they improve training and inference performance, improve energy efficiency, and target diverse AI accelerator hardware. In this blog post I’m going to discuss the deep learning compiler technologies that power PyTorch 2.0. I’ll walk you through the different phases of the compilation process and discuss the various underlying technologies with code examples and visualizations.

A deep learning compiler translates high-level code written in deep learning frameworks into optimized lower level hardware specific code to accelerate training and inference. It finds opportunities in deep learning models to optimize for performance by performing layer and operator fusion, better memory planning, and generating target specific optimized fused kernels to reduce function call overhead.

illustration by author

Unlike traditional software compilers, deep learning compilers have to work with highly-parallelizable code that is often accelerated on specialized AI accelerator hardware (GPUs, TPUs, AWS Trainium/Inferentia, Intel Habana Gaudi etc.). To improve performance, a deep learning compiler has to take advantage of hardware-specific features such as mixed precision support and performance-optimized kernels, and minimize communication between the host (CPU) and the AI accelerator.

While deep learning algorithms are continuing to advance at a rapid pace, hardware AI accelerators have also been evolving alongside to keep up with deep learning algorithm performance and efficiency needs. I discuss the co-evolution of algorithms and AI accelerators in an earlier blog post:

In this blog post I’ll focus on the software side of things, and particularly the subset of software closer to the hardware — deep learning compilers. First, let’s start by taking a look at different functions in a deep learning compiler.

PyTorch 2.0 includes new compiler technologies to improve model performance and runtime efficiency and target diverse hardware backends with a simple API: torch.compile(). While other blog posts and articles have discussed performance benefits of PyTorch 2.0 in detail, here I’m going to focus on what happens under the hood when you invoke the PyTorch 2.0 compiler. If you’re looking for quantified performance benefits, you can find a performance dashboard of different models from huggingface, timm and torchbench.

At a high level, with the default options, the PyTorch 2.0 deep learning compiler performs the following key tasks:

  1. Graph capture: Computational graph representation of your models and functions. PyTorch technologies: TorchDynamo, Torch FX, FX IR
  2. Automatic differentiation: Backward graph tracing using automatic differentiation and lowering to primitive operators. PyTorch technologies: AOTAutograd, ATen IR
  3. Optimizations: Forward and backward graph-level optimizations and operator fusion. PyTorch technologies: TorchInductor (default) or other compilers
  4. Code generation: Generating hardware-specific C++/GPU code. PyTorch technologies: TorchInductor with OpenAI Triton (default), or other compilers

Through these steps, the compiler transforms your code and generates intermediate representations (IRs) that are progressively “lowered”. Lowering is a term in the compiler lexicon that refers to mapping a broad set of operations (such as supported by PyTorch API) to a narrow set of operations (such as supported by hardware) through automatic transformation and re-writing by the compiler. The PyTorch 2.0 compiler flow:

If you are new to compiler terminology, don’t let all of this scare you yet. I’m not a compiler engineer either. Keep reading, and things will become clear as I break the process down using a simple example and visualizations.

Note: This whole walkthrough is in a Jupyter Notebook hosted here

For the sake of simplicity, I’ll define a very simple function and run it through the PyTorch 2.0 compiler process. You can replace this function with a deep neural network model or an nn.Module subclass, but this example should help you appreciate what’s going on under the hood much better than a complex multi-million parameter model.

PyTorch code for that function:

def f(x):
    return torch.sin(x)**2 + torch.cos(x)**2

If you paid attention in high-school trigonometry class, you know that the value of our function is always going to be 1 for all real-valued x. That means its derivative, the derivative of a constant, must be equal to zero. This will come in handy to verify what the function and its derivatives are doing.

Now it’s time to call torch.compile(). First, let’s convince ourselves that compiling this function doesn’t change its output. For the same 1×1000 random vector, the mean squared error between the output of our function and a vector of 1s should be zero for both the compiled and the uncompiled function (under some error tolerance).
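A minimal sketch of that check (the shapes follow the text above; the variable names are my own):

import torch

def f(x):
    return torch.sin(x)**2 + torch.cos(x)**2

x = torch.rand(1, 1000)            # 1x1000 random vector
ones = torch.ones(1, 1000)         # expected output: a vector of 1s

compiled_model = torch.compile(f)  # the single extra line

mse_eager = torch.mean((f(x) - ones) ** 2)
mse_compiled = torch.mean((compiled_model(x) - ones) ** 2)

print(mse_eager.item(), mse_compiled.item())  # both should be ~0 within tolerance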

screenshot by author

All we did was add a single extra line of code, torch.compile(), to invoke our compiler. Let’s now take a look at what’s happening under the hood at each stage.

PyTorch technologies: TorchDynamo, FX Graphs, FX IR

The first step for the compiler is to determine what to compile. Enter TorchDynamo. TorchDynamo intercepts the execution of your Python code and transforms it into FX intermediate representation (IR), and stores it in a special data structure called an FX graph. What does this look like, you ask? Glad you asked. Below, we’ll take a look at the code we use to generate this, but here is the transformation and output:

screenshot by author

It’s important to note that Torch FX graphs are just containers for IR and don’t really specify what operators they should hold. In the next section we’ll see the FX graph container come up again with a different set of IRs. If you compare the function code and the FX IR, there’s very little difference between the two. In fact, it’s the same PyTorch code you wrote, but laid out in the format that the FX graph data structure expects. Both will produce the same result when executed.

If you call torch.compile() without any arguments, it’ll use the default settings, which run the entire compiler stack including the default hardware backend compiler called TorchInductor. But we’d be jumping ahead if we discussed TorchInductor now, so let’s park that topic and come back to it when we’re ready. First we need to discuss graph capture, and we can do that by intercepting the calls from torch.compile(). Here’s how we’ll do that: torch.compile() allows you to provide your own compiler too, and because I’m not a compiler engineer and don’t have the slightest clue how to write one, I’ll provide a fake compiler function to capture the FX graph IR that TorchDynamo generates.

Below is our fake compiler backend function, called inspect_backend, which we pass to torch.compile(); within that function I do two things:

  1. Print the FX IR code that was captured by TorchDynamo
  2. Save the FX graph visualization
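Here is a minimal sketch of such a backend; inspect_backend is the name used in this post, but the body below (gm.print_readable() plus FX’s graph drawer) is my reconstruction of the two steps above rather than the exact original code, and saving the SVG assumes graphviz/pydot are installed.

import torch
from torch.fx.passes.graph_drawer import FxGraphDrawer

def inspect_backend(gm, sample_inputs):
    # 1. Print the FX IR code captured by TorchDynamo
    gm.print_readable()
    # 2. Save the FX graph visualization
    g = FxGraphDrawer(gm, "f")
    with open("forward_graph.svg", "wb") as file:
        file.write(g.get_dot_graph().create_svg())
    # Return the forward callable unchanged so the function still runs
    return gm.forward

compiled_model = torch.compile(f, backend=inspect_backend)
x = torch.rand(1, 1000)
compiled_model(x)  # graph capture happens on this first call with real data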

The output of the above code snippet is the FX IR code and the graph diagram showing our function sin^2(x) + cos^2(x).

screenshot by author

Note that our fake compiler inspect_backend function is only invoked when we call the compiled function with some data, i.e. when we call compiled_model(x). In the above code snippet, we’re only evaluating the function, or in deep learning terminology, doing a “forward pass”. In the next section we’ll take advantage of PyTorch’s automatic differentiation engine, torch.autograd, to compute the derivative and the “backward pass” graph.

PyTorch technologies: AOTAutograd, Core Aten IR

TorchDynamo gave us the forward-pass function evaluation as an FX graph, but what about the backward pass? For the sake of completeness, I’m going to digress from our primary topic and talk a bit about why we need to evaluate the gradients of a function with respect to its weights. If you’re already familiar with how mathematical optimization works, skip ahead to the next subsection.

What is backward pass and backward graph?

The “learning” part of deep learning and machine learning is a mathematical optimization problem, which is simply stated as: find the value of a variable w that yields the lowest value of some function of w. Or more succinctly:
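In symbols, that is (notation mine):

\min_{w} f(w)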

In machine learning, f(w) is the loss function parametrized by the weights. f(w) can be more concretely represented as some measure of error between the training labels and the model’s predicted labels on the training data:
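For example, averaged over N training examples (one common form; notation mine):

f(w) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}\big(y_i,\ \hat{y}(x_i; w)\big)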

It turns out that if we can calculate the “rate of reduction” of the loss with respect to the weights, we can update our weights to move one step closer to a smaller loss f(w). In other words, we move closer to a model that better fits our training dataset. We can find the next values of the weights by calculating the steepest slope of the loss f(w) at a given w and perturbing w to head in that direction. The slope of a function with respect to the weights is its derivative with respect to the weights. Since there is more than one weight value, the derivative becomes a vector quantity called the gradient, a vector of partial derivatives with one component per weight. The weights w are perturbed at each iteration by some function g() of the gradients as follows:
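One common way to write this update, with t indexing the training iteration (notation mine):

w_{t+1} = w_t - g\big(\nabla_w f(w_t)\big)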

where the function g(.) depends on the optimizer (e.g. SGD, SGD with momentum, RMSProp, Adam, etc.).

For SGD the weight update step becomes:
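With a learning rate α, the SGD update is (notation mine):

w_{t+1} = w_t - \alpha \, \nabla_w f(w_t)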

How does PyTorch 2.0 trace the backward pass graph?

First, let’s work out what we expect the backward-pass graph to look like, and then compare it with what PyTorch generates. For our simple function, the forward graph and the backward graph should implement the following functions. If sin and cos bother you, you can imagine f(x) being the loss function applied to a neural network.
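Written out, the two functions are:

f(x) = \sin^2(x) + \cos^2(x) = 1

\frac{df}{dx} = 2\sin(x)\cos(x) - 2\cos(x)\sin(x) = 0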

PyTorch uses reverse-mode automatic differentiation to compute the gradients, and PyTorch’s implementation of automatic differentiation is called Autograd. PyTorch 2.0 introduces AOTAutograd, which traces the forward and backward graphs ahead of time, i.e. prior to execution, and generates a joint forward and backward graph. It then partitions the joint graph into separate forward and backward graphs. Both graphs are stored in the FX graph data structure and can be visualized as shown below.

screenshot by author

You can verify that the math checks out by working through the nodes on the graph. The AOTAutograd-generated backward pass indeed computes the derivative shown in the equation I shared earlier, which should equal zero since the original function always evaluates to the constant 1.

We’ll now run AOTAutograd by extending our fake compiler function inspect_backend to call AOTAutograd and generate our backward graph. The updated inspect_backend defines forward (fw) and backward (bw) compiler capture functions that read the forward and backward graphs from AOTAutograd, print the lowered ATen IR, and save the FX graphs for the forward and backward passes.
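A sketch of that extension, assuming the aot_module_simplified helper from functorch.compile that ships with PyTorch 2.0; the fw/bw names mirror the description above, and the printing and graph-saving details are my reconstruction rather than the exact original code.

import torch
from functorch.compile import aot_module_simplified
from torch.fx.passes.graph_drawer import FxGraphDrawer

def inspect_backend(gm, sample_inputs):
    def fw(gm, sample_inputs):
        gm.print_readable()  # forward graph, lowered to ATen IR
        g = FxGraphDrawer(gm, "forward")
        with open("forward_aot.svg", "wb") as file:
            file.write(g.get_dot_graph().create_svg())
        return gm.forward

    def bw(gm, sample_inputs):
        gm.print_readable()  # backward graph, traced ahead of time
        g = FxGraphDrawer(gm, "backward")
        with open("backward_aot.svg", "wb") as file:
            file.write(g.get_dot_graph().create_svg())
        return gm.forward

    # Hand the TorchDynamo-captured graph to AOTAutograd with our capture functions
    return aot_module_simplified(gm, sample_inputs, fw_compiler=fw, bw_compiler=bw)

compiled_model = torch.compile(f, backend=inspect_backend)
x = torch.rand(1, 1000, requires_grad=True)
compiled_model(x).sum().backward()  # the backward graph is captured here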

This will generate the following forward AND backward graphs. Notice that the forward graph also looks slightly different from what we saw earlier. For example, torch.sin(x) in the FX graph IR and in our original code has been replaced by torch.ops.aten.sin.default(). What’s this funny thing called aten, you might ask, if you’re not already familiar with it. ATen stands for “A Tensor library”, a very creatively named low-level library with a C++ interface that implements many of the fundamental operations that run on CPU and GPU.

In eager mode operation, your PyTorch operations are routed to this library which then calls the appropriate CPU or GPU implementation. AOTAutograd automatically generates code that replaces the higher level PyTorch API with ATen IR for the forward and backward graph which you can see in the output below:

screenshot by author

You can also see that, in addition to the output of the forward pass, the forward graph outputs some additional tensors: [add, sin, cos, primals_1]. These tensors are saved for the backward pass for gradient calculation. You can also see this in the computational graphs for the forward and backward passes in the figure shared earlier.

ATen IR is the list of operators supported by the ATen library, as we discussed in the previous section, and you can see the full list of operations implemented in the ATen library here. There are two other IR concepts in PyTorch you should be aware of: 1/ Core ATen IR and 2/ Prims IR. Core ATen IR is a subset of the broader ATen IR, and Prims IR is an even smaller subset of Core ATen IR. Let’s say you are designing a processor and want to support PyTorch code acceleration on your hardware. It’d be near impossible to support the full PyTorch API in hardware, so what you can do is build a compiler that only supports the smaller set of fundamental operators defined in Core ATen IR or Prims IR, and let AOTAutograd decompose compound operators into those core operators, as we’ll see in the next section.

screenshot by author

Core Aten IR (formerly canonical Aten IR) is a subset of the Aten IR that can be used to compose all other operators in the Aten IR. Compilers that target specific hardware accelerators can focus on supporting only the Core Aten IR and mapping it to their low level hardware API. This makes it easier to add hardware support to PyTorch since they don’t have to implement support for the full PyTorch API which will continue to grow with more and more abstractions.

Prims IR is an even smaller subset of Core ATen IR that further decomposes Core ATen IR ops into fundamental operations, making it even easier for compilers that target specific hardware to support PyTorch. Decomposing operators into lower- and lower-level operations will almost certainly degrade performance due to excess memory writes and function call overhead, but the expectation is that hardware compilers can fuse these operators back together to match their hardware APIs and recover the performance.

While we don’t need to further decompose our function into Core ATen IR and Prims IR, I’ll demonstrate how to do it below.

If you’re designing hardware or hardware compilers, it’d be near impossible to support the full PyTorch API in hardware, especially given the pace at which deep learning and AI are advancing. The advantage for a hardware designer, though, is that most deep learning functionality can be mapped onto very few basic mathematical operations, and the most computationally intensive ones are matrix-matrix and matrix-vector operations. Compound operators like those in the PyTorch API can be decomposed into these fundamental operations using AOTAutograd, as we’ll discuss in this section. If you don’t deal with low-level hardware, you can skip this section.

You can update the AOTAutograd call to pass in a dictionary of decompositions that lowers the ATen IR into Core ATen IR and Prims IR; by default, operators are not decomposed. I’ll only share the relevant code snippet and output here, since you can find the full notebook on GitHub.

In the code snippet below, I’ve converted our function f into a loss function f_loss by including the computation of mean squared error (MSE) in the function. I’m doing this to demonstrate how AOTAutograd can decompose MSE into its fundamental operators.
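A sketch of what that could look like, assuming the get_decompositions helper from torch._decomp and the decompositions argument of aot_module_simplified; treat the exact op list and calls as an illustration rather than the verbatim notebook code.

import torch
from functorch.compile import aot_module_simplified
from torch._decomp import get_decompositions

def f_loss(x, target):
    y = torch.sin(x)**2 + torch.cos(x)**2
    return torch.nn.functional.mse_loss(y, target)

# Ask AOTAutograd to decompose mse_loss into more fundamental ATen ops
decompositions = get_decompositions([
    torch.ops.aten.mse_loss,
    torch.ops.aten.mse_loss_backward,
])

def inspect_backend(gm, sample_inputs):
    def fw(gm, sample_inputs):
        gm.print_readable()  # shows sub, pow, mean instead of mse_loss
        return gm.forward
    return aot_module_simplified(gm, sample_inputs,
                                 fw_compiler=fw,
                                 decompositions=decompositions)

compiled_model = torch.compile(f_loss, backend=inspect_backend)
x = torch.rand(1, 1000, requires_grad=True)
compiled_model(x, torch.ones(1, 1000)).backward()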

screenshot by author

The output of the decomposition shows that mse_loss gets decomposed into more fundamental operations: subtract, power(2), and mean.

screenshot by author

This is because the MSE, or mean squared error, between two vectors x and y only needs those three operations (where power is an element-wise operation), as its definition below shows. If you write a compiler for your hardware, you likely already support these three operations, and after decomposition your PyTorch code would run without further modification.
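The definition (notation mine):

\mathrm{MSE}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2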

You can also see this reflected in the FX graph visualization:

screenshot by author

Now let’s decompose it further into Prims IR, a much smaller set of roughly 250 operators. Again, I’ll only share the relevant code snippet and output here, since you can find the full notebook on GitHub.

screenshot by author

The output of the Prims IR decomposition is below. All the ATen ops shown in red are replaced or decomposed into Prims operators shown in green.

screenshot by author

PyTorch technologies: TorchInductor (default), OpenAI Triton, and other compilers

In this final section of the blog post we’ll discuss operator fusion and automatic code generation for CPUs and GPUs using TorchInductor. First some basics:

What is a deep learning optimizing compiler?

An optimizing compiler for deep learning is good at finding performance gaps in code and addressing them by transforming the code to reduce memory accesses and kernel launches and to optimize data layout for a target backend. TorchInductor is the default optimizing compiler used by torch.compile(); it can generate optimized kernels for GPUs using OpenAI Triton and for CPUs using OpenMP pragma directives.

What is operator fusion in deep learning?

Deep learning is composed of many fundamental operations such as matrix-matrix and matrix-vector multiplications. In PyTorch’s eager mode of execution, each operation results in a separate function call or kernel launch on the hardware. This incurs CPU overhead for launching kernels and results in more memory reads and writes between kernel launches. A deep learning optimizing compiler like TorchInductor can fuse multiple operations into a single compound operator and generate low-level GPU kernels or C++/OpenMP code for it. This results in faster computation due to fewer kernel launches and fewer memory reads/writes.
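Here is a small conceptual illustration (my own, not from the original post) of what fusion buys you for our example function: in eager mode every op is its own kernel with its own memory traffic, while the compiled version can be fused into a single pass over the input.

import torch

x = torch.randn(1_000_000)

# Eager mode: each op is a separate kernel launch with its own memory round trip
a = torch.sin(x)  # read x, write a
b = a ** 2        # read a, write b
c = torch.cos(x)  # read x, write c
d = c ** 2        # read c, write d
y = b + d         # read b and d, write y

# Compiled: TorchInductor can fuse these element-wise ops into one generated kernel,
# so x is read once and y is written once, with no intermediate buffers in memory
fused_f = torch.compile(lambda t: torch.sin(t)**2 + torch.cos(t)**2)
y_fused = fused_f(x)

print(torch.allclose(y, y_fused, atol=1e-6))  # True: fusion doesn't change the math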

screenshot by author

The computational graph output by AOTAutograd in the previous section is composed of many ATen operators represented in an FX graph. TorchInductor’s optimizations don’t change the underlying computation in the graph; they merely restructure it with operator and layer fusion and generate CPU or GPU code for it. Since TorchInductor can see the full forward and backward computational graphs ahead of time, it can make decisions about out-of-order execution of operations that don’t depend on each other, and maximize hardware resource utilization.

Under the hood, for GPU targets, TorchInductor uses OpenAI’s Triton to generate fused GPU kernels. Triton itself is a separate Python-based framework and compiler for writing optimized low-level GPU code that would otherwise be written in CUDA C/C++. The difference here is that TorchInductor generates the Triton code for you, which is then compiled into low-level PTX code.

For multi-core CPU targets, TorchInductor generates C++ code and injects OpenMP pragma directives to generate parallel kernels. From the PyTorch user level world view, this is the IR transformation flow:

Of course, this being the high-level view, I’m omitting some details here, and I encourage you to read the TorchInductor forum post and OpenAI’s Triton blog post.

We’ll now throw away the fake compiler we used in the previous sections and use the full PyTorch compiler stack with its default backend, TorchInductor.

Notice that I’ve passed optional arguments that enable two debug features (the call is sketched after this list):

  • trace.enabled: Generates intermediate code to inspect code generated by TorchInductor
  • trace.graph_enabled: Generates the optimized computational graph visualization after operator fusion
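A sketch of the call, passing the option names from the list above through torch.compile()’s options dictionary; the exact keys follow this post, so verify them against your PyTorch version.

import torch

compiled_model = torch.compile(
    f,
    options={
        "trace.enabled": True,        # dump the intermediate code TorchInductor generates
        "trace.graph_enabled": True,  # save the fused computational graph visualization
    },
)

x = torch.rand(1, 1000, requires_grad=True)
compiled_model(x).sum().backward()  # triggers compilation of the forward and backward graphs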

For our simple example TorchInductor is able to fuse all intermediate operations in our function into a single custom operator, and you can see below how that simplifies the forward and backward computational graphs.

screenshot by author

You must surely be wondering what this fused operator looks like in code. The code for the fused operator is automatically generated by TorchInductor, and it’s in C++ or Triton based on the target device (CPU or GPU). You don’t need to explicitly tell TorchInductor which device to target; it infers it from the device type of the data and the model.

To view the generated code, you have to enable debugging using trace.enabled=True, which creates a directory called torch_compile_debug with debug information.

The full paths to the forward and backward graph code are:

  • torch_compile_debug/run_<DATE_TIME_PID>/aot_torchinductor/model__XX_forward_XX/output_code.py
  • torch_compile_debug/run_<DATE_TIME_PID>/aot_torchinductor/model__XX_backward_XX/output_code.py

If you set device = ‘cuda’ (assuming your computer has a GPU device), then the generated code in the forward folder is in OpenAI Triton:

screenshot by author

If you set device = ‘cpu’, then the generated code is in C++ with OpenMP pragmas:

screenshot by author

Deep learning compilers are complex, with intricate inner workings that rival a Swiss watch. In this blog post, I hope I provided you with a gentle and easy-to-follow primer on this topic and on how these technologies power PyTorch 2.0.

screenshot by author
  1. I started with a simple PyTorch function.
  2. I showed how TorchDynamo captures the graph and represents it in FX IR.
  3. I showed how AOTAutograd generates the backward-pass graph, lowers PyTorch operators into ATen operators, and represents them in an FX graph container.
  4. I discussed how ATen operators can be further decomposed into Core ATen IR and Prims IR, which reduces the number of operators a hardware compiler has to support, without requiring the full PyTorch API.
  5. I showed how TorchInductor performs operator fusion and generates optimized code for CPU and GPU targets.

If you followed along you should be able to provide a high-level response to the following questions:

  • What is a deep learning compiler?
  • What does a PyTorch 2.0 compiler do when you call torch.compile()?
  • Why do we need a forward and backward pass graph ahead of time?
  • What are the different intermediate representations (IR) in PyTorch?
  • What is the difference between ATen IR, Core ATen IR, Prims IR?
  • What is operator fusion and why is it important?

If you found this article interesting, consider following me on Medium to be notified when I publish new articles. Please also check out my other blog posts on Medium, follow me on Twitter (@shshnkp) or LinkedIn, or leave a comment below. Want me to write on a specific topic? I’d love to hear from you!



