Hardware Accelerators for Neural Networks

by Federico Peccia, October 2022

How systolic arrays work

Photo by Vishnu Mohanan on Unsplash

Introduction

In a world where neural networks are being used to process almost any kind of data (from images to audio or heart activity), there is an increasing interest in moving the execution of these models from the cloud to edge (embedded) systems. Why is this an interesting trend? Well, these are some of the reasons for this paradigm change:

  • Energy consumption: instead of running the AI algorithm on expensive and energy-hungry hardware like a GPU, why not build a specific embedded device tailor-made to execute the algorithm much more efficiently?
  • Privacy concerns: instead of sending raw sensor/camera data to the cloud, wouldn’t it be better from a privacy point of view to keep the original data on the device, and share only the already processed results with the cloud?

Although there are multiple hardware architectures and solutions to accelerate these algorithms on embedded devices, one of the most attractive ones is the systolic array-based accelerator. First introduced by H. T. Kung in his 1982 paper, these architectures are built by repeatedly connecting copies of a basic computing cell (known as a processing element, from now on, PE) in a cascading manner. One important property of these arrays is that all PEs are identical.

The basic idea presented by Kung (see paper for actual examples). Image by Author.

Why is this important? Well, simple and regular designs are easier and cheaper to build. Also, because data is propagated between PEs, this architecture reuses a lot of data, which is attractive in terms of memory bandwidth and energy consumption (data is read from memory only once and reused multiple times).

In this article, you will:

  • See how the basic mathematical operation of a neural network is represented
  • Understand how a systolic array is able to execute the matrix multiplication operation (with nice and easy-to-follow animations!)
  • See an example of how a complete hardware accelerator architecture can be built around a systolic array
  • Get a summary of hands-on projects, related papers and educational videos about this topic

DISCLAIMER: to completely understand this article, you should have a basic understanding of linear algebra (know what a matrix multiplication is) and digital circuits (know what a register and a clock are). If you do, you are good to go! If not, don’t let that stop you! I am sure you will find the article interesting.

Reducing a complex neural network to the basics

Nowadays, neural networks are becoming extremely complex algorithms, with hundreds of layers and millions (sometimes billions) of parameters. But for the sake of this article, we will focus on the two most used layers in today’s popular neural network models: convolutional and fully connected layers.

Fully connected layers can be directly described as a matrix multiplication operation. Convolutional layers are a little more complex, but clever transformations exist to turn them into matrix multiplication operations as well (see this paper or this paper for examples of how these transformations work). This is why we will only focus on how to design a systolic array to execute this operation.
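To make this concrete, here is a minimal Python sketch of the well-known im2col lowering (my own illustrative code, for a single channel and stride 1; the papers linked above cover the general case). Each receptive field of the input is flattened into one row of a matrix, so the convolution collapses into a plain matrix multiplication:

```python
import numpy as np

def im2col(x, kh, kw):
    """Flatten every kh x kw receptive field of a single-channel
    input (stride 1) into one row of a matrix."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 input feature map
k = np.ones((3, 3))                           # 3x3 kernel
y = im2col(x, 3, 3) @ k.ravel()               # convolution as a GEMM
print(y.reshape(2, 2))                        # the 2x2 output feature map
```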

Just to refresh your linear algebra knowledge, let’s remember how the basic GEneral Matrix Multiply (GEMM) operation is calculated using a small example we will continue to use during the whole article:

Basic 2×2 matrix multiplication. Video by Author.

What is the basic operation here? Take a look at the animation again. Do you see how we are always multiplying 2 elements, and then adding the result to the accumulated sum of the previous multiplications? This operation is called Multiply-ACCumulate (MACC), and it’s fairly easy to implement in hardware.
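In code, the whole GEMM is nothing more than a MACC repeated inside three nested loops. Here is a minimal Python sketch (the matrix values are arbitrary, not the ones from the animation):

```python
# GEMM as repeated multiply-accumulates (MACCs).
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = [[0, 0],
     [0, 0]]

for i in range(2):          # row of A / C
    for j in range(2):      # column of B / C
        acc = 0
        for k in range(2):  # the reduction dimension
            acc += A[i][k] * B[k][j]  # one MACC per iteration
        C[i][j] = acc

print(C)  # [[19, 22], [43, 50]]
```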

The basic PE

So, we need to design a PE capable of executing this MACC operation.

As I said before, this operation is a multiplication followed by an accumulation. Now, in neural networks, the A matrix is normally the input of the layer, and the B matrix represents the parameters of the layer (its weights). Because these parameters are constant and do not (normally) change while we are using the neural network to predict something, we could actually preload the values of matrix B into the PE, and reuse them across multiple multiplications! I assure you, this will make more sense later. This can be very easily described with the following diagram (the b register is going to be preloaded with the corresponding b parameter):

Basic PE design. Image by Author.

But remember we talked about how these systolic architectures cascade PEs one after the other? Well, in order to do so, let’s just add a couple of wires to allow us to share data from one PE to another:

Basic PE with propagation signals. Image by Author.

This is our basic PE! First, notice how we can insert the b parameter from the top and, on each clock cycle, propagate its value to the PE connected below (this will come in handy when preloading the systolic array with the B matrix). Second, notice how on each clock cycle, new input data a comes from the PE on the left, is multiplied by the stored weight b, and is then accumulated with the partial sum coming from the PE above. The output partial sum is then propagated to the PE below, while the original input data a is propagated to the PE on the right. Finally, I did not draw the control, reset and clock signals, to keep the diagram as simple as possible.
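If it helps, here is a behavioral Python model of this PE (my own sketch, not the author’s actual hardware description; each method call corresponds to one clock cycle):

```python
class PE:
    """Behavioral model of one weight-stationary processing element."""

    def __init__(self):
        self.b = 0.0  # the preloaded weight register

    def load_weight(self, b_in):
        """Preload phase: weights shift in from the top, one per cycle."""
        b_out = self.b   # the old value is passed on to the PE below
        self.b = b_in    # the new value is latched into this PE
        return b_out

    def step(self, a_in, psum_in):
        """Compute phase: one MACC per clock cycle."""
        psum_out = psum_in + a_in * self.b  # propagated to the PE below
        a_out = a_in                        # propagated to the PE on the right
        return a_out, psum_out
```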

Next, we will see how this cascading is used to execute an entire matrix multiplication!

DISCLAIMER: Keep in mind that this is only one possible processing element configuration, which uses the Weight Stationary approach (weights are preloaded into the PEs first, and then the input data from matrix A and the partial sums flow from PE to PE). Other PE designs for matrix multiplication exist, for example, the Output Stationary approach. Other algorithms that can be accelerated using systolic arrays may have simpler or more complex PE designs. For examples of all these other approaches and PE designs, check out the “References” section at the end of the article.

Building a systolic array

Let’s build a 2D array of PEs: a systolic array! As said in the previous section, each PE will receive data from the PEs to its left and top, and will propagate data to the PEs to its right and bottom.

A 2D systolic array. Image by Author.

In order to execute the entire matrix multiplication, we will first preload the entire B matrix into the systolic array (remember, we are using the Weight Stationary approach). Then, we will propagate the input matrix A from left to right, while the partial sums are propagated from top to bottom. The C result matrix is obtained from the outputs of the bottom PEs. Look at the following animation, which illustrates this procedure by executing a [2,2]x[2,2] matrix multiplication on a 2×2 systolic array (notice how we are actually inserting the transposed version of the A matrix):

Execution of a 2×2 matrix multiplication on a Weight Stationary systolic array. Image by Author.

Notice how the data needs to enter the systolic array in this “strided” or skewed fashion, and how the result is obtained in the same way. If the data moves across the systolic array one PE per clock cycle, it’s fairly easy to calculate how many clock cycles a specific configuration needs to obtain a complete result. If you are interested, check out the work by Asgari et al., where the equations are explained in more detail.

What if instead of executing C=A*B we also want to add a matrix (C=A*B+D, sometimes referred to as the Extended GEMM)? Well, we just insert the D matrix values into the partial sum inputs of the PEs at the top of the array!
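To tie the last few paragraphs together, here is a compact cycle-by-cycle Python sketch of the whole array (my own model; the weight-preload cycles are not simulated, the weights are assumed to be already in place). Passing a D matrix gives the Extended GEMM, and leaving it out recovers the plain C=A*B. Note how the edge inputs and the drained outputs are skewed exactly as in the animation:

```python
import numpy as np

def systolic_gemm(A, B, D=None):
    """C = A @ B (+ D) on an N x N weight-stationary systolic array.
    One iteration of the outer loop is one clock cycle."""
    M, N = A.shape
    assert B.shape == (N, N)
    D = np.zeros((M, N)) if D is None else D

    b = B.copy()               # weights already preloaded into the PEs
    a_reg = np.zeros((N, N))   # activation register of each PE
    p_reg = np.zeros((N, N))   # partial-sum register of each PE
    C = np.zeros((M, N))

    for t in range(M + 2 * N - 2):   # cycles until the last result drains
        new_a, new_p = np.zeros((N, N)), np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                # Skewed inputs at the edges; zeros outside the schedule.
                a_in = (a_reg[i][j - 1] if j > 0
                        else (A[t - i][i] if 0 <= t - i < M else 0.0))
                p_in = (p_reg[i - 1][j] if i > 0
                        else (D[t - j][j] if 0 <= t - j < M else 0.0))
                new_a[i][j] = a_in                   # propagate to the right
                new_p[i][j] = p_in + a_in * b[i][j]  # MACC, propagate down
        a_reg, p_reg = new_a, new_p
        # Finished results leave the bottom row, also in skewed fashion.
        for j in range(N):
            m = t - (N - 1) - j
            if 0 <= m < M:
                C[m][j] = p_reg[N - 1][j]
    return C

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
print(systolic_gemm(A, B))  # [[19. 22.] [43. 50.]]
```

The M + 2N − 2 cycle count in the loop also illustrates the timing point from above: the wider and deeper the array, the longer the skewed fill and drain phases take.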

But what do we do if our input matrices are bigger than our systolic array? This is when we use the tiling method. Tiling means partitioning our A and B matrices into patches that can be executed on the systolic array. Let’s see an example of a [16,16]x[16,16] matrix multiplication on an 8×8 systolic array (Ax, Bx and Cx are all 8×8 patches):

Tiling example on a systolic array. Video by Author.
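Here is a minimal numpy sketch of this tiling loop (sizes taken from the example above; np.matmul stands in for one full preload-and-stream pass through the 8×8 array):

```python
import numpy as np

T = 8                              # the systolic array is T x T
A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
C = np.zeros((16, 16))

for i in range(0, 16, T):          # tile row of C
    for j in range(0, 16, T):      # tile column of C
        for k in range(0, 16, T):  # accumulate along the reduction dim
            # Each matmul below is one pass through the T x T array:
            # preload B[k:k+T, j:j+T], then stream A[i:i+T, k:k+T].
            C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]

assert np.allclose(C, A @ B)       # same result as the full GEMM
```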

Note: if the tiling method generates partitions that are smaller than the systolic array (for example, because the matrix dimensions are not exact multiples of the systolic array dimensions), some PEs will stay idle during the computation. Of course, we ideally want to use as many PEs as possible during a computation. Finding operator mapping techniques that maximize PE usage across different workloads is an active research field!

Integrating it into a complete hardware accelerator architecture

Just as an example, to see how all of this plays together in a real use case, let’s add some more modules around our systolic array to build a complete hardware accelerator.

First, we will add memory to store matrix A (known as the input feature map), matrix B (as we already said, the weights), and matrix C (known as the output feature map). Then, we add a module (for example, a DMA) to move data between the accelerator’s internal memory and the external DRAM (needed when the matrices are too big to fit in the accelerator’s internal memory, which will be the case for almost all of today’s popular neural networks, like YoloV7 or EfficientDet-D7). Finally, we add a controller module to orchestrate the movement of data in and out of the accelerator and the order in which the matrices enter the systolic array.

Basic accelerator’s diagram. Image by Author.

Do you want to try a systolic array simulator, play with different scratchpad sizes and processing element array dimensions, and analyse the execution of entire neural network architectures? Check out Scale-SIM [2], a Python simulator for systolic arrays where you can explore the trade-off between internal memory sizes and systolic array dimensions for workloads like convolutional or fully connected layers.

