Building a Deep Neural Network from Scratch using Numpy | by Riccardo Andreoni | Sep, 2022

By Jessie Hobb On Sep 22, 2022

Modern Deep Learning libraries are powerful tools but they may lead practitioners to take for granted neural networks’ functioning principles

In this project, I build a deep neural network without the aid of any deep learning library (Tensorflow, Keras, Pytorch). The reason for imposing myself on this task is that, nowadays, it is effortless to build deep and complex neural networks using the high-level tools provided by multiple python libraries. Undoubtedly, this is a great advantage for Machine Learning professionals: we can create powerful models with just a few lines of code. However, this approach has the massive downside of leaving the functioning of those networks unclear as they happen “under the hood”.

Building a Deep Neural Network from scratch is a great exercise for anyone who wants to solidify their understanding of these amazing tools.

The article will cover both the theoretical and practical parts. The theoretical part is mandatory to understand the implementation. For the theory, we need basic knowledge of algebra and calculus, while for the coding part only built-in Python functions and Numpy will be used.

This approach differs from other implementations for the strategy of storing the cached values. Also, differently from most implementations, this code allows us to compare infinite possible network architectures as the number of layers and activation units is defined by the user.

In this application, I create a Deep Neural Network to solve the famous MNIST classification problem.

The MNIST dataset is a large database of handwritten digits. The dataset contains 70,000 small images (28 x 28 pixels), each one of them being labeled.

Handwritten digits from the MNIST dataset

In this section, I will outline the theoretical part of the application. I will define all the matrices for each step of the forward propagation and backpropagation, with particular attention to clarifying all the matrix dimensions.

Input

The input consists of m training images of shape 28×28 pixels. Consequently, each image is represented by a 1-dimensional array of size 784. In order to speed up the computations, I will take advantage of the vectorization technique. I will store the entire training set in a single matrix X. Each column of X represents a training example:

The dimensions are:

Forward Propagation

To clarify the explanation, let’s assume building a neural network composed of:

input layer
1 hidden layer of size 10 with ReLu activation function
1 hidden layer of size 10 with Softmax activation function
output layer

All the matrices and computation can be easily extended to a fully connected network of any architecture.

The forward propagation, for each layer, is composed of 2 steps:

application of weights and biases
computation of the activation function

For hidden layer 1 we use matrix multiplication and matrix addition to apply the weights and biases:

Then, we need to compute the selected activation function:

Following the matrix multiplication rules, the dimensions are:

The same is done for layer 2:

And the matrix dimensions are:

In general, for any layer l the two steps are carried out through these simple equations:

At the end of forward propagation, reaching layer L, we compute the prediction:

Backpropagation

The purpose of backpropagation is to compute the partial derivative of the loss function with respect to the weights of each layer of the network. Once we know the derivatives, we can apply gradient descent optimization to tune their values.

The first step of backpropagation is to compute the error of the predictions. Considering layer 2 as the final layer, we have:

where:

Now we can compute the derivatives of the loss function with respect to the weights and biases of layer 2:

The dimensions are as follows:

Once we know all the derivatives of the final layers, the backpropagation process consists in traveling backward through the network’s layers and computing the partial derivatives as follows:

Parameter Update

Knowing the gradients of the loss function, we know in which direction to move to reach an optimum. As a consequence, we update the parameters:

This section presents all the functions used to implement the deep neural network. The complete code can be found on my GitHub repository.

In the above linked GitHub repository, you will find 5 files:

“README.md”: it’s a markdown file presenting the project
“train.csv”: it’s a CSV file containing the training set of the MNIST dataset
“test.csv”: it’s a CSV file containing the test set of the MNIST dataset
“main.py”: it’s a Python script from where we will run the neural network
“utils.py”: it’s a Python file in which we define the function needed to build the neural network

We will mainly focus on the “utils.py” file since it’s where most of the network implementation is.

The first function is init_params. It takes as input the dimensions of the layers and it returns a dictionary containing all the weights and biases randomly initialized: