
Neural Network from Scratch (No NumPy) | by Piotr Lachert



Photo by Kevin Canlas on Unsplash

AI has been developing at an alarming rate for the last several years. There are already plenty of high-level frameworks that let you build a model and run training without worrying about all the technical details. That’s great news. Not only does it make things easier, it also makes AI algorithms accessible to more people. On the other hand, it comes at a price: we no longer have to deeply understand the details of neural networks (NNs), and that can be a problem. In order to be great at what you do, you need to go deeper. Just to be clear: I don’t mean we should all go back to the beginning and code everything ourselves. What I mean is that we should return to the basics from time to time and make sure we understand what is actually going on when we train our models. In my opinion, the best way to check whether you really understand how neural networks learn is to implement the whole process yourself.

This article describes the process of implementing some basic neural networks in Python. I intentionally limited myself to the standard Python library. I also wanted the article to be valuable to those who are not familiar with linear algebra, so they can gain intuition about how neural networks learn. Let’s get started.

Before we dive into details, let’s mention some basic facts about neural networks:

  • a neural network consists of layers that perform certain calculations on the input and pass the calculated values to the next layer
  • the calculations performed in the layers depend on some numbers called weights
  • the weights can be modified in a process called training, in order to change the transformation of the input so that the output fits the training data better in terms of some objective
  • the objective is called the loss function. It tells us how wrong the model is, e.g. when training a model to predict a person’s age from their voice, the loss could be the mean absolute error in years over the training samples
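To make the age example concrete, here is the mean absolute error computed over a handful of hypothetical training samples (the numbers are made up for illustration):

```python
# Hypothetical predictions and true ages for three training samples.
predicted_ages = [34.0, 51.0, 27.0]
true_ages = [30.0, 50.0, 31.0]

# Mean absolute error: the average of |prediction - target|, here in years.
mae = sum(abs(p - t) for p, t in zip(predicted_ages, true_ages)) / len(true_ages)
print(mae)  # 3.0
```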

If you’re unfamiliar with these concepts or want to learn more I recommend a great series on Deep Learning by 3Blue1Brown.

The code is available here. I wanted the implementation to be easy to follow within the scope discussed in the article. It is highly inefficient, as there is no vectorization or linear algebra involved.

Forward pass

Let’s start with building everything that is needed to perform calculations on the input. Neural networks are made up of layers, so we definitely are going to need an object to represent a layer. Let’s make it an abstract class. By doing so we don’t have to think about implementation details yet.
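The original gist is not reproduced here; a minimal sketch of such a layer template (the class and method names are my assumption, the author’s code may differ) could look like this:

```python
from abc import ABC, abstractmethod


class Layer(ABC):
    """Template for all layers: every layer is callable on a list of numbers."""

    @abstractmethod
    def __call__(self, inputs):
        """Take a list of numbers and return the transformed list."""
        ...
```

Any concrete layer then only has to implement `__call__`.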

Now that we have defined a layer template, we can create a model (neural network).
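A minimal sketch of such a model, assuming it simply chains layer calls and remembers each layer’s output (the attribute name `outputs` follows the article; everything else is a guess):

```python
class Model:
    """A plain sequential model: calls each layer in turn on the previous output."""

    def __init__(self, layers):
        self.layers = layers
        self.outputs = []  # output of every layer from the last forward pass

    def __call__(self, x):
        self.outputs = []
        for layer in self.layers:
            x = layer(x)
            self.outputs.append(x)
        return x
```

Here any callable works as a layer, which makes the model easy to test in isolation.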

For now the model can only perform the forward pass on the input and return the output. The output of each layer is stored in the outputs attribute; it is going to be needed later.

One more thing we need to do is implement a concrete layer, as we only have a template. As you might expect, we are going to create a linear layer. Before diving into the code, let’s recap what a linear layer does:

Fig. 1: Linear layer with 2 inputs and 2 outputs

As you can see, we need to define the size of the input (how many numbers are going to be passed to the layer) and the size of the output. We can also make a distinction between the parameters by which the inputs are multiplied (wjk) and the parameters that are added to the output (bk). We are going to call the former weights and the latter biases. Let’s also stick to the commonly used index convention: wjk is the weight between the j-th input neuron and the k-th output neuron. With all that in mind, we are ready to code the linear layer.

The above code does two things:

  • initializes the weights (random numbers in the range (-0.1, 0.1)) and the biases (random numbers from 0 to 1)
  • computes the output of the layer by applying the linear transformation (see Fig. 1)
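A sketch of how such a linear layer could look, using the initialization ranges from the bullet points above (the layer also remembers its input, which will be needed for backpropagation later; the exact structure is my assumption):

```python
import random


class LinearLayer:
    """output_k = sum_j(inputs[j] * w[j][k]) + b[k]  (see Fig. 1)."""

    def __init__(self, input_size, output_size):
        # weights[j][k]: weight between the j-th input and the k-th output neuron
        self.weights = [[random.uniform(-0.1, 0.1) for _ in range(output_size)]
                        for _ in range(input_size)]
        self.biases = [random.random() for _ in range(output_size)]

    def __call__(self, inputs):
        self.inputs = inputs  # remembered for backpropagation
        return [
            sum(inputs[j] * self.weights[j][k] for j in range(len(inputs)))
            + self.biases[k]
            for k in range(len(self.biases))
        ]
```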

Now we can already create a model that consists of linear layers. However, it’s not really useful until we can train it. Let’s implement backpropagation.

Backpropagation

Again, before moving on to the implementation, let’s think about the remarkable idea that makes training neural networks possible: backpropagation. It may not be 100% correct to talk about the backpropagation algorithm in isolation from linear algebra, but for educational purposes let’s do it anyway. Although I think it’s generally a good idea to get familiar with some basics of linear algebra, it may be easier to understand backpropagation by looking at one parameter at a time. Let’s get started.

Fig. 2: Simple neural network with 3 parameters (without biases).

For simplicity, let’s consider a very simple neural network that takes one number (feature), performs three linear transformations (with bias=0) and outputs one number (see Fig. 2). Also, let’s consider only one training sample (X=1, Y=12).

I like to think about backpropagation as a tool that enables you to disassemble the neural network and look at one layer at a time. The first step in the algorithm is to take the loss and see how it would change if we changed the output of the model (O3):

Fig. 3: Starting backpropagation by calculating the error at the last layer

What does dL/dO3 tell us? It tells us that if we increased O3 by a small amount, the loss would decrease by 12 times that amount. Notice that if O3 were higher than Y, dL/dO3 would be positive. We need this information to know whether O3 should be increased or decreased. Also, if O3 is equal to Y, the derivative (dL/dO3) is 0: the model is perfect and O3 is just right. OK, let’s implement this part:
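The figures are consistent with a squared-error loss (36 = (6 - 12)², and dL/dO3 = 2·(6 - 12) = -12), so here is a sketch under that assumption, with the two method names the article lists:

```python
class SquaredErrorLoss:
    """L = (y_hat - y)^2 for a single one-element sample."""

    def compute_cost(self, y_hat, y):
        return (y_hat[0] - y[0]) ** 2

    def compute_loss_grad(self, y_hat, y):
        # dL/dO3: positive when the prediction overshoots the target,
        # negative when it undershoots, zero when it is exactly right.
        return 2 * (y_hat[0] - y[0])
```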

As you can see, the loss object needs two methods:

  • compute_cost, which returns the value of the loss function for a given prediction (y_hat)
  • compute_loss_grad, which is the same as dL/dO3 (see Fig. 3)

Notice that we take the first index of y and y_hat. This is because in the current implementation linear layers output lists even when there is only one number. For consistency, the target (y) is also passed as a one-element list. As we are going to train with batch_size=1, that’s OK. In a proper implementation they would probably be arrays or tensors, and the results would be averaged.

Let’s move on to the next step in backpropagation:

Fig. 4: Calculating gradients in the last layer

We already know that O3 should be increased in order to decrease the loss. Let’s now think about how O3 can be changed. It can be affected in two ways: by changing the input (I3) or by modifying the weight (w3). Let’s consider how these changes affect the loss:

Fig. 5: Derivatives of the loss function with respect to the input to the third layer and its weight.

As you can see, we can determine how the loss would change with respect to the weight and the input of a layer by looking at the “error” of the following layer and the input. This is the core idea behind the backpropagation algorithm. The derivative dL/dw tells us how the weight should be updated; we keep this information for later. The other one (dL/dI) is passed to the previous layer in the same manner as before, as we need it to calculate the derivatives in the next step (in the previous layer). Putting it all together:

Fig. 6: Derivatives used in backpropagation.

The derivatives on the right-hand side are used to update the weights. This is the final step of the backpropagation algorithm. As you can see, the derivatives are pretty big, e.g. dL/dw(1) tells us that if we increased w(1) by some value, the loss function would decrease by 72 times that value. That’s why we use small learning rates to update the weights:

Fig. 7: Updating the weights. The loss decreased from 36 to 29.16.
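The figures pin down the products of the weights but not the individual values; one assignment that reproduces every number above (my assumption, not stated in the article) is w1, w2, w3 = 1, 2, 3 with a learning rate of 0.001. The whole worked example can then be checked by hand:

```python
# Assumed weight configuration consistent with the figures:
# X = 1, Y = 12, w1, w2, w3 = 1, 2, 3  ->  output O3 = 6, loss (6 - 12)^2 = 36.
X, Y = 1.0, 12.0
w1, w2, w3 = 1.0, 2.0, 3.0

# Forward pass through the three scalar layers.
I2 = w1 * X    # output of layer 1 / input of layer 2
I3 = w2 * I2   # output of layer 2 / input of layer 3
O3 = w3 * I3   # model output

# Backward pass: at each layer, error * input gives the weight gradient,
# and error * weight is the error passed to the previous layer (Fig. 5).
dL_dO3 = 2 * (O3 - Y)   # -12, as in Fig. 3
dL_dw3 = dL_dO3 * I3
dL_dI3 = dL_dO3 * w3
dL_dw2 = dL_dI3 * I2
dL_dI2 = dL_dI3 * w2
dL_dw1 = dL_dI2 * X     # -72, as in Fig. 6

# Gradient-descent update; lr = 0.001 reproduces the numbers in Fig. 7.
lr = 0.001
w1, w2, w3 = w1 - lr * dL_dw1, w2 - lr * dL_dw2, w3 - lr * dL_dw3
new_loss = (w3 * w2 * w1 * X - Y) ** 2   # roughly 29.16, down from 36
```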

Let’s go back to the implementation of the layer object and add 3 more methods:

  • compute_input_errors (dL/dI)
  • compute_gradients (dL/dw for the weights, dL/db for the biases)
  • update_params: for simplicity, let the layers handle updating their own parameters given the gradients and the learning rate (see Fig. 7)

Updated LinearLayer:
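A sketch of how the updated LinearLayer could look (the layer remembers its input during the forward pass; the exact signatures and gradient data structures are my assumption):

```python
import random


class LinearLayer:
    def __init__(self, input_size, output_size):
        self.weights = [[random.uniform(-0.1, 0.1) for _ in range(output_size)]
                        for _ in range(input_size)]
        self.biases = [random.random() for _ in range(output_size)]

    def __call__(self, inputs):
        self.inputs = inputs  # remembered for compute_gradients
        return [sum(inputs[j] * self.weights[j][k] for j in range(len(inputs)))
                + self.biases[k] for k in range(len(self.biases))]

    def compute_input_errors(self, output_errors):
        # dL/dI_j: every input neuron feeds all output neurons,
        # so the contributions are summed over k.
        return [sum(output_errors[k] * self.weights[j][k]
                    for k in range(len(output_errors)))
                for j in range(len(self.weights))]

    def compute_gradients(self, output_errors):
        # dL/dw[j][k] = error_k * input_j;  dL/db[k] = error_k
        weight_grads = [[output_errors[k] * self.inputs[j]
                         for k in range(len(output_errors))]
                        for j in range(len(self.weights))]
        bias_grads = list(output_errors)
        return weight_grads, bias_grads

    def update_params(self, gradients, learning_rate):
        weight_grads, bias_grads = gradients
        for j in range(len(self.weights)):
            for k in range(len(self.weights[j])):
                self.weights[j][k] -= learning_rate * weight_grads[j][k]
        for k in range(len(self.biases)):
            self.biases[k] -= learning_rate * bias_grads[k]
```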

As you can see, compute_input_errors takes the errors of the following layer (output_errors) and calculates the errors with respect to its own input (dL/dI, see Fig. 6). The only difference is that in general a layer can have more than one input neuron, so the error has to be calculated with respect to each input node separately. Notice that each input neuron affects all output nodes, so the effects have to be summed. The bias term is not relevant here, because it is added to the output neurons; it has nothing to do with the input.

Let’s also take a closer look at compute_gradients. One thing that has not been discussed yet is the bias term. It is pretty easy, though: as the bias is simply added to the output neuron, dL/db is equal to the error for that neuron.

One last thing that we need to do is add some methods to the model object:
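One way the model object could look with these methods, sticking to the method names the article mentions (fit, backpropagation_step, update_layers; the exact signatures are my assumption, and here each layer remembers its own input rather than reading the model’s outputs list):

```python
class Model:
    """Sequential model with training; method names follow the article."""

    def __init__(self, layers, loss):
        self.layers = layers
        self.loss = loss
        self.outputs = []

    def __call__(self, x):
        # Forward pass: each layer's output is stored for later inspection.
        self.outputs = []
        for layer in self.layers:
            x = layer(x)
            self.outputs.append(x)
        return x

    def backpropagation_step(self, y, y_hat):
        # Start from the loss gradient and walk the layers backwards,
        # collecting the parameter gradients for each layer.
        errors = [self.loss.compute_loss_grad(y_hat, y)]
        gradients = []
        for layer in reversed(self.layers):
            gradients.append(layer.compute_gradients(errors))
            errors = layer.compute_input_errors(errors)
        return list(reversed(gradients))  # reorder to match self.layers

    def update_layers(self, gradients, learning_rate):
        for layer, grads in zip(self.layers, gradients):
            layer.update_params(grads, learning_rate)

    def fit(self, X, Y, epochs, learning_rate):
        for epoch in range(epochs):
            # batch_size=1: one sample at a time
            for x, y in zip(X, Y):
                y_hat = self(x)
                gradients = self.backpropagation_step(y, y_hat)
                self.update_layers(gradients, learning_rate)
            # Loss over the whole dataset, to check convergence.
            cost = sum(self.loss.compute_cost(self(x), y) for x, y in zip(X, Y))
            print(f"epoch {epoch}: loss = {cost}")
```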

Let’s go through the code, starting from the fit method. In order to train a network we usually need to go through the dataset many times; we call each pass an epoch. In this implementation we can only provide one sample at a time. We start with a forward pass that gives us our first prediction (y_hat). Notice that all the outputs generated in the forward pass are stored for later use: we use them in backpropagation_step. There we start by computing the loss gradient and enter a loop where we go one layer at a time and calculate the gradients, just as discussed earlier. For each layer we store the gradients needed to update its parameters; in the case of linear layers (for now they’re the only ones) these are dL/dw and dL/db. The gradients are then used by the update_layers method, which goes through each layer and modifies its parameters. Lastly, the loss over the whole dataset is calculated to check whether the training converges.

Adding nonlinearity

Stacking linear layers doesn’t make much sense, since a composition of linear transformations is still linear. Let’s equip our model with a simple activation function, ReLU:
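A sketch of such a ReLU layer, with the same backward interface as the linear layer so the model can treat both uniformly (the no-op methods for a parameterless layer are my assumption):

```python
class ReLU:
    """Element-wise max(0, x); no trainable parameters."""

    def __call__(self, inputs):
        self.inputs = inputs  # remembered to route gradients in the backward pass
        return [x if x > 0 else 0.0 for x in inputs]

    def compute_input_errors(self, output_errors):
        # The gradient passes through unchanged where the input was positive
        # and is blocked (zeroed) elsewhere.
        return [err if x > 0 else 0.0
                for err, x in zip(output_errors, self.inputs)]

    def compute_gradients(self, output_errors):
        return None  # nothing to learn

    def update_params(self, gradients, learning_rate):
        pass  # nothing to update
```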

The layer passes through only the inputs that are greater than zero and outputs zero for the rest (see the if/else statement in the __call__ method). This layer has no parameters, so we only need to specify how the output of the layer is affected by the input (see compute_input_errors). This is fairly simple: the output is the same as the input unless the input was less than zero, in which case both the output and the gradient are zero.

Learning a nonlinear function with ReLU activations

As you can see, the model can already learn a simple nonlinear function.

I hope that you now have a much better understanding of how neural networks work. Although the code is highly inefficient, as all the calculations are made on lists (one element at a time), I think it is worth doing it this way once.

