Neural Networks — A Beginner’s Guide (1.1) | by Shweta | Mar, 2023

By Jessie Hobb On Mar 30, 2023

Building intuition about Neural Networks

Deep Learning has witnessed tremendous growth in the last decade. With applications in image classification, speech recognition, text-to-speech conversion, self-driving cars, etc., the list of problems that Deep Learning has addressed is long and significant. It is therefore necessary to understand the basic structure and working of Neural Networks to appreciate these advancements.

Let us deep dive into learning.

A neural network is a computational learning system that maps input variables to the output variable using an underlying mapping function that is non-linear in nature.

It comprises five essential components:

a. Layers

b. Nodes

c. Activation Function

d. Loss Function

e. Optimizer

We will learn about each of these components in detail.

Simply put, a Neural Network is a stack of layers connected to each other. There are three types of layers in a Neural Network: the Input Layer, which takes the input data; the Hidden Layer, which transforms the input data; and the Output Layer, which generates the prediction for the given inputs after the transformations are applied. The layers close to the Input Layer are called the lower layers, and the layers close to the Output Layer are called the upper layers.

Each layer consists of multiple neurons, also called nodes. In a fully connected network, each node in a given layer is connected to each node in the next layer. A node takes the weighted sum of the inputs from the previous layer, applies a non-linear activation function to it, and generates an output, which then becomes an input to the nodes in the next layer.

Consider a common classification problem of predicting whether a loan applicant will default. The input variables include factors like applicant age, employment type, number of dependents, place of residence, LTV ratio, etc. These variables will make up the input layer.

The number of nodes in the input layer corresponds to the number of independent variables in the data. The number of hidden layers and the number of nodes in each of them are hyperparameters, usually chosen based on the complexity of the problem and the amount of data available.

In a complex problem, the number of layers and the number of nodes in each layer will be larger, and each hidden layer will learn representations not learned at the previous layer. Such neural nets are called ‘Deep Neural Networks’.

For a regression problem, the number of nodes in the output layer is one; for a multiclass classification problem, it is equal to the number of labels/categories; and for a binary classification problem, it is one.
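To make the layer and node counts concrete, here is a minimal sketch that describes a network only by the number of nodes in each layer and counts the weights between them. The function name `count_connections` and the hidden layer sizes (4 and 3) are made up for illustration; only the idea that every node connects to every node in the next layer comes from the text above.

```python
def count_connections(layer_sizes):
    """Count the weights in a fully connected network given its layer sizes."""
    total = 0
    # Every node in one layer connects to every node in the next layer.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
    return total

# Hypothetical loan-default network: 5 input variables, two hidden layers
# (4 and 3 nodes, arbitrary choices) and 1 output node for binary classification.
loan_network = [5, 4, 3, 1]
print(count_connections(loan_network))  # 5*4 + 4*3 + 3*1 = 35
```

Adding layers or nodes increases this count quickly, which is one reason the architecture is treated as a hyperparameter to be tuned.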

The working of a neural network can be broken down to a single node in a given layer.

***Working of a Single Node in a Neural Network (Image by Author)***

As shown above, the single node takes in the following inputs: the bias b and the input variables x1 and x2. It also takes in the weights w1 and w2, one for each input variable. The weights indicate the importance of the input variables.

The node will process the weighted sum of the inputs given as:

z = w1*x1 + w2*x2 + b (Equation 1)

An activation function f is then applied at each node in a given layer to give an output. The output generated by the node after applying the activation function is denoted a.

a = f(z) (Equation 2)

This is the working of a single node in a single layer of a neural network. Networks with multiple layers and nodes also operate using the same principle.
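The two equations can be sketched in a few lines of Python. The input values and weights below are invented for illustration, and the sigmoid is just one possible choice for the activation function f.

```python
import math

def sigmoid(z):
    """One possible activation function (assumed here for illustration)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(x1, x2, w1, w2, b, activation):
    z = w1 * x1 + w2 * x2 + b   # Equation 1: weighted sum plus bias
    return activation(z)        # Equation 2: a = f(z)

# Made-up inputs and weights: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
a = node_output(x1=1.0, x2=2.0, w1=0.5, w2=-0.25, b=0.0, activation=sigmoid)
print(a)  # 0.5, because sigmoid(0) = 0.5
```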

**2 Layer Neural Network (Image by Author)**

In addition to the weighted inputs, we can see that there is another term called the bias ‘b’ in Equation 1 above. What is the role of the bias in a neural network?

A bias is a learnable parameter that shifts the weighted sum and so helps determine when the node activates. It can be read as the negative of the threshold that the weighted sum must reach to activate the node. In a standard fully connected layer, each node has its own bias value.
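The "negative of the threshold" reading is easiest to see with a simple on/off (step) activation, which is an assumption made only for this sketch. Checking whether the weighted sum reaches a threshold is the same as adding a bias equal to minus that threshold and checking whether the result reaches zero. The function names and the numbers are illustrative.

```python
def fires_with_threshold(weighted_sum, threshold):
    """The node activates when the weighted sum reaches the threshold."""
    return weighted_sum >= threshold

def fires_with_bias(weighted_sum, bias):
    """Same rule, rewritten with a bias: activate when sum + bias >= 0."""
    return weighted_sum + bias >= 0

threshold = 1.5       # made-up threshold
bias = -threshold     # the bias is the negative of the threshold

for s in (0.0, 1.0, 1.5, 2.0):
    # Both formulations agree for every weighted sum.
    print(s, fires_with_threshold(s, threshold), fires_with_bias(s, bias))
```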

The data is passed in batches through the input layer, which sends it to the first hidden layer. The neurons in the first hidden layer activate based on the output of the activation function, which takes in the weighted sum of the inputs plus the bias and computes a number, usually in a specific range.

This brings us to the next question — What is an activation function and why do we require it?

In simple terms:

An activation function transforms the input to a node into an output value that is fed to the nodes in the next layer.

In technical terms:

An activation function, also known as a transfer function, defines how the weighted sum of the inputs and the bias is transformed into an output from the node in a given layer. Many activation functions map the output into a given range, e.g. 0 to 1 or -1 to +1, depending on the type of function used.

Activation functions used in Neural Networks are of two types — linear and non-linear.

1. Linear Activation Function:

The function returns its input unchanged: f(z) = z, where z = b + sigma(xi * wi), indexed over all input variables i.

The range of this function is -infinity to +infinity.

A linear activation function is used in the output layer of the neural network when solving regression problems. It is not a good idea to use it in the hidden layers, because a stack of linear layers is itself just a linear function, so the network will not be able to capture the complex relationships in the underlying data.
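A small sketch shows why linear activations in the hidden layers add nothing: two stacked linear nodes can always be replaced by a single one. The one-input nodes and all weight values below are made up to keep the arithmetic easy to follow.

```python
def two_linear_layers(x, w1, b1, w2, b2):
    h = w1 * x + b1      # hidden node with a linear activation
    return w2 * h + b2   # output node with a linear activation

def one_linear_layer(x, w, b):
    return w * x + b

# Made-up weights for the two-layer version
w1, b1, w2, b2 = 2.0, 1.0, 3.0, -0.5

# The equivalent single layer: w = w2*w1 and b = w2*b1 + b2
w, b = w2 * w1, w2 * b1 + b2

for x in (-1.0, 0.0, 2.5):
    print(x, two_linear_layers(x, w1, b1, w2, b2), one_linear_layer(x, w, b))
```

Both columns of output match for every input, so the extra layer has not added any modelling capacity.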

2. Non-Linear Activation Function:

Non-linear activation functions are the most widely used activation functions in Deep Learning. These include the Sigmoid or Logistic function, the Rectified Linear Activation (ReLU), and the Hyperbolic Tangent (Tanh).

Let’s understand each of them in more detail.

1. Sigmoid Activation Function:

Also called the Logistic function, it takes in any real value as input and gives an output between 0 and 1.

Given as y = 1/(1 + e^-z), it has an S-shaped curve. Here z = b + sigma(xi * wi), indexed over the input variables i.

For a very large positive number z, e^-z approaches 0 and the output of the function approaches 1. For a very large negative number z, e^-z is a very large number and thus the output of the function approaches 0.
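The behaviour at both extremes can be checked with a short sketch. The two-branch form below is a common way to avoid floating-point overflow for very negative z. It is an implementation detail added here, not part of the formula itself.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equal form e^z / (1 + e^z)
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-50.0, -2.0, 0.0, 2.0, 50.0):
    print(z, sigmoid(z))
```

The outputs rise from nearly 0, through exactly 0.5 at z = 0, to nearly 1, tracing the S-shaped curve.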

2. Rectified Linear Activation (ReLU):

It is today the most used activation function. ReLU is the identity, and therefore linear, for all input values greater than 0, and outputs 0 for all other inputs. The kink at 0 is what makes the function as a whole non-linear.

It is given as f(x) = max(0, x)
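In code, ReLU is a one-liner. The short check after it illustrates that the function is not linear overall: a linear function would satisfy f(a) + f(b) = f(a + b), and ReLU does not.

```python
def relu(x):
    """Rectified linear activation: passes positives through, zeroes the rest."""
    return max(0.0, x)

print(relu(-3.0), relu(0.0), relu(2.5))  # 0.0 0.0 2.5

# Not linear overall: relu(-1) + relu(1) is 1.0, but relu(-1 + 1) is 0.0
print(relu(-1.0) + relu(1.0), relu(-1.0 + 1.0))
```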

3. Hyperbolic Tangent Activation:

Similar to the logistic function, it takes in any real number as an input, but it outputs a value between -1 and +1.

It is given as: f(z) = (e^z - e^-z) / (e^z + e^-z). Here z = b + sigma(xi * wi), indexed over the input variables i.

The Tanh curve is also S-shaped, but it is centered at 0 and its range is -1 to +1 rather than 0 to 1.
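As a quick check of the formula, the sketch below implements it directly and prints a few values. This direct form is fine for moderate inputs; for very large |z| the exponentials overflow, which is why Python's built-in `math.tanh` is the safer choice in practice.

```python
import math

def tanh(z):
    """Hyperbolic tangent written exactly as in the formula."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # Outputs stay strictly between -1 and +1 and are symmetric around 0.
    print(z, round(tanh(z), 4))
```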

Generally, one activation function is used across all hidden layers, the exception being the output layer. The activation function used in the output layer depends on whether the problem statement requires us to predict a continuous value, i.e. regression, or a categorical value, i.e. binary or multiclass classification.

A neuron can thus be defined as an operation that has two parts — linear component and an activation component i.e. Neuron = Linear + Activation.

All the functions mentioned above, along with their variants, have some limitations, which I will cover in the next article.

So how does a Neural Network learn?

The parameters of the network, i.e. the weights and biases, are initialized with some random values. The input data is passed to the first hidden layer of the network, where each neuron computes its weighted sum.

The first hidden layer will compute the output of all its neurons and will pass it to the neurons in the next hidden layer. Do note that the input values at each layer are transformed by the activation function and then sent to the next layer.

This flow continues till the last layer is reached which then computes the final prediction. This unidirectional flow from the Input Layer to the Output Layer is called a ‘Forward Pass’ or ‘Forward Propagation’.
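A forward pass can be sketched with plain Python lists. Everything concrete here is an assumption made for illustration: the tiny architecture (2 inputs, 2 hidden nodes, 1 output node), the fixed weight values (chosen by hand instead of randomly so the result is reproducible), ReLU in the hidden layer, a sigmoid at the output, and one bias per node, which is the usual convention.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """Compute one layer; weights[j] holds the incoming weights of node j."""
    outputs = []
    for node_weights, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + b
        outputs.append(activation(z))
    return outputs

def forward(x, layers):
    """Pass the input through each layer in turn, from input to output."""
    a = x
    for weights, biases, activation in layers:
        a = dense(a, weights, biases, activation)
    return a

layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu),  # hidden layer, 2 nodes
    ([[1.0, -1.0]], [0.0], sigmoid),                 # output layer, 1 node
]

# Hidden outputs are [0.0, 2.0], so the output node sees z = -2.0
prediction = forward([2.0, 1.0], layers)
print(prediction)
```

The output of each layer becomes the input of the next, which is all that "forward propagation" means.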

Our network has now generated a final output. What happens next?

The predicted value is compared with actual value and the error is computed. The magnitude of the error is given by the loss function.

The loss function estimates how close the distribution of the predicted values is to the distribution of the actual target variable in the training data.

The Maximum Likelihood Estimation (MLE) framework is commonly used to choose the loss function. It seeks the weights under which the observed training data is most likely, which amounts to matching the distribution of the predictions to the distribution of the target variable as closely as possible.

The loss function under the MLE framework for classification problem is Cross Entropy, and for regression problem is Mean Squared Error.

Cross Entropy gives the measure of the difference between two probability distributions of a random variable. In the context of the Neural Networks, it gives the difference between the predicted probability distribution and the distribution of the target variable in the training data set for a given set of weights or parameters.

For a binary classification problem, the loss function used is binary cross entropy and for a multiclass classification problem, the loss function used is categorical cross entropy.

For example, consider a binary classification problem related to customer loan default. Suppose the training data consists of 5 customers.

In the first forward pass, the neural network will compute the probability that each customer defaults. The outputs generated by the network for the 5 customers are [0.65, 0.25, 0.9, 0.33, 0.45] respectively.

The actual values for the observations in the training data are [1, 0, 1, 0, 1].

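Cross Entropy Loss is given as:

CEL = -(1/N) * sigma( yi * log(pi) + (1 - yi) * log(1 - pi) ), indexed over the N observations i

Here yi is the actual value and pi is the predicted probability of class 1 for observation i, and log is the natural logarithm.

Using this equation, the cross entropy loss (CEL) for the above problem is calculated as:

CEL = -(1/5) * ( log(0.65) + log(1 - 0.25) + log(0.9) + log(1 - 0.33) + log(0.45) ) = 2.0228 / 5 ≈ 0.4046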

Here, the binary cross entropy calculates a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The loss given the actual and predicted values of the target variable is approximately 0.404. How do we interpret this value? It has a relative interpretation: lower is better, and a perfect set of predictions would give a loss of 0. The final model should have a loss value much lower than 0.404. The fifth and last building block will enable us to approach that optimal value. It does this by searching for the values of the weights and biases that minimize the loss function.
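The loss for the five customers can be reproduced with a few lines of Python. This sketch assumes the standard binary cross entropy with the natural logarithm, which is the form that yields the value quoted above. The function name is illustrative.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Average binary cross entropy, using the natural logarithm."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Only one of the two terms is non-zero for each observation.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

actual = [1, 0, 1, 0, 1]
predicted = [0.65, 0.25, 0.9, 0.33, 0.45]

loss = binary_cross_entropy(actual, predicted)
print(round(loss, 4))  # about 0.4046
```

The largest contribution comes from the fifth customer, who actually defaulted but was given a probability of only 0.45.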

In the case of a multiclass classification problem, where the target variable is encoded as categories 0 to n-1, the categorical cross entropy will calculate the score that summarizes the average difference between the actual and predicted probability distributions for all the classes.
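For the multiclass case, a minimal sketch looks like this. It assumes the classes are encoded as integers starting at 0 and that the network outputs one probability per class for each observation. The three-class data is made up.

```python
import math

def categorical_cross_entropy(true_classes, predicted_probs):
    """Average of -log(probability the network assigned to the true class)."""
    total = 0.0
    for k, probs in zip(true_classes, predicted_probs):
        total += -math.log(probs[k])
    return total / len(true_classes)

# Made-up example: 3 observations, 3 classes encoded as 0, 1, 2
true_classes = [0, 2, 1]
predicted_probs = [
    [0.7, 0.2, 0.1],    # true class 0 gets probability 0.7
    [0.1, 0.3, 0.6],    # true class 2 gets probability 0.6
    [0.25, 0.5, 0.25],  # true class 1 gets probability 0.5
]

loss = categorical_cross_entropy(true_classes, predicted_probs)
print(loss)
```

With only two classes this reduces to the binary cross entropy.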

Similarly, Mean Squared Error (MSE) is the most commonly used loss function for a regression problem. MSE is calculated as the average of the squared differences between the predicted and actual values of the target variable. The output is never negative, as each term is the square of an error.

There are alternatives to the MSE, like the Mean Squared Logarithmic Error (MSLE) and the Mean Absolute Error (MAE). The choice depends on a number of factors, like the presence of outliers, the distribution of the target variable, and others.
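The two most common regression losses can be sketched as follows. The actual and predicted values are invented for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences; less sensitive to outliers."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: the errors are 1, 0 and -2
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

mse = mean_squared_error(actual, predicted)   # (1 + 0 + 4) / 3
mae = mean_absolute_error(actual, predicted)  # (1 + 0 + 2) / 3
print(mse, mae)
```

Squaring makes the largest error dominate the MSE, which is why MAE is often preferred when the data contains outliers.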

The output generated by the network in the first forward pass is a result of the weights that were initialized to some random values. The loss function compares the actual and predicted values and computes the error. The next step is to minimize the error by changing the weights. How does the network achieve this?

This brings us to the last building block of a Neural Network, i.e. the Optimizer.

5. Optimizer

As we discussed in the earlier section, in a neural network the learning takes place in the weights. Training a neural network involves learning the correct weights (and biases) associated with all the neurons in all the layers. This is achieved by using the Stochastic Gradient Descent algorithm together with the Back Propagation algorithm.
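To give a flavour of what "learning the weights" means, here is a heavily simplified sketch: plain gradient descent on a single sigmoid node with one input, using the whole toy dataset at every step. It is not the stochastic variant, and it leaves out Back Propagation through multiple layers. The data, the learning rate, and the number of steps are all made up, and the weights start at zero instead of random values so the run is repeatable.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce(y_true, y_pred):
    """Average binary cross entropy (natural logarithm)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

# Toy data, invented for illustration: one input variable, binary target
xs = [0.5, -1.0, 2.0, -0.5, 1.5]
ys = [1, 0, 1, 0, 1]

w, b = 0.0, 0.0       # starting parameters
learning_rate = 0.5   # hypothetical step size
losses = []

for step in range(100):
    preds = [sigmoid(w * x + b) for x in xs]   # forward pass
    losses.append(bce(ys, preds))              # loss before this update
    # Gradients of the loss with respect to w and b for a sigmoid node
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
    # Move the parameters a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(losses[0], 4), "->", round(losses[-1], 4))
```

Each step nudges the parameters in the direction that lowers the loss, so the printed loss after training is smaller than the loss at the start.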

Given this is a much more complex concept as compared to the ones covered above, let us look at this in detail in the next article. All the building blocks covered here also deserve a more detailed explanation which will be done in the subsequent articles.

The key takeaway from this article is that the final Neural Network model is a function of the overall architecture, i.e. the number of nodes, layers, etc., and the optimal values of the parameters, a.k.a. the weights. Once we have addressed both these components, we can go ahead and predict the target variable with confidence.

Here are a few links I found extremely helpful in understanding this concept.

https://youtu.be/PySo_6S4ZAg — This is the Stanford CS230 course on Neural Networks by Andrew Ng.
https://amzn.eu/d/6U4c3GR — Deep Learning with Python, Second Edition. Amazing book. The concepts are explained in a very simple language.
https://machinelearningmastery.com/ — one source for all basic and intermediate questions on Deep Learning and Machine Learning.

Hope by now you have some understanding of what neural nets are and how the various building blocks come together to solve a deep learning problem. Let me know your thoughts.