Backpropagation in Neural Networks

Do you know how a neural network trains itself to do some job? How does it learn? In this article, we will see the whole process of how a neural network learns.

The main goal of a network is to reduce the loss incurred while predicting the outputs. To minimize this loss, we apply an optimization technique called gradient descent. In this technique, we update the parameters while backpropagating through the network: we find the derivatives of the error function with respect to the weights and use this gradient to update the current weights so that the loss decreases. For basic intuition about this technique, refer to the article below:

What does backpropagation mean?

It is a method to calculate the gradient of the loss with respect to every weight using the chain rule. Gradient descent then uses this gradient to update the weights and decrease the loss, which is how the network absorbs information from the training data.

Let’s understand backpropagation through a small neural architecture.

This network contains an input layer with two feature neurons and a bias neuron, a hidden layer with two hidden neurons and a bias neuron, and an output layer with two units. The hidden layer uses the ReLU activation function, and the output layer uses the sigmoid activation.

The pre-activation of the hidden layer:
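Written out, the hidden pre-activations could look like the sketch below. It assumes that a weight such as W111 is read as W^{(1)}_{11}, i.e., layer 1, from input neuron 1 to hidden neuron 1; this index order is an assumption. The symbols x_1, x_2 (inputs), z^{(1)}_j (pre-activations), and b^{(1)}_j (weight on the bias neuron) are names introduced here for illustration.

```latex
\begin{aligned}
z^{(1)}_1 &= W^{(1)}_{11}\, x_1 + W^{(1)}_{21}\, x_2 + b^{(1)}_1 \\
z^{(1)}_2 &= W^{(1)}_{12}\, x_1 + W^{(1)}_{22}\, x_2 + b^{(1)}_2
\end{aligned}
```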

After applying the activation (ReLU) to the above outputs:

ReLU(x) = g(x) = max(0, x)

Pre-activation for the output layer:
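A sketch of the output pre-activations follows. It assumes that h_j = max(0, z^{(1)}_j) denotes the ReLU output of hidden neuron j, and that W^{(2)}_{jk} (the article's W2jk, with this index order assumed) connects hidden neuron j to output unit k. The names h_j, z^{(2)}_k, and b^{(2)}_k are introduced here for illustration.

```latex
\begin{aligned}
z^{(2)}_1 &= W^{(2)}_{11}\, h_1 + W^{(2)}_{21}\, h_2 + b^{(2)}_1 \\
z^{(2)}_2 &= W^{(2)}_{12}\, h_1 + W^{(2)}_{22}\, h_2 + b^{(2)}_2
\end{aligned}
```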

Applying the activation (sigmoid) to the above outputs:

Sigmoid(x) = g(x) = 1 / (1 + e^(-x))
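To make the forward pass concrete, here is a minimal sketch in plain Python for the 2-2-2 network described in this article. The function names, the list layout (W1[i][j] connects input i to hidden unit j, W2[j][k] connects hidden unit j to output k), and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def relu(v):
    # ReLU(x) = max(0, x)
    return max(0.0, v)


def sigmoid(v):
    # Sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, W1, b1, W2, b2):
    """Forward pass: inputs -> hidden (ReLU) -> outputs (sigmoid)."""
    # Hidden pre-activations: weighted sum of the inputs plus the bias weight
    z1 = [sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j]
          for j in range(len(b1))]
    h = [relu(z) for z in z1]
    # Output pre-activations: weighted sum of the hidden activations plus bias
    z2 = [sum(h[j] * W2[j][k] for j in range(len(h))) + b2[k]
          for k in range(len(b2))]
    y_hat = [sigmoid(z) for z in z2]
    return z1, h, z2, y_hat


# Hypothetical inputs and weights (assumed values, not taken from the article)
x = [1.0, 2.0]
W1 = [[0.1, -0.3], [0.2, 0.1]]
b1 = [0.0, 0.0]
W2 = [[0.5, -0.2], [0.3, 0.1]]
b2 = [0.0, 0.0]

z1, h, z2, y_hat = forward(x, W1, b1, W2, b2)
print("hidden activations:", h)
print("predictions:", y_hat)
```

With these values the second hidden neuron receives a negative pre-activation, so ReLU sets its output to zero and only the first hidden neuron contributes to the output layer.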

Let’s perform Gradient Descent with backpropagation.

Before moving forward, you should know what the delta rule is.

The delta rule is a gradient descent learning rule for updating the weights of the inputs to the neurons in an artificial neural network. For more about the delta rule, go through the link:

https://en.wikipedia.org/wiki/Delta_rule/

The main aim is to minimize the error, i.e., ½ (Yactual - Ypredicted)^2. To do that, we need to find the change in error with respect to the weights.
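As a small sketch, the error above can be computed as follows. Summing the per-unit terms over the two output units is an assumption made here, and the function name and the sample values are hypothetical.

```python
def squared_error(y_actual, y_predicted):
    # 1/2 * (Yactual - Ypredicted)^2, summed over the output units (assumed)
    return sum(0.5 * (a - p) ** 2 for a, p in zip(y_actual, y_predicted))


# Hypothetical targets and predictions for the two output units
example_error = squared_error([1.0, 0.0], [0.5, 0.5])
print(example_error)
```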

Let’s find the change in error with respect to W211. For reference, see the architecture given above:
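One way to write this derivative with the chain rule is sketched below. It assumes W211 is read as W^{(2)}_{11}, the weight from hidden neuron 1 to output unit 1, and that the error is E = ½ Σ_k (y_k − ŷ_k)² with ŷ_k = sigmoid(z^{(2)}_k). The symbols E, y_k, ŷ_k, z^{(2)}_k, and h_1 (the ReLU output of hidden neuron 1) are names introduced here for illustration.

```latex
\begin{aligned}
\frac{\partial E}{\partial W^{(2)}_{11}}
  &= \frac{\partial E}{\partial \hat{y}_1}
     \cdot \frac{\partial \hat{y}_1}{\partial z^{(2)}_1}
     \cdot \frac{\partial z^{(2)}_1}{\partial W^{(2)}_{11}} \\
  &= (\hat{y}_1 - y_1) \cdot \hat{y}_1 (1 - \hat{y}_1) \cdot h_1
\end{aligned}
```

The middle factor uses the fact that the derivative of the sigmoid is sigmoid(z)(1 − sigmoid(z)). Only output unit 1 appears because this weight feeds no other output.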

Similarly, we can find the derivatives for W221, W222, and W212. But for the weights between the input layer and the hidden layer, the derivatives are calculated as follows:
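A sketch for W111 follows, read here as W^{(1)}_{11}, the weight from input 1 to hidden neuron 1 (an assumed index order). Because hidden neuron 1 feeds both output units, the error signal is summed over both. The symbols E, y_k, ŷ_k, z^{(1)}_1, h_1, and x_1 are names introduced here for illustration, with E = ½ Σ_k (y_k − ŷ_k)², ŷ_k the sigmoid outputs, and g the ReLU function.

```latex
\begin{aligned}
\frac{\partial E}{\partial W^{(1)}_{11}}
  &= \left( \sum_{k=1}^{2}
       \frac{\partial E}{\partial \hat{y}_k}
       \cdot \frac{\partial \hat{y}_k}{\partial z^{(2)}_k}
       \cdot \frac{\partial z^{(2)}_k}{\partial h_1} \right)
     \cdot \frac{\partial h_1}{\partial z^{(1)}_1}
     \cdot \frac{\partial z^{(1)}_1}{\partial W^{(1)}_{11}} \\
  &= \left( \sum_{k=1}^{2} (\hat{y}_k - y_k)\, \hat{y}_k (1 - \hat{y}_k)\, W^{(2)}_{1k} \right)
     \cdot g'\!\left(z^{(1)}_1\right) \cdot x_1,
  \qquad
  g'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}
\end{aligned}
```

Taking g'(0) = 0 is a common convention, since ReLU is not differentiable at zero.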

Similarly, we can find the derivatives for W112, W121, and W122. Now that we have the change in error with respect to each weight, we can update the weights using:

W = W – (learning rate * d(error)/d(W))
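The update rule can be tried end to end with a small sketch in plain Python. It assumes the 2-2-2 network of this article (ReLU hidden layer, sigmoid outputs, squared error summed over both outputs), a single training example, and hypothetical starting weights and learning rate. All function names and values are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, params):
    W1, b1, W2, b2 = params
    z1 = [x[0] * W1[0][j] + x[1] * W1[1][j] + b1[j] for j in range(2)]
    h = [max(0.0, z) for z in z1]  # ReLU
    z2 = [h[0] * W2[0][k] + h[1] * W2[1][k] + b2[k] for k in range(2)]
    return z1, h, [sigmoid(z) for z in z2]


def error(y, y_hat):
    return sum(0.5 * (a - p) ** 2 for a, p in zip(y, y_hat))


def gradients(x, y, params):
    """Backpropagation: d(error)/d(W) for every weight and bias."""
    W1, b1, W2, b2 = params
    z1, h, y_hat = forward(x, params)
    # Output layer: (y_hat - y) * sigmoid'(z2)
    d2 = [(y_hat[k] - y[k]) * y_hat[k] * (1.0 - y_hat[k]) for k in range(2)]
    dW2 = [[h[j] * d2[k] for k in range(2)] for j in range(2)]
    # Hidden layer: error flowing back through W2, times ReLU'(z1)
    d1 = [sum(d2[k] * W2[j][k] for k in range(2)) * (1.0 if z1[j] > 0 else 0.0)
          for j in range(2)]
    dW1 = [[x[i] * d1[j] for j in range(2)] for i in range(2)]
    return dW1, d1, dW2, d2


def update(params, grads, lr):
    # W = W - (learning rate * d(error)/d(W)), applied to every parameter
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    return (
        [[W1[i][j] - lr * dW1[i][j] for j in range(2)] for i in range(2)],
        [b1[j] - lr * db1[j] for j in range(2)],
        [[W2[j][k] - lr * dW2[j][k] for k in range(2)] for j in range(2)],
        [b2[k] - lr * db2[k] for k in range(2)],
    )


# Hypothetical training example and starting parameters (assumed values)
x, y = [1.0, 2.0], [1.0, 0.0]
params0 = ([[0.1, 0.3], [0.2, 0.1]], [0.0, 0.0],
           [[0.5, -0.2], [0.3, 0.1]], [0.0, 0.0])

initial_error = error(y, forward(x, params0)[2])
params = params0
for _ in range(100):
    params = update(params, gradients(x, y, params), lr=0.5)
final_error = error(y, forward(x, params)[2])
print(initial_error, final_error)
```

Each pass through the loop is one forward propagation followed by one backward pass and one weight update, so the printed error after training should be lower than the starting error.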

We repeat this update for multiple epochs until the loss converges to a minimum (ideally the global minimum, although gradient descent does not guarantee reaching it). In this plain (batch) gradient descent, one epoch computes the error over the whole dataset using forward propagation and updates the weights once during backpropagation. This becomes too slow to converge as the size of the dataset increases. So, there are several improvements over conventional gradient descent, such as stochastic GD, mini-batch GD, AdaGrad, AdaDelta, and Adam.
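The difference between one update per epoch and several updates per epoch can be sketched on a deliberately simpler model than the network above: a single weight w fitted to y = 3x with squared error. This swap is made only to keep the example short. The function name, data, and hyperparameters are hypothetical.

```python
import random


def train(xs, ys, batch_size, epochs, lr, seed=0):
    """Gradient descent on y = w * x; returns the weight and the update count."""
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of 1/2 * (w*x - y)^2 over the current batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates


# Hypothetical noise-free data generated from y = 3x
xs = [i / 10 for i in range(1, 11)]
ys = [3.0 * x for x in xs]

# Batch GD: the whole dataset per update, so one update per epoch
w_full, n_full = train(xs, ys, batch_size=len(xs), epochs=20, lr=0.5)
# Mini-batch GD: two examples per update, so five updates per epoch
w_mini, n_mini = train(xs, ys, batch_size=2, epochs=20, lr=0.5)
print(w_full, n_full)
print(w_mini, n_mini)
```

Setting batch_size=1 in this sketch gives stochastic gradient descent. Adaptive methods such as AdaGrad, AdaDelta, and Adam additionally rescale the step size per parameter and are not shown here.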

Thank you for reading! I hope it helps you.