Vanishing Gradient Problem in Deep Learning

Santosh Thapa
3 min read · Nov 23, 2020

The artificial neural network is not actually a new concept, although its popularity is now growing exponentially. The earliest neural networks date back to the 1950s, but initially they were not very popular: they were built, yet they failed to achieve what was expected of them. A major reason was the use of the sigmoid activation function in each and every neuron. Using sigmoid as the activation function created what became known as the “Vanishing Gradient Problem”. In those days, activation functions like ReLU had not yet been invented, and because of this problem, along with the Exploding Gradient Problem, deep neural networks were abandoned for a long time.

The Vanishing Gradient Problem is a problem encountered in the early days of deep learning in which gradient descent effectively stops updating the weights of the earlier layers, because the gradients reaching those layers of a deep network become vanishingly small.

Maths behind Gradient Descent:

The output of the sigmoid function ranges from 0 to 1. However, the derivative of the sigmoid function ranges only from 0 to 0.25, and this causes a serious problem when the chain rule is applied during backpropagation in a deep neural network: the gradients shrink toward zero and effectively stop influencing the weight updates.
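As a quick sanity check, here is a minimal Python sketch (the use of NumPy and the sample range are my own choices, not from the original post) that evaluates the sigmoid and its derivative and confirms the derivative never exceeds 0.25:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid squashes any input into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10.0, 10.0, 1001)
print(sigmoid(x).min(), sigmoid(x).max())   # values stay strictly between 0 and 1
print(sigmoid_derivative(x).max())          # maximum is 0.25, reached at x = 0
```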

Here we can clearly see that the derivative of the sigmoid function itself lies in the range of 0 to 0.25.

Say I have a network with a single hidden layer, as given in the figure above, and say I did not achieve the desired accuracy, so I want to backpropagate through the network. For that, let's take one weight for now: the weight w₁₁² (the weight w₁₁ of the second layer), which I want to update with gradient descent using the chain rule, as written out below. After applying the chain rule, the derivative of the sigmoid function that appears in the product will lie in the range 0 to 0.25. Here I am just assuming values for the weights; in a real network they can take values around 0 to 0.25, so I am picking numbers in that range for illustration.
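To make the chain rule concrete, the update can be sketched as below. This is a generic sketch, not the exact notation of the original figure: I am assuming w₁₁² connects the first hidden unit to the output neuron, z is that neuron's pre-activation, O = σ(z) is its output, L is the loss, and η is the learning rate.

```latex
w_{11}^{(2)\,\text{new}}
  = w_{11}^{(2)\,\text{old}} - \eta \,\frac{\partial L}{\partial w_{11}^{(2)}},
\qquad
\frac{\partial L}{\partial w_{11}^{(2)}}
  = \frac{\partial L}{\partial O}
    \cdot \underbrace{\frac{\partial O}{\partial z}}_{\sigma'(z)\,\in\,(0,\;0.25]}
    \cdot \frac{\partial z}{\partial w_{11}^{(2)}}
```

Every sigmoid activation the gradient passes through on its way back contributes another factor of at most 0.25, so deeper paths multiply more of these small factors together.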

Now, as soon as I apply the update, the new value of the weight becomes 1.298, which does not look much affected here, since my network has just one hidden layer. Now imagine there were hundreds of hidden layers. In that case, if we use the sigmoid function, the derivatives will at some point become so small that w₁₁² (old) and w₁₁² (new) are exactly the same: the chain rule has no effect at all. This is the case we describe by saying the derivative has vanished, or been lost.
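As a rough illustration of why depth matters (the depths and the best-case value of 0.25 are assumptions for this sketch, not numbers from the original figure), multiplying even the largest possible sigmoid derivative across many layers drives the overall gradient toward zero:

```python
# Best case: every sigmoid derivative along the backpropagation path equals its maximum, 0.25.
best_case = 0.25
for depth in (1, 10, 50, 100):
    print(depth, best_case ** depth)
# depth 1   -> 0.25
# depth 10  -> ~9.5e-07
# depth 50  -> ~7.9e-31
# depth 100 -> ~6.2e-61  (the resulting weight update is effectively zero)
```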

So, in order to solve this problem, we can use other activation functions such as the ReLU activation function and the Leaky ReLU activation function.
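For contrast, here is a minimal sketch of ReLU and Leaky ReLU and their derivatives (the 0.01 slope for Leaky ReLU is a common default I am assuming, not a value from the post): the derivative is 1 for every positive input, so backpropagating through many such layers does not keep shrinking the gradient by a factor of at most 0.25.

```python
import numpy as np

def relu(x):
    # ReLU passes positive inputs through unchanged and zeroes out the rest.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Derivative is 1 for positive inputs and 0 otherwise: no 0-to-0.25 shrinking factor.
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU keeps a small slope for negative inputs instead of a hard zero.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Derivative is 1 for positive inputs and alpha for negative ones, so no unit is completely "dead".
    return np.where(x > 0, 1.0, alpha)
```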
