ReLU Activation Function and Leaky ReLU

Santosh Thapa
3 min read · Nov 24, 2020

An activation function decides whether a node should be activated on the basis of the weights and bias assigned to it. A non-linear activation function such as ReLU takes an input and produces a non-linear output. Before talking about the ReLU activation function, let me talk about why it is popular in comparison to earlier non-linear activation functions like the sigmoid and the hyperbolic tangent, and why ReLU is the most popular choice in deep learning today.

Well, the sigmoid and hyperbolic tangent were very popular non-linear activation functions in the earlier days. However, models trained with these activation functions suffered from the vanishing gradient problem: the derivatives multiplied together by the chain rule during backpropagation shrank toward zero as soon as the neural network became deep.
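As a rough illustration of why this happens (a sketch of mine, not from the original article): the derivative of the sigmoid is at most 0.25, so multiplying many such factors together during backpropagation shrinks the gradient geometrically, even in the most favorable case.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

# Chain rule through 10 sigmoid "layers": even taking the best-case
# derivative of 0.25 at every layer, the gradient factor collapses.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)
print(grad)  # 0.25**10 ≈ 9.5e-07 -- effectively vanished
```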

So, ReLU came into existence to solve this problem:

ReLU activation function: f(z) = max(0, z)

Since ReLU outputs the value z itself when z is greater than zero and outputs zero when z is less than or equal to zero, it is considered an excellent activation function: there is no heavy or complicated math involved, so a model using ReLU takes less time to train. Similarly, it is a non-saturating activation function, unlike the sigmoid and hyperbolic tangent.

One thing to note about the ReLU activation function is that it is linear for half of the input domain (the positive values) and zero for the other half, so the function as a whole is non-linear. It is therefore also referred to as a piecewise linear function or a hinge function.
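As a minimal NumPy sketch (the function and variable names here are mine, not the article's), ReLU simply clips negative values to zero, which is why it is so cheap to compute and piecewise linear:

```python
import numpy as np

def relu(z):
    """ReLU: identity for positive inputs, zero for everything else."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))  # [0.   0.   0.   0.5  3. ]
```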

One of the most important advantages of using the ReLU activation function is the sparsity it induces in the network's activations. In simple terms, a matrix in which most entries are 0 is called a sparse matrix. Sparsity in a deep learning model is often associated with good predictive power and a lower chance of overfitting. For example, take a CNN (Convolutional Neural Network) trained to detect a dog's face. A neuron that has learned to identify a dog's eye should not be activated when we show it a picture of something else, say a house. Since ReLU outputs zero for all negative inputs, such neurons stay inactive on inputs they were not trained for, which keeps the network's activations sparse.
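A quick sketch of that sparsity effect (the random numbers below are purely for illustration): passing a batch of pre-activations through ReLU zeroes out every negative entry, so a large fraction of the resulting activation matrix is exactly zero.

```python
import numpy as np

np.random.seed(0)
# Hypothetical pre-activations centred around zero, so roughly half are negative.
pre_activations = np.random.randn(4, 6)
activations = np.maximum(0.0, pre_activations)  # ReLU

sparsity = np.mean(activations == 0.0)
print(f"Fraction of exact zeros after ReLU: {sparsity:.2f}")
```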

However, one problem with the ReLU activation function is that when the input is less than zero, both the output and the gradient are zero, so the neuron stops updating and becomes passive. This condition is called the dying ReLU problem. To solve it, we have another activation function called Leaky ReLU.

Leaky ReLU is nothing but a modification of the ReLU activation function in which a small slope is assigned to negative inputs, ensuring that a neuron never becomes completely passive when its input drops below zero. This slope can be thought of as a "leak" that avoids the passivity of the function.

Leaky ReLU activation function: f(z) = z for z > 0 and αz for z ≤ 0, where α is a small positive slope
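A minimal sketch of Leaky ReLU alongside plain ReLU (the slope of 0.01 is a common default, but it is a hyperparameter, not a value from the article): for negative inputs the gradient of ReLU is exactly zero, which is what makes a neuron "die", while Leaky ReLU keeps a small non-zero gradient so the neuron can still update.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive inputs, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)    # zero gradient for z <= 0 -> dying ReLU

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # small but non-zero gradient for z <= 0

z = np.array([-2.0, -0.1, 0.5, 2.0])
print(leaky_relu(z))       # -0.02, -0.001, 0.5, 2.0
print(relu_grad(z))        # 0, 0, 1, 1
print(leaky_relu_grad(z))  # 0.01, 0.01, 1, 1
```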
