The vanishing gradient problem is commonly encountered when training Artificial Neural Networks (ANNs). Some activation functions, such as Sigmoid (output range 0 to 1) or Tanh (output range -1 to 1), saturate: once the input moves far from zero, even a large change in the input produces only a tiny change in the output, so the derivative becomes very small. These activation functions can work acceptably in shallow networks with only a few layers, but in a multi-layer network the chain rule multiplies these small derivatives together layer after layer, and the gradient can become too small for effective training.
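
To make the effect concrete, here is a minimal NumPy sketch (an illustration, not code from the original text): the sigmoid derivative peaks at 0.25, so the product of per-layer derivatives taken by the chain rule shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

# Backpropagation multiplies one such derivative per layer (chain rule).
# Even in the best case (x = 0), the gradient shrinks by a factor of 4
# per layer; with weight magnitudes below 1 it shrinks even faster.
x = 0.0
for depth in [1, 5, 10, 20]:
    grad_factor = sigmoid_derivative(x) ** depth
    print(f"{depth:2d} layers: gradient factor ~ {grad_factor:.2e}")
```

Running this prints a factor of 0.25 for one layer but roughly 9.1e-13 for twenty layers, which is effectively zero for training purposes.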

Notes:

- Feed-Forward Neural Networks (FFNNs) and Recurrent Neural Networks (RNNs) often suffer from the vanishing gradient problem.
- There are typically two causes of the vanishing gradient problem:
  - A poor choice of neural network architecture or activation function.
  - Small weights at initialization, which can cause gradients to shrink exponentially during training.
- Solutions (see the sketch after this list):
  - Change the activation function to an alternative such as ReLU (Rectified Linear Unit).
  - Switch to a more suitable architecture such as Residual Networks (ResNets), LSTM, or GRU.
  - Proper weight initialization.
  - Batch Normalization.
  - Gradient Clipping.
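
As referenced above, here is a minimal PyTorch sketch combining several of the listed mitigations: ReLU activations, Kaiming (He) weight initialization, Batch Normalization, and gradient clipping. The library choice, the `DeepNet` class, and all hyperparameters are illustrative assumptions, not from the original text.

```python
import torch
import torch.nn as nn

# Hypothetical deep MLP illustrating several mitigations from the list above.
class DeepNet(nn.Module):
    def __init__(self, dim=64, depth=10):
        super().__init__()
        layers = []
        for _ in range(depth):
            linear = nn.Linear(dim, dim)
            # Proper weight initialization: Kaiming/He init is suited to ReLU.
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
            # Batch Normalization keeps activations in a well-scaled range;
            # ReLU has derivative 1 for positive inputs, so gradients do not
            # shrink multiplicatively the way they do with sigmoid.
            layers += [linear, nn.BatchNorm1d(dim), nn.ReLU()]
        layers.append(nn.Linear(dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = DeepNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 64), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Gradient clipping: cap the global gradient norm before the update step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

For recurrent networks, replacing plain RNN cells with LSTM or GRU cells (or adding residual connections in deep feed-forward stacks) addresses the same problem architecturally, by giving gradients a path that is not repeatedly multiplied by small derivatives.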
