Activation functions are mathematical formulas that determine the output of a neural network by introducing nonlinearity, producing an output from the set of input values fed to a layer.
These functions are attached to each neuron in the network and determine whether the neuron should be activated, based on whether its input is relevant to the model's prediction.
In a neural network, inputs are fed into the neurons of the input layer. Each input is multiplied by a weight, and the weighted sum gives the neuron's output, which is passed to the next layer.
The activation function is a mathematical "gate" between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron's output on or off, depending on a rule or threshold.
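As a minimal NumPy sketch of the "gate" idea above (the function and variable names here are illustrative, not from any particular library), a neuron can compute a weighted sum and pass it through a step activation:

```python
import numpy as np

def step(x, threshold=0.0):
    """Heaviside step activation: fires (1.0) only above the threshold."""
    return np.where(x > threshold, 1.0, 0.0)

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through the step 'gate'."""
    return step(np.dot(inputs, weights) + bias)

# Weighted sum: 0.5*0.4 + (-1.0)*0.3 + 2.0*0.2 = 0.3 > 0, so the neuron fires
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.2])
print(neuron(x, w))  # 1.0
```

The step function is the simplest possible gate; the smooth functions discussed below (sigmoid, tanh, ReLU) replace it so that gradients can flow during training.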
Choosing the correct activation function improves an ANN's ability to learn and generalize, as well as its speed of convergence. Activation functions are often chosen through examination and experimentation, but they can generally be selected based on the model type:

Linear: does not change the weighted sum of the inputs in any way; it returns the value directly

Logistic (Sigmoid): squashes any input into the range (0, 1)

Softmax: the most popular activation function for the output layer

If we encounter dead neurons in our network, the Leaky ReLU (Rectified Linear Unit) function is the best choice

Due to the vanishing gradient problem, the Sigmoid and Tanh (Hyperbolic Tangent) functions are no longer generally used. ReLU (Rectified Linear Unit) and its variants are now the default activation functions for the hidden layers of ANNs.

Sigmoid can be used as an alternative to ReLU (Rectified Linear Unit) in some classification problems.

The ReLU (Rectified Linear Unit) function should only be used in the hidden layers.

Softmax is generally used in the output layer: it normalizes the raw outputs into a vector of values that sum to 1.0, which can be interpreted as probabilities of class membership.

ReLU (Rectified Linear Unit) is the best option for hidden layers; if problems arise, or for optimization, it can be replaced with one of its variants.
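The guidance above (ReLU in the hidden layers, softmax at the output) can be sketched in NumPy. This is an illustrative toy forward pass with made-up weights, not a reference implementation:

```python
import numpy as np

def relu(x):
    """Passes positive values through, zeroes out negatives."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope for negatives (avoids dead neurons)."""
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    """Squashes any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Converts raw scores into a probability distribution."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Toy network: ReLU in the hidden layer, softmax at the output
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights
W2 = rng.normal(size=(2, 4))   # hidden -> output weights
hidden = relu(W1 @ x)
probs = softmax(W2 @ hidden)
print(probs, probs.sum())      # the class "probabilities" sum to 1.0
```

Swapping `relu` for `leaky_relu` in the hidden layer is the one-line change suggested above when dead neurons become a problem.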
Considerations regarding activation functions:

Activation functions are often categorized by their output range and the shape of their curves.

Activation Function’s Range of output: the activation function helps normalize the output of each neuron to a range such as 0 to 1 or -1 to 1.

Activation Function’s Equations

Model type and application, which guide activation function selection
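The output ranges mentioned above can be checked empirically; a small NumPy sketch (illustrative only) showing that sigmoid stays within (0, 1) while tanh stays within (-1, 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10.0, 10.0, 1001)
s, t = sigmoid(x), np.tanh(x)
print(s.min(), s.max())  # stays strictly inside (0, 1)
print(t.min(), t.max())  # stays inside [-1, 1]
```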
Problems with activation functions:

Vanishing gradients are a common problem encountered during neural network training. Some activation functions, such as sigmoid, have a small output range (0 to 1), so even a huge change in the input produces only a small change in the output, and the derivative is correspondingly small. Such activation functions are therefore only suitable for shallow networks with a few layers; when they are used in a multilayer network, the gradient propagated back through many layers may become too small for effective training.
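A minimal NumPy sketch of why this happens: the sigmoid derivative never exceeds 0.25, and by the chain rule the gradient reaching early layers is (roughly) a product of per-layer derivatives, so it shrinks exponentially with depth. The setup here is a deliberately simplified illustration, not a full backpropagation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Best case for sigmoid: every pre-activation sits at x = 0, where the
# derivative is at its maximum of 0.25. Across 10 layers the chain-rule
# product is 0.25**10, already under one millionth.
pre_activations = np.zeros(10)
grad = np.prod(sigmoid_grad(pre_activations))
print(grad)  # 0.25**10, roughly 9.5e-07
```

Away from zero the derivatives are even smaller, so real deep sigmoid networks vanish faster than this best-case bound.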

Exploding gradients are situations in which large error gradients accumulate during training, resulting in huge updates to the neural network's weights. With exploding gradients the network can become unstable and training cannot be completed. The weights' values can grow to the point where they overflow, resulting in NaN loss values.
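The overflow-to-NaN failure mode can be sketched in a few lines of NumPy; the per-layer factor of 3.0 is an arbitrary stand-in for a weight with magnitude greater than 1:

```python
import numpy as np

# Backpropagating through many layers whose weights exceed 1 in magnitude
# multiplies the gradient at every step until it overflows.
grad = np.float32(1.0)
with np.errstate(over="ignore", invalid="ignore"):
    for _ in range(300):
        grad *= np.float32(3.0)   # per-layer factor from a large weight
    loss_update = grad - grad     # inf - inf produces NaN
print(grad)         # inf: the float32 gradient has overflowed
print(loss_update)  # nan: the NaN loss values described above
```

Gradient clipping and careful weight initialization are the standard remedies for exactly this growth.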