It's a Gradient (Calculus) based Optimization algorithm which involves computing gradients with respect to the model parameters (such as regression coefficients) at every iteration for the entire data set and converges directly to minima.

Process:

- Compute the slope (gradient). It's the first-order derivative of the function at the current point.
- Move down the slope from the current point by a computed amount.

Notes:

- Batch Gradient Descent is useful for convex or relatively smooth error manifolds.
- It decreases the cost over the epochs, because it's averaging over all the gradients of training data for a single step.
- It calculates for all the examples for every step of Gradient Descent, so it's not suitable for very large datasets. In this cases Stochastic Gradient Descent(SGD) is suggested.

- Convergence to global minima is theoretical guaranteed.
- In some cases faster compared to Stochastic Gradient Descent(SGD) due to decreased update frequency.
- More stable error gradient due to decreased update frequency.
- Implementing parallel processing is simple because calculating errors prediction and model update can be separate.

- Possible premature convergence because of stable error gradient.
- At the end of each epoch accumulating prediction errors across all training should be calculated.
- High memory requirement, as all dataset should be available in memory.
- It can be slow in very large datasets, as it's computationally expensive.

Interactive Graph

Table Of Contents