eXtreme Gradient Boosting (XGBoost)

XGBoost, like the Random Forest algorithm, builds an ensemble of decision trees.

Boosting: It uses gradient-boosted decision trees and, instead of training the models in parallel, trains them sequentially. With this sequential training, each decision tree can learn from the errors produced by the previous model. This type of sequential training is called boosting.

Gradient Boosting: boosting using weak learners (very simple models that perform only slightly better than random chance). The algorithm starts with an initial weak learner, and each subsequent model targets the errors produced by the previous decision tree. This continues until no further improvement can be made, resulting in a final strong-learner model.
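
The error-correcting loop can be made concrete with a short hand-rolled sketch. This is a toy illustration under assumed ingredients (scikit-learn decision stumps, synthetic data, squared-error loss), not XGBoost's actual implementation; it only shows how each new weak learner fits the residuals left by the current ensemble.

```python
# Toy gradient boosting sketch: each weak learner fits the current residuals.
# Data, stump depth, and learning rate are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # naive initial model: predict the mean
ensemble = []

for _ in range(100):
    residuals = y - prediction                      # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1)      # weak learner
    stump.fit(X, residuals)                         # new tree targets those errors
    prediction += learning_rate * stump.predict(X)  # shrunken contribution is added
    ensemble.append(stump)

print("final training MSE:", np.mean((y - prediction) ** 2))
```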

Info

XGBoost is widely regarded as one of the most accurate predictive techniques for structured (tabular) data.

The boosting loop:

  • Start with a naive model
  • Make predictions
  • Calculate the loss
  • Train a new model on the errors
  • Add the model to the ensemble and repeat
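
In practice, this whole loop is handled by the library. A minimal sketch of fitting a classifier through the scikit-learn style wrapper, assuming the xgboost and scikit-learn packages are installed (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each boosting round adds one more tree that corrects the current ensemble's errors.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```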

Hyper-Parameters:

  • learning_rate: step size shrinkage applied to each new tree added to the model
  • max_depth: maximum depth (number of levels) of each individual weak learner
  • subsample: fraction of observations (rows) used to train each tree
  • colsample_bytree: fraction of columns (features) used to train each tree
  • n_estimators: number of trees to add to the model
  • objective: the loss function to be optimized
  • gamma: minimum loss reduction required for a node to split; a higher gamma leads to fewer splits
  • alpha: L1 regularization on leaf weights
  • lambda: L2 regularization on leaf weights
Tip

A small learning rate combined with a large number of estimators often improves accuracy, at the cost of additional computation (an illustrative configuration follows below).
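
Putting the hyper-parameter list and the Tip together, here is an illustrative configuration through the scikit-learn style wrapper. All values are made up for the example, and note that alpha/lambda are exposed as reg_alpha/reg_lambda in this wrapper:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,             # many trees...
    learning_rate=0.05,            # ...each taking a small step (the Tip above)
    max_depth=4,                   # depth of each individual weak learner
    subsample=0.8,                 # fraction of rows used per tree
    colsample_bytree=0.8,          # fraction of columns used per tree
    gamma=1.0,                     # minimum loss reduction required to split
    reg_alpha=0.1,                 # L1 regularization on leaf weights (alpha)
    reg_lambda=1.0,                # L2 regularization on leaf weights (lambda)
    objective="binary:logistic",   # loss function to optimize
)
```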


Notes:

  • XGBoost computes similarity scores for candidate leaves and the preceding (parent) node, and uses the resulting gain to decide which split becomes the root of a subtree and which nodes become its children (a worked toy calculation follows this list).
  • It can solve both classification and regression-based problems.
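
To make the similarity-score idea concrete, here is a small toy calculation for a regression-style split. The formulas (similarity = (sum of residuals)² / (number of residuals + lambda), gain = left + right − parent) follow the standard XGBoost derivation for squared-error loss; the residual values are made up:

```python
import numpy as np

lam = 1.0  # lambda: L2 regularization on leaf weights

def similarity(residuals):
    # Similarity score for squared-error loss: (sum of residuals)^2 / (count + lambda)
    return np.sum(residuals) ** 2 / (len(residuals) + lam)

# Made-up residuals at a node, split by some candidate threshold
parent = np.array([-10.5, 6.5, 7.5, -7.5])
left, right = parent[:1], parent[1:]

gain = similarity(left) + similarity(right) - similarity(parent)
print("gain of this split:", round(gain, 2))
# The candidate split with the largest gain wins; splits whose gain falls below
# gamma are pruned away.
```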

Advantages:

  • Compared to plain Gradient Boosting, it has built-in regularization (both L1 and L2), which helps prevent the model from overfitting.
  • It's capable of parallel processing at the node level, allowing it to be faster than Gradient Boosting (GB).
  • It's great at handling missing values.
  • It supports built-in cross-validation, which makes it easy to find the optimal number of boosting iterations in a single run (see the sketch after this list).
  • Rather than greedily stopping at the first split that shows a negative loss reduction, it grows trees to the specified max_depth and then prunes back splits whose gain turns out negative.
  • It's one of the most flexible algorithms for structured data and works well with both large and small datasets; however, it doesn't perform well on very sparse or unstructured data.
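
As mentioned in the list above, the built-in cross-validation can pick the number of boosting rounds in a single run. A minimal sketch using the native API (the dataset, metric, and parameter values are illustrative; with pandas installed, xgb.cv returns one row per boosting round kept):

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 3, "learning_rate": 0.1}

# 5-fold cross-validation for up to 500 rounds, stopping early once the
# held-out log-loss has not improved for 20 consecutive rounds.
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=500,
    nfold=5,
    metrics="logloss",
    early_stopping_rounds=20,
    seed=42,
)
print("optimal number of boosting rounds:", len(cv_results))
```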

Disadvantages:

  • Due to its mechanism of sequentially learning from the errors of previous models, it is sensitive to outliers.
  • If training is not stopped in time, it is prone to overfitting (see the early-stopping sketch below).
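
The usual safeguard against that overfitting risk is early stopping on a held-out validation set. A minimal sketch, assuming a recent xgboost version where early_stopping_rounds is a constructor argument of the scikit-learn wrapper (older versions pass it to fit instead); the dataset and values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Stop adding trees once validation log-loss has not improved for 20 rounds,
# rather than training all 1000 estimators and risking overfitting.
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

print("best iteration:", model.best_iteration)
```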