Random Forest builds a ‘forest’ of many decision trees, each trained on a different subset of the data to predict the outcome. It uses the bagging technique and feature randomness when building each individual tree to create an uncorrelated forest of decision trees.
To generate a result, each tree in the forest produces a prediction for a given set of features, and the individual predictions are then aggregated into a final prediction.
Random Forests can be used for both classification and regression tasks:
For classification tasks, the output of the random forest is the class selected by the most trees (majority vote).
For regression tasks, the mean (average) of the individual trees’ predictions is returned.
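The two aggregation modes above can be sketched with scikit-learn (assumed available here; the toy datasets and parameter choices are illustrative, not from the original text):

```python
# Sketch, assuming scikit-learn is installed; data is synthetic.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the final class is the majority vote across trees.
Xc, yc = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:3]))

# Regression: the final prediction is the mean of the trees' predictions.
Xr, yr = make_regression(n_samples=200, n_features=8, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))
```

The same `n_estimators` hyperparameter controls the number of trees in both cases.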
The training dataset is randomly sampled with replacement to create one bootstrap sample per tree in the forest. The number of trees is set via a hyperparameter, and at each node split the optimal feature is chosen from a random subset of the features.
Random Forest uses bootstrap replicas — subsets of size N drawn with replacement from the original training set — to train the ensemble members (decision trees).
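Drawing a bootstrap replica amounts to sampling N row indices with replacement; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)  # toy dataset with N = 10 rows
N = X.shape[0]

# Sample N row indices with replacement: some rows repeat, others are left out.
idx = rng.integers(0, N, size=N)
bootstrap_sample = X[idx]

# Rows never drawn are "out-of-bag" for this tree and can be used to
# estimate generalization error without a separate validation set.
oob_mask = ~np.isin(np.arange(N), idx)
print(bootstrap_sample.shape, oob_mask.sum())
```

Each tree in the ensemble receives its own replica drawn this way, which is what decorrelates the trees.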
Random Forest works well on large, high-dimensional datasets, as the algorithm inherently performs feature selection.
It is relatively insensitive to outliers.
The individual trees in a Random Forest can overfit the training data, although this can be mitigated to some degree with pruning or by limiting tree complexity.
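In scikit-learn (assumed here), overfitting is typically curbed with complexity hyperparameters such as `max_depth`, `min_samples_leaf`, or `ccp_alpha` (cost-complexity pruning); the specific values below are illustrative:

```python
# Sketch: limiting tree complexity to reduce overfitting; values are examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Shallower trees and larger leaves reduce variance at the cost of some bias.
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # cap the depth of each tree
    min_samples_leaf=5,   # require at least 5 samples per leaf
    random_state=0,
)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```

Cross-validation, as above, is a common way to check whether a given complexity setting generalizes.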
It provides a low level of interpretability compared with a single decision tree; however, extracting “feature importance” scores can improve this.
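In scikit-learn (an assumption here, as the text names no library), a fitted forest exposes impurity-based importances via the `feature_importances_` attribute:

```python
# Sketch: extracting feature importances from a fitted forest (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores sum to 1.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

Note that impurity-based importances can be biased toward high-cardinality features; permutation importance is a common alternative check.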