Decision Trees

A decision tree is a tree-based model with a tree structure in which each internal node holds a conditional statement that determines which path a sample takes, and each leaf node holds the output assigned to the samples that reach it. It is most often used for classification, but it also supports regression.

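As a rough sketch of what this looks like in practice (assuming scikit-learn is available; the iris dataset is only an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each internal node of the fitted tree holds a condition such as
# "petal length <= 2.45"; a sample follows the branch matching the
# condition until it reaches a leaf, whose class is the prediction.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

print(clf.predict(X[:5]))   # class labels chosen at the leaves
```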

Types:

  • Categorical Variable Decision Tree: a decision tree whose target variable is categorical.
  • Continuous Variable Decision Tree: a decision tree whose target variable is continuous (a regression tree); a sketch contrasting the two follows this list.
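
A minimal sketch of the two flavours, assuming scikit-learn and a toy one-feature dataset invented purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Categorical target -> classification tree: leaves hold a class label.
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
print(clf.predict([[2.5], [10.5]]))   # -> [0 1]

# Continuous target -> regression tree: leaves hold a numeric value
# (the mean of the training targets that reach the leaf).
y_cont = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.0])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_cont)
print(reg.predict([[2.5], [10.5]]))
```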

Main parameters:

  • Maximum tree depth
  • Minimum samples per leaf node
  • Measure of impurity (splitting criterion), e.g. Gini impurity or entropy, used to score candidate splits. The more a feature decreases the impurity, the more important it is considered in making decisions within the tree (typical settings are shown in the sketch after this list).
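
In scikit-learn terms these map onto max_depth, min_samples_leaf and criterion; a minimal sketch with purely illustrative values:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values chosen only for illustration; good settings depend on the data.
clf = DecisionTreeClassifier(
    max_depth=4,           # maximum tree depth: stop splitting beyond 4 levels
    min_samples_leaf=5,    # minimum samples per leaf: each leaf keeps >= 5 samples
    criterion="gini",      # impurity measure used to score candidate splits
    random_state=0,
)
```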

Concepts:

  • Root Node: represents the entire population or sample; it is the first node to be split into two or more homogeneous sets.
  • Internal Nodes: represent the features that the splitting conditions test.
  • Branches (Decision Nodes): when a sub-node splits into further sub-nodes, it is called a decision node; decision nodes represent decision rules.
  • Leaf / Terminal Node: a node that does not split further; leaves represent the outcomes.
  • Splitting: the process of dividing a node into two or more sub-nodes.
  • Stump: a decision tree with only one decision node (the root) and two leaves (see the sketch after this list).
  • Pruning: removing sub-nodes of a decision node; it can be seen as the opposite of splitting.
  • Branch / Sub-Tree: a subsection of the entire tree.
  • Parent and Child Node: a node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
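
Most of these terms can be read off a fitted tree directly. A small sketch, assuming scikit-learn (the iris data and depths are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# A stump: a single decision node (the root) and two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(stump, feature_names=iris.feature_names))

# A slightly deeper tree: the printout shows the root at the top, internal
# (decision) nodes with their splitting rules, and leaves with the outcomes.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=iris.feature_names))
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```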

Advantages:

  • Very intuitive and interpretable.
  • Easy to implement.
  • Fast to train.
  • Fast inference.
  • Doesn't require normalizing the dataset.
  • It's suitable for both binary and multiclass classification.
  • It's not strongly affected by outliers: during splitting, a sample is routed by whether a feature value is above or below a threshold, so the tree uses the ordering of values instead of their magnitude, and an outlier simply ends up on one side of a split (a small demonstration follows this list).
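
One way to see the "no normalization needed" and "ordering instead of magnitude" points: a split only asks whether a feature value is above or below a threshold, so any order-preserving rescaling of a feature leaves the learned partitions unchanged. A minimal sketch, assuming scikit-learn and using a log transform purely as an illustrative monotonic rescaling (predictions are compared on the training data for simplicity):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train on the raw features and on a monotonically rescaled copy.
clf_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
clf_log = DecisionTreeClassifier(random_state=0).fit(np.log1p(X), y)

# Splits depend only on the ordering of feature values, so both trees
# partition the training data identically and agree on its labels.
assert (clf_raw.predict(X) == clf_log.predict(np.log1p(X))).all()
```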

Disadvantages:

  • A single tree is not well suited to complex data: it easily overfits, memorising noise in the training set, and small changes in the data can produce a very different tree (see the sketch below).
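
A sketch of the overfitting point on a synthetic noisy dataset (all parameter values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic problem: flip_y=0.2 corrupts 20% of the labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorises the training set, including the label noise,
# which typically shows up as a large gap between train and test accuracy.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("full tree    train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))

# Limiting depth (or pruning) trades training fit for better generalisation.
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("depth-4 tree train/test:", small.score(X_tr, y_tr), small.score(X_te, y_te))
```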

Notes:

  • Decision Trees are the base learners used in Random Forests.
  • Decision Trees can predict both categorical variables (classification) and continuous values (regression).
  • The impurity measure is used to decide the best way to split the data at each node of the tree, so choosing the measure of impurity is a critical decision when constructing a decision tree (a worked example follows this list).
  • The goal of splitting nodes based on impurity measures is to maximize the homogeneity of the resulting child nodes, leading to more accurate and predictive models.
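
To make the impurity-based splitting concrete, here is a small sketch of Gini impurity and the size-weighted impurity decrease used to score a candidate split (these are the standard textbook formulas, not tied to any particular library):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children.
    Splits with a larger decrease produce more homogeneous child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = np.array([0, 0, 0, 1, 1, 1])
print(impurity_decrease(parent, parent[:3], parent[3:]))   # perfect split -> 0.5
print(impurity_decrease(parent, parent[:1], parent[1:]))   # weak split -> smaller value
```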