Decision Tree

A Decision Tree is a tree-like graph in which each internal node picks an attribute and asks a question about it, each edge represents an answer to that question, and each leaf represents the actual output or class label. The decision tree algorithm can solve both regression and classification problems.
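
To make this concrete, here is a minimal sketch using scikit-learn (an assumed dependency, not something the text prescribes); the tiny toy datasets are invented purely for illustration.

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Classification: predict a class label from two numeric attributes.
    # In this toy data the label simply follows the second attribute,
    # so a single split is enough.
    X = [[0, 0], [1, 1], [0, 1], [1, 0]]
    y = ["no", "yes", "yes", "no"]
    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[0.9, 0.9]]))  # expected: ['yes']

    # Regression: the same tree machinery predicts a numeric target instead.
    X_r = [[1], [2], [3], [4]]
    y_r = [1.1, 1.9, 3.2, 3.9]
    reg = DecisionTreeRegressor().fit(X_r, y_r)
    print(reg.predict([[2.5]]))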

Introduction

Terminology:

  1. Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
  2. Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, then it is called a decision node.
  4. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  5. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
  6. Splitting: It is a process of dividing a node into two or more sub-nodes.
  7. Pruning: Pruning is the selective removal of branches that contribute little predictive power. The goal is to reduce the size of the tree and guard against overfitting; a sketch of post-pruning follows this list.
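
As a concrete illustration of pruning (item 7), the sketch below uses cost-complexity post-pruning as exposed through scikit-learn's ccp_alpha parameter; the iris dataset and the alpha value are stand-ins chosen for illustration, not recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unpruned tree grows until its leaves are pure.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # A larger ccp_alpha removes more branches; the weakest links go first.
    pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

    print("leaves before/after pruning:",
          full.get_n_leaves(), "/", pruned.get_n_leaves())
    print("test accuracy before/after:",
          full.score(X_test, y_test), "/", pruned.score(X_test, y_test))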


Explanation

The goal of using a Decision Tree is to create a model that can predict the class or value of the target variable by learning simple decision rules inferred from prior data (the training data).
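
The sketch below illustrates the "simple decision rules" idea: a small tree is fitted and the rules it learned are printed as text. scikit-learn and the iris dataset are assumptions made for illustration; any labelled dataset would do.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # Each path from the root to a leaf is one human-readable decision rule.
    print(export_text(model, feature_names=list(iris.feature_names)))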

Advantages:

  • The output is easy to understand and interpret
  • Can combine numeric and categorical attributes
  • Robust to outliers
  • Fast at prediction time (once the rules have been learned)

Disadvantages:

  • Prone to overfitting (a mitigation sketch follows this list)
  • Cannot extrapolate beyond the range of the attribute values seen in the training data
  • Unstable (a small perturbation of the input can produce a much larger perturbation of the output tree)
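
Regarding overfitting, a common mitigation is to constrain tree growth (pre-pruning). The sketch below compares an unconstrained tree with a constrained one; the hyperparameter values are illustrative assumptions, not tuned recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                     random_state=0).fit(X_train, y_train)

    # The deep tree fits the training data (near-)perfectly; the constrained
    # tree is far simpler and typically generalizes as well or better.
    for name, tree in [("deep", deep), ("shallow", shallow)]:
        print(name, "train:", tree.score(X_train, y_train),
              "test:", tree.score(X_test, y_test))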

Examples

How to

In a decision tree, prediction for a given record starts at the root node. The algorithm compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the matching branch and jumps to the next node.

At the next node, the algorithm again compares the record's attribute value and moves further down, repeating the process until it reaches a leaf node. The complete algorithm can be divided into the following steps (a from-scratch sketch follows the list):

  • Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
  • Step-2: Find the best attribute in the dataset using an attribute selection measure (for example, information gain or the Gini index).
  • Step-3: Divide S into subsets, one for each possible value of the best attribute.
  • Step-4: Generate the decision tree node that tests the best attribute.
  • Step-5: Recursively build new subtrees from the subsets created in Step-3. Continue until the nodes cannot be split further; each such final node is a leaf node.
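
The from-scratch sketch below maps these steps onto code for categorical attributes, using information gain as the attribute selection measure (one common choice, assumed here rather than prescribed by the text); the toy rows and attribute names are invented for illustration.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, attr):
        """Entropy reduction achieved by splitting the rows on one attribute."""
        total = entropy(labels)
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            total -= (len(subset) / len(labels)) * entropy(subset)
        return total

    def build_tree(rows, labels, attributes):
        # Step-5 stopping cases: a pure node, or no attributes left, becomes
        # a leaf holding the majority class.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Step-2: pick the best attribute by the selection measure.
        best = max(attributes, key=lambda a: information_gain(rows, labels, a))
        # Step-3 and Step-4: generate a node that tests the best attribute,
        # with one child subtree per observed value (built recursively, Step-5).
        node = {best: {}}
        for value in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == value]
            node[best][value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx],
                [a for a in attributes if a != best])
        return node

    def predict(node, row):
        # Walk from the root, following the branch that matches the record's
        # value. (Unseen attribute values would need separate handling.)
        while isinstance(node, dict):
            attr = next(iter(node))
            node = node[attr][row[attr]]
        return node

    rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
            {"outlook": "rain", "windy": "yes"}, {"outlook": "overcast", "windy": "no"}]
    labels = ["play", "stay", "stay", "play"]
    # Step-1: the root call receives the complete dataset.
    tree = build_tree(rows, labels, ["outlook", "windy"])
    print(tree)                                                # the learned rules
    print(predict(tree, {"outlook": "rain", "windy": "yes"}))  # -> 'stay'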
