Impurity

An impure node is a node that contains data points with mixed class labels, i.e., the data at that node has not yet been perfectly separated into distinct categories.

Introduction

In decision tree algorithms, quantifying impurity is a crucial step in deciding how to split the nodes of the tree. Impurity measures help identify the best feature and the best threshold for splitting the data at each node. The most common impurity measures are Gini impurity, entropy (from which information gain is computed), and mean squared error (for regression trees). Below, I'll explain each of these impurity measures and how they are used in decision trees.

Explanation

  • Gini impurity measures how often a randomly chosen element of the dataset would be misclassified if it were labeled according to the class distribution at the node. It is used by the CART (Classification and Regression Trees) algorithm for classification trees. A Gini impurity of 0 indicates perfect purity (all elements belong to a single class), while higher values indicate more mixed classes. In binary classification a value near 1 is impossible, since the maximum Gini impurity is 0.5. In multi-class scenarios, however, the value can approach (though never reach) 1, which means the data at that node is more mixed and less pure, with a more uniform distribution of classes.
  • Mean squared error (MSE) is used in regression trees; it measures the impurity of a node by the variance of the target variable within that node.
  • Entropy is used for classification; it measures the disorder of the class distribution at a node, and the reduction in entropy produced by a split is the information gain used to choose splits. All three measures are illustrated in the sketch below.
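
To make the three measures concrete, here is a minimal sketch in Python (the function names and toy data are illustrative assumptions, not part of any particular library; only NumPy is required):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the classes present at the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mse_impurity(targets):
    """Regression impurity: variance of the target values at the node."""
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - targets.mean()) ** 2)

# A perfectly pure node vs. a maximally mixed binary node:
print(gini_impurity(["yes", "yes", "yes"]))       # 0.0  (pure)
print(gini_impurity(["yes", "no", "yes", "no"]))  # 0.5  (binary maximum)
print(entropy(["yes", "no", "yes", "no"]))        # 1.0  (binary maximum)
print(mse_impurity([3.0, 5.0, 7.0]))              # ~2.67 (variance of targets)
```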

Examples

How to

Impurity can be quantified using:

Gini impurity (binary case): 1 - (probability of yes)^2 - (probability of no)^2
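As a quick check of this formula (the counts here are made up for illustration): a node holding 6 "yes" and 2 "no" samples has Gini = 1 - (6/8)^2 - (2/8)^2 = 1 - 0.5625 - 0.0625 = 0.375, which sits between perfect purity (0) and the binary maximum (0.5).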

Weighted Gini = 1 - Σ (wi * ni / tc)^2

where the sum runs over the classes present at the node:

  • wi - weight assigned to the i-th class (the weight carried by each sample of that class)
  • ni - number of samples of the i-th class within the node (can be 0 if the class is not present)
  • tc - total weight of all samples in the node (the sum of wi * ni over all classes)
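
A minimal sketch of this weighted form, assuming NumPy and treating wi as a per-class weight as defined above (the function name and example numbers are my own):

```python
import numpy as np

def weighted_gini(class_counts, class_weights):
    """Weighted Gini = 1 - sum((wi * ni / tc)^2) over the classes in a node.

    class_counts  : ni, number of samples of each class at the node
    class_weights : wi, the weight carried by each sample of that class
    """
    counts = np.asarray(class_counts, dtype=float)
    weights = np.asarray(class_weights, dtype=float)
    tc = np.sum(weights * counts)   # total weight of all samples in the node
    p = (weights * counts) / tc     # weighted class proportions
    return 1.0 - np.sum(p ** 2)

# With all weights equal to 1, this reduces to plain Gini impurity:
print(weighted_gini([6, 2], [1.0, 1.0]))  # 0.375
# Up-weighting the minority class makes the node look more mixed:
print(weighted_gini([6, 2], [1.0, 3.0]))  # 0.5
```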

Outgoing relations