Random forest

Introduction

Random forest (RF) generates an ensemble of decision trees during training. 

Each tree is the result of applying CART to a random selection of attributes/features at each node, and of using a random subset of the original input data, chosen with replacement (bootstrapping; bagging = bootstrap aggregation).

Response variables are obtained by voting (simple majority, for classification) or averaging (for regression) over the ensemble.
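
As a quick illustration, the following minimal sketch (assuming Python with scikit-learn, and using its bundled iris data purely as an example) trains such an ensemble and reports the accuracy of the majority vote and the feature importances:

  # Minimal sketch of random forest training and majority-vote prediction.
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Each of the 100 trees is fit on a bootstrap sample; at every node only a
  # random subset of the features is considered for the split.
  rf = RandomForestClassifier(n_estimators=100, random_state=0)
  rf.fit(X_train, y_train)

  print(rf.score(X_test, y_test))   # accuracy of the majority vote on held-out data
  print(rf.feature_importances_)    # insight into the importance of each feature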

Advantages

  • Fast and easy to implement
  • Highly accurate predictions (even with high-dimensional input data)
  • Low risk of overfitting
  • Provides insight into the importance of each attribute/feature/dimension
  • Easily parallelizable
  • No data pre-processing needed (e.g. normalizing)

 

Explanation

Input data: N training cases, each with M variables

 

n out of N samples are chosen with replacement (bootstrapping).

The remaining samples are used to estimate the error of the tree (out-of-bag, OOB).

m << M variables are used to determine the decision at each node of the tree.

Each tree is fully grown and not pruned.

 

Output of the ensemble: aggregation of the outputs of the individual trees (majority vote or average)
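
These mechanics can be sketched directly. The following illustrative (not reference) implementation assumes NumPy arrays as input and scikit-learn's DecisionTreeClassifier as the CART base learner; max_features='sqrt' plays the role of m << M, trees are grown without pruning, and the out-of-bag cases give the error estimate:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def fit_forest(X, y, n_trees=25, seed=0):
      """Train n_trees CART trees, each on a bootstrap sample of the N cases.
      Assumes X, y are NumPy arrays and y holds integer class labels."""
      rng = np.random.default_rng(seed)
      N = len(X)
      forest, oob_votes = [], [[] for _ in range(N)]
      for _ in range(n_trees):
          idx = rng.integers(0, N, size=N)          # n = N cases drawn with replacement
          oob = np.setdiff1d(np.arange(N), idx)     # cases not drawn: out of bag for this tree
          # max_features='sqrt': only m << M variables are considered at each node;
          # the tree is fully grown (no pruning) by default.
          tree = DecisionTreeClassifier(max_features='sqrt',
                                        random_state=int(rng.integers(10**6)))
          tree.fit(X[idx], y[idx])
          forest.append(tree)
          if len(oob):
              for i, p in zip(oob, tree.predict(X[oob])):
                  oob_votes[i].append(int(p))
      return forest, oob_votes

  def predict_forest(forest, X):
      """Aggregate the trees' outputs by simple majority vote."""
      votes = np.stack([t.predict(X) for t in forest]).astype(int)
      return np.array([np.bincount(col).argmax() for col in votes.T])

  def oob_error(oob_votes, y):
      """Error rate estimated only from trees that did not see each case."""
      wrong = sum(np.bincount(v).argmax() != yi for v, yi in zip(oob_votes, y) if v)
      voted = sum(1 for v in oob_votes if v)
      return wrong / voted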

Advantages

  • No pruning needed
  • High accuracy
  • Provides variable importance
  • Low risk of overfitting
  • Not very sensitive to outliers

Disadvantages

  • Cannot predict beyond the range of the training data (regression); see the sketch after this list
  • Smooths extreme values (underestimates high values, overestimates low values)
  • More difficult to visualize/interpret
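
The first limitation is easy to demonstrate. This small sketch (assuming scikit-learn, with data generated on the spot for illustration) trains a regression forest on targets between 0 and 10 and then asks for a prediction far outside the training range:

  import numpy as np
  from sklearn.ensemble import RandomForestRegressor

  # Train on y = 2x for x in [0, 5), so all targets lie below 10.
  X_train = np.arange(0, 5, 0.1).reshape(-1, 1)
  y_train = 2 * X_train.ravel()

  rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

  # The true value at x = 10 is 20, but every leaf returns an average of
  # training targets, so the prediction stays near the training maximum (~10).
  print(rf.predict([[10.0]]))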

Examples

Summary of Steps in Random Forest with Bootstrapping

  1. Create Multiple Bootstrap Samples: Generate multiple bootstrap samples from the original dataset.
  2. Train Decision Trees: Train a decision tree on each bootstrap sample.
  3. Aggregate Predictions: For classification, use majority voting. For regression, average the predictions.
  4. Evaluate with OOB Error: Use the out-of-bag samples to estimate the performance of the model.
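
Step 3 can be written out directly. A minimal sketch of the aggregation, assuming the per-tree predictions have already been collected into a trees-by-cases array (the numbers are illustrative):

  import numpy as np

  # tree_preds[i, j] = prediction of tree i for case j (illustrative values)
  tree_preds = np.array([[1, 0, 2],
                         [1, 1, 2],
                         [0, 1, 2]])

  # Classification: majority vote per case (assumes integer class labels).
  majority = np.array([np.bincount(col).argmax() for col in tree_preds.T])
  print(majority)                   # [1 1 2]

  # Regression: average of the tree predictions per case.
  print(tree_preds.mean(axis=0))    # approximately [0.67 0.67 2.0]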

Example

Assume you have a dataset with 100 data points. To build a Random Forest with 10 trees:

  1. Generate 10 Bootstrap Samples: Each sample contains 100 data points selected with replacement from the original dataset.
  2. Train 10 Trees: Train one decision tree on each bootstrap sample.
  3. Predict and Aggregate: For a new data point, each tree makes a prediction. For classification, the class with the most votes is the final prediction. For regression, the final prediction is the average of the tree predictions.
  4. Calculate OOB Error: For each data point, use the trees that did not include it in their bootstrap sample to make a prediction and compare it to the true value to estimate the error.
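
This scenario can be reproduced with scikit-learn. The sketch below substitutes a synthetically generated 100-point dataset for the one in the example; the bootstrap samples, the aggregation, and the OOB estimate are all handled by RandomForestClassifier:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  # Synthetic stand-in for the 100-point dataset of the example.
  X, y = make_classification(n_samples=100, n_features=5, random_state=0)

  # 10 trees, each fit on its own bootstrap sample of 100 points;
  # oob_score=True scores each point only with the trees that did not see it.
  rf = RandomForestClassifier(n_estimators=10, bootstrap=True, oob_score=True,
                              random_state=0)
  rf.fit(X, y)

  print(rf.oob_score_)        # OOB accuracy, i.e. 1 - OOB error
  print(rf.predict(X[:1]))    # majority-vote prediction for one point

With only 10 trees, a few points may never fall out of bag, so scikit-learn can warn that their OOB scores are undefined; in practice many more trees are typically used.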

How to

  • The algorithm selects random samples from the dataset provided.
  • The algorithm creates a decision tree for each selected sample and obtains a prediction from each decision tree.
  • Voting is performed over the predictions: a classification problem uses the mode, a regression problem uses the mean.
  • Finally, the algorithm returns the prediction with the most votes (or the averaged value) as the final prediction.


Contributors

  • Sander Mooren