Random forest

Introduction

Random forest (RF) generates an ensemble of decision trees during training. 

Each tree is the result of applying CART to a random selection of attributes/features at each node, and of using a random subset of the original input data, chosen with replacement (bootstrapping; bagging = bootstrap aggregation).

Response variables are obtained by voting (simple majority, for classification) or averaging (for regression) over the ensemble.
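
As a quick illustration, the following minimal sketch (assuming Python with scikit-learn, and using its bundled iris data purely as an example) trains such an ensemble and reports the accuracy of the majority vote and the feature importances:

  # Minimal sketch of random forest training and majority-vote prediction.
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Each of the 100 trees is fit on a bootstrap sample; at every node only a
  # random subset of the features is considered for the split.
  rf = RandomForestClassifier(n_estimators=100, random_state=0)
  rf.fit(X_train, y_train)

  print(rf.score(X_test, y_test))   # accuracy of the majority vote on held-out data
  print(rf.feature_importances_)    # insight into the importance of each feature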

Advantages

  • Fast and easy to implement
  • Highly accurate predictions (even with high-dimensional input data)
  • Low risk of overfitting
  • Provides insight into the importance of each attribute/feature/dimension
  • Easily parallelizable
  • No data pre-processing needed (e.g. normalizing)

 

Explanation

Input data: N training cases, each with M variables

 

n out of N samples are chosen with replacement (bootstrapping).

The remaining samples are used to estimate the error of the tree (out-of-bag, OOB).

m << M variables are used to determine the decision at each node of the tree.

Each tree is fully grown and not pruned.

 

Output of the ensemble: aggregation of the outputs of the individual trees (majority vote or average)
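
These mechanics can be sketched directly. The following illustrative (not reference) implementation assumes NumPy arrays as input and scikit-learn's DecisionTreeClassifier as the CART base learner; max_features='sqrt' plays the role of m << M, trees are grown without pruning, and the out-of-bag cases give the error estimate:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def fit_forest(X, y, n_trees=25, seed=0):
      """Train n_trees CART trees, each on a bootstrap sample of the N cases.
      Assumes X, y are NumPy arrays and y holds integer class labels."""
      rng = np.random.default_rng(seed)
      N = len(X)
      forest, oob_votes = [], [[] for _ in range(N)]
      for _ in range(n_trees):
          idx = rng.integers(0, N, size=N)          # n = N cases drawn with replacement
          oob = np.setdiff1d(np.arange(N), idx)     # cases not drawn: out of bag for this tree
          # max_features='sqrt': only m << M variables are considered at each node;
          # the tree is fully grown (no pruning) by default.
          tree = DecisionTreeClassifier(max_features='sqrt',
                                        random_state=int(rng.integers(10**6)))
          tree.fit(X[idx], y[idx])
          forest.append(tree)
          if len(oob):
              for i, p in zip(oob, tree.predict(X[oob])):
                  oob_votes[i].append(int(p))
      return forest, oob_votes

  def predict_forest(forest, X):
      """Aggregate the trees' outputs by simple majority vote."""
      votes = np.stack([t.predict(X) for t in forest]).astype(int)
      return np.array([np.bincount(col).argmax() for col in votes.T])

  def oob_error(oob_votes, y):
      """Error rate estimated only from trees that did not see each case."""
      wrong = sum(np.bincount(v).argmax() != yi for v, yi in zip(oob_votes, y) if v)
      voted = sum(1 for v in oob_votes if v)
      return wrong / voted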

Advantages

  • No pruning needed
  • High accuracy
  • Provides variable importance
  • Low risk of overfitting
  • Not very sensitive to outliers

Disadvantages

  • Cannot predict beyond the range of the training data (regression); see the sketch after this list
  • Smooths extreme values (underestimates high values, overestimates low values)
  • More difficult to visualize/interpret
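
The first limitation is easy to demonstrate. This small sketch (assuming scikit-learn, with data generated on the spot for illustration) trains a regression forest on targets between 0 and 10 and then asks for a prediction far outside the training range:

  import numpy as np
  from sklearn.ensemble import RandomForestRegressor

  # Train on y = 2x for x in [0, 5), so all targets lie below 10.
  X_train = np.arange(0, 5, 0.1).reshape(-1, 1)
  y_train = 2 * X_train.ravel()

  rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

  # The true value at x = 10 is 20, but every leaf returns an average of
  # training targets, so the prediction stays near the training maximum (~10).
  print(rf.predict([[10.0]]))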

Examples

Summary of Steps in Random Forest with Bootstrapping

  1. Create Multiple Bootstrap Samples: Generate multiple bootstrap samples from the original dataset.
  2. Train Decision Trees: Train a decision tree on each bootstrap sample.
  3. Aggregate Predictions: For classification, use majority voting. For regression, average the predictions.
  4. Evaluate with OOB Error: Use the out-of-bag samples to estimate the performance of the model.
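
Step 3 can be written out directly. A minimal sketch of the aggregation, assuming the per-tree predictions have already been collected into a trees-by-cases array (the numbers are illustrative):

  import numpy as np

  # tree_preds[i, j] = prediction of tree i for case j (illustrative values)
  tree_preds = np.array([[1, 0, 2],
                         [1, 1, 2],
                         [0, 1, 2]])

  # Classification: majority vote per case (assumes integer class labels).
  majority = np.array([np.bincount(col).argmax() for col in tree_preds.T])
  print(majority)                   # [1 1 2]

  # Regression: average of the tree predictions per case.
  print(tree_preds.mean(axis=0))    # approximately [0.67 0.67 2.0]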

Example

Assume you have a dataset with 100 data points. To build a Random Forest with 10 trees:

  1. Generate 10 Bootstrap Samples: Each sample contains 100 data points selected with replacement from the original dataset.
  2. Train 10 Trees: Train one decision tree on each bootstrap sample.
  3. Predict and Aggregate: For a new data point, each tree makes a prediction. For classification, the class with the most votes is the final prediction. For regression, the final prediction is the average of the tree predictions.
  4. Calculate OOB Error: For each data point, use the trees that did not include it in their bootstrap sample to make a prediction and compare it to the true value to estimate the error.
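
This scenario can be reproduced with scikit-learn. The sketch below substitutes a synthetically generated 100-point dataset for the one in the example; the bootstrap samples, the aggregation, and the OOB estimate are all handled by RandomForestClassifier:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  # Synthetic stand-in for the 100-point dataset of the example.
  X, y = make_classification(n_samples=100, n_features=5, random_state=0)

  # 10 trees, each fit on its own bootstrap sample of 100 points;
  # oob_score=True scores each point only with the trees that did not see it.
  rf = RandomForestClassifier(n_estimators=10, bootstrap=True, oob_score=True,
                              random_state=0)
  rf.fit(X, y)

  print(rf.oob_score_)        # OOB accuracy, i.e. 1 - OOB error
  print(rf.predict(X[:1]))    # majority-vote prediction for one point

With only 10 trees, a few points may never fall out of bag, so scikit-learn can warn that their OOB scores are undefined; in practice many more trees are typically used.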

How to

  • The algorithm selects random samples from the dataset provided.
  • The algorithm creates a decision tree for each selected sample and obtains a prediction from each decision tree.
  • Voting is performed over the predictions: a classification problem uses the mode, a regression problem uses the mean.
  • Finally, the algorithm returns the prediction with the most votes (or the averaged value) as the final prediction.


Contributors

  • Sander Mooren