Random Forests



As mentioned in the post about decision trees, the big challenge to face is overfitting. To address this issue, a concept developed by Tin Kam Ho was introduced and later named and implemented by Leo Breiman. In this approach, distance from the training data is created (no separate test or cross-validation dataset is needed) by considering randomly chosen subsets drawn with replacement, so-called bootstraps. The remaining part, the so-called Out-of-Bag data (on average roughly one third of the training data per tree), is used to validate the classification. In addition, for the construction of each tree only a randomly chosen subset of the splitting features is taken into account (e.g. one third of the features for regression problems and the square root of the number of features for classification), as sketched below.
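To make the bootstrap, Out-of-Bag and feature-subset ideas concrete, here is a minimal sketch in base R; the iris data and all variable names are only illustrative, not part of the original approach:

# Illustrative sketch of the per-tree randomization, using the built-in iris data
set.seed(42)
train <- iris                          # response column: Species
n <- nrow(train)
p <- ncol(train) - 1                   # number of candidate features

# bootstrap: draw n rows with replacement for this tree
boot_idx    <- sample(n, size = n, replace = TRUE)
boot_sample <- train[boot_idx, ]

# Out-of-Bag rows: never drawn for this tree (on average roughly one third),
# usable to validate this tree without a separate test set
oob_idx <- setdiff(seq_len(n), boot_idx)

# random subset of splitting features considered at a split:
# sqrt(p) for classification (p / 3 would be typical for regression)
mtry <- floor(sqrt(p))
split_candidates <- sample(setdiff(names(train), "Species"), mtry)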
The final result for a query is then the aggregated result of all the decision trees: a majority vote for classification and an average for regression. The approach thus exploits the fact that an aggregated decision of a group generally yields better results than an individual decision. The name "random forest" is thereby a nice and intuitive wordplay.
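As a small illustration of the aggregation step (the votes and predictions below are made up): each tree casts a vote and the forest returns the majority class, while for regression the individual tree predictions are simply averaged.

# made-up votes of five trees for one query point
tree_votes <- c("setosa", "versicolor", "setosa", "setosa", "virginica")

# classification: majority vote across the trees
forest_class <- names(which.max(table(tree_votes)))   # "setosa"

# regression: average of the individual tree predictions
tree_preds  <- c(4.1, 3.9, 4.4, 4.0, 4.2)
forest_pred <- mean(tree_preds)                       # 4.12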

As the different trees are independent of each other, both the construction and the evaluation of a random forest can be parallelized. It also reduces the high variance that is often produced by a single decision tree. For further information about the advantages and disadvantages of a random forest I refer to Leo Breiman's site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.
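As a hedged illustration of the parallelization (not taken from Breiman's site): chunks of the forest can be grown independently on separate cores and merged afterwards. The sketch assumes the randomForest package is installed and uses parallel::mclapply, which forks processes on Unix-like systems; the chunk sizes are arbitrary.

library(randomForest)   # assumed installed from CRAN
library(parallel)

# grow four independent chunks of 125 trees each (4 x 125 = 500 trees);
# mclapply forks on Unix-like systems (on Windows use mc.cores = 1)
grow_chunk <- function(i) randomForest(Species ~ ., data = iris, ntree = 125)
chunks <- mclapply(1:4, grow_chunk, mc.cores = 2)

# merge the independently grown chunks into a single forest
forest <- do.call(randomForest::combine, chunks)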

R has its own package for random forests, "randomForest".
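A minimal usage example of the package, assuming it has been installed from CRAN via install.packages("randomForest"):

library(randomForest)

set.seed(42)
model <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

print(model)                            # includes the Out-of-Bag error estimate
importance(model)                       # per-feature importance measures
predict(model, newdata = iris[1:5, ])   # predictions for new observations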