Decision Trees


One of the most important techniques for data analysis is the decision tree. Decision trees are easy to understand, can illustrate complex rule sets and are an effective method for classifying new data.

Are there any complications with decision trees? Sure there are!

There are quite a lot of algorithms for building decision trees. The common picture of building them by following known rules is not the principal way decision trees are used in data science. In data science the rules are usually unknown, meaning the split of the data into the branches and the determination of the leaves is not known from the start. A good algorithm increases the purity on every split, meaning that the disjoint data subsets after a split are purer with regard to a target variable than the data set before the split. In addition to determining and calculating purity, the different algorithms also concentrate on questions around the minimal size of leaves and the number of splits, as one major problem of decision trees is over-fitting.
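
To see how a tree-building algorithm and these anti-over-fitting controls fit together in practice, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the synthetic dataset and the particular values for max_depth and min_samples_leaf are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real classification problem
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The algorithm looks for splits that increase purity (measured here with Gini);
# max_depth and min_samples_leaf cap the number of splits and the leaf size,
# which is the usual guard against over-fitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=20, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))
```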

So how do you measure purity?


The simplest way to measure the purity of a split in a classifier decision tree would be to measure the proportion of the target variable in the created subsets and choose the split that generates the minimal proportion in one of the subsets.
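
As a rough sketch of this naive criterion (the function names and the toy splits are my own, assuming a binary 0/1 target):

```python
from collections import Counter

def target_proportion(labels):
    """Proportion of the target value (here: 1) in one subset."""
    return Counter(labels).get(1, 0) / len(labels)

def naive_split_score(left, right):
    """Naive purity score: the smallest proportion produced by the split."""
    return min(target_proportion(left), target_proportion(right))

# Two candidate splits of the same ten records (five 1s, five 0s)
split_a = ([1, 1, 1, 0], [0, 0, 0, 0, 1, 1])   # subsets still mixed
split_b = ([1, 1, 1, 1, 1], [0, 0, 0, 0, 0])   # perfectly separated
print(naive_split_score(*split_a))  # ~0.33
print(naive_split_score(*split_b))  # 0.0 -> chosen, one subset is completely pure
```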


I found further splitting criteria in Linoff/Berry's book on Data Mining Techniques (btw a very good and understandable reference book for data mining):

The Gini measure simply sums up the squares of the proportions of the target variable's values in a subset and assigns this value to the node. A bad split would not considerably change the proportion of a certain value and would therefore have a Gini measure around 1/n, where n is the number of different values of the target variable. A good split however would have a Gini score close to 1 on the subsets. To get the score of the split, the Gini scores of the subsets are added up, weighted by the size of each subset.
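
A small sketch of this Gini scoring as described above (the helper names and toy splits are my own illustration, not taken from the book):

```python
from collections import Counter

def gini_score(labels):
    """Gini measure of one node: sum of squared class proportions.
    Close to 1 for a pure node, around 1/n for an even mix of n classes."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split_score(subsets):
    """Score of a split: Gini scores of the subsets, weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_score(s) for s in subsets)

# A bad split barely changes the class mix, a good one separates the classes
bad_split  = [[1, 0, 1, 0], [0, 1, 0, 1]]
good_split = [[1, 1, 1, 1], [0, 0, 0, 0]]
print(gini_split_score(bad_split))   # 0.5 (= 1/n for n = 2 values)
print(gini_split_score(good_split))  # 1.0 (both subsets are pure)
```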

The Chi-Square test can be used to measure how good a split is. It measures how likely the split is due to chance, comparing the expected and observed counts of each target value in every subset after the split. For each subset the value is calculated as the square of (observed - expected count) divided by the expected count, for each value of the target; summing these up over all subsets gives the measure for the split. It comes into use in the Chi-Square Automatic Interaction Detector (CHAID) algorithm for decision trees, where it chooses the best split in every step of the decision tree algorithm.
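
Here is a short sketch of that chi-square computation for a split; the helper name and toy data are illustrative, and this only computes the raw statistic (CHAID itself works with the statistical significance of this value when picking splits):

```python
from collections import Counter

def chi_square_split(subsets):
    """Chi-square statistic of a split: for each subset and each target value,
    sum (observed - expected)^2 / expected, where the expected count follows
    the overall class proportions before the split."""
    all_labels = [label for subset in subsets for label in subset]
    total = len(all_labels)
    overall = Counter(all_labels)
    chi2 = 0.0
    for subset in subsets:
        observed = Counter(subset)
        for value, count in overall.items():
            expected = len(subset) * count / total
            chi2 += (observed.get(value, 0) - expected) ** 2 / expected
    return chi2

# A split that could easily be due to chance scores low, a real separation high
print(chi_square_split([[1, 0, 1, 0], [0, 1, 0, 1]]))  # 0.0
print(chi_square_split([[1, 1, 1, 1], [0, 0, 0, 0]]))  # 8.0
```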