Decision Trees

A decision tree is a simple yet widely used classification technique. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions that separate records with different characteristics. Consider the following example for the classification of mammals.

In the above example the attribute Body Temperature is used to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-Mammal is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds. Classifying a test record is straightforward once a decision tree has been constructed: starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test.

Building a decision tree

In principle, there are exponentially many decision trees that can be constructed from a given set of attributes. While some of these trees are more accurate than others, finding the optimal tree is computationally infeasible because of the exponential size of the search space. Nevertheless, many algorithms use greedy strategies to build a reasonably good decision tree.

Hunt’s Algorithm

In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let $D_{t}$ be the set of training records associated with node $t$ and let $y = \{y_{1}, y_{2}, y_{3}, \ldots\}$ be the set of class labels. The following is a recursive definition of Hunt’s algorithm.

  1. If all the records in $D_{t}$ belong to the same class $y_{t}$, then $t$ is a leaf node labeled as $y_{t}$.
  2. If $D_{t}$ contains records that belong to more than one class, an *attribute test condition* is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in $D_{t}$ are distributed to the children based on the outcome. The algorithm is then recursively applied to each child node.

Hunt’s algorithm works only if every combination of attribute values is present in the training data and each combination has a unique class label. These assumptions are too stringent for use in practical situations.

Design issues of decision tree induction

A learning algorithm for inducing a decision tree must address the following two issues.

  1. How should the training records be split? Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets. To implement this step, the algorithm must provide a method for specifying the test condition for different attribute types as well as an objective measure for evaluating the goodness of each test condition.

  2. How should the splitting procedure stop? A stopping condition is needed to terminate the tree-growing process. A possible strategy is to continue expanding the nodes until either all the records belong to the same class or all the records have identical attribute values. Although both conditions are sufficient to stop any decision tree induction algorithm, other criteria can be imposed to allow the tree-growing procedure to terminate earlier.

Methods for expressing attribute test conditions

  • Binary attributes:- The test condition for a binary attribute generates two potential outcomes.

  • Nominal attributes:- Since a nominal attribute can have many values, its test condition can be expressed in two ways: as a multiway split with one outcome per distinct value, or as a binary split by considering all $2^{k-1} - 1$ ways of creating a binary partition of $k$ attribute values (the CART algorithm uses this method).

  • Ordinal attributes:- Ordinal attributes can also produce binary or multiway splits. Ordinal values can be grouped as long as the grouping does not violate the order property of the attribute values.

  • Continuous attributes:- For continuous attributes, the test condition can be expressed as a comparison test ($A < v$ or $A \geq v$) with binary outcomes, or as a range query with outcomes of the form $v_{i} \leq A < v_{i+1}$, for $i = 1, \ldots, k$. For the binary case, the decision tree algorithm must consider all possible split positions $v$ and select the one that produces the best partition. For a multiway split, the algorithm must consider all possible ranges of continuous values.
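The four kinds of test conditions can be illustrated as simple predicates over a record. A minimal Python sketch; the attribute names (other than the body-temperature attribute from the mammal example) and the income threshold are assumptions made for illustration:

```python
# Illustrative test conditions for each attribute type. The attribute names
# and the threshold value are hypothetical examples.

def warm_blooded(record):            # binary: exactly two outcomes
    return record["body_temperature"] == "warm"

def marital_status(record):          # nominal: one outcome per value (multiway)
    return record["marital_status"]

def small_or_medium(record):         # ordinal: grouping that preserves order
    return record["shirt_size"] in ("small", "medium")

def income_below_80k(record):        # continuous: comparison test A < v
    return record["annual_income"] < 80_000
```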

Methods for selecting the best split

There are many measures that can be used to determine the best way to split the records. These measures are defined in terms of the class distribution of the records before and after splitting.

Let $p(i|t)$ denote the fraction of records belonging to class $i$ at a given node $t$. In a two-class problem, the class distribution at any node can be written as $(p_{0}, p_{1})$, where $p_{1} = 1 - p_{0}$.

The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. For example, a node with class distribution (0, 1) has zero impurity, whereas a node with the uniform class distribution (0.5, 0.5) has the highest impurity. Examples of impurity measures are:-

  • Entropy:- $$Entropy(t) = - \sum_{i=0}^{c-1} p(i|t) \log_{2} p(i|t)$$
  • Gini:- $$Gini(t) = 1 - \sum_{i=0}^{c-1}[p(i|t)]^{2}$$
  • Classification error:- $$Error(t) = 1 - \max_{i}[p(i|t)]$$

where $c$ is the number of classes and $0\log_{2}0 = 0$ in entropy calculations. The following graph depicts a comparison of the impurity measures for binary classification problems as the class distribution varies.
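The three measures can be written as short Python functions. This is a sketch, assuming `p` is the list of class fractions $p(i|t)$ at a node, with the $0\log_{2}0 = 0$ convention handled explicitly:

```python
import math

def entropy(p):
    # terms with p_i = 0 contribute nothing, by the 0*log2(0) = 0 convention
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1 - max(p)
```

For the uniform distribution (0.5, 0.5) these return 1, 0.5, and 0.5 respectively, the highest-impurity case described above.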

Some examples of the calculation of the various impurity measures for a binary class distribution are as follows:-

| Node N1 | Count |
|---------|-------|
| Class=0 | 0     |
| Class=1 | 6     |

$$Gini = 1 - (\frac{0}{6})^2 - (\frac{6}{6})^2 = 0$$ $$Entropy = - \frac{0}{6}\log_{2}(\frac{0}{6}) - \frac{6}{6}\log_{2}(\frac{6}{6}) = 0$$ $$Error = 1 - max(\frac{0}{6}, \frac{6}{6}) = 0$$

| Node N2 | Count |
|---------|-------|
| Class=0 | 1     |
| Class=1 | 5     |

$$Gini = 1 - (\frac{1}{6})^2 - (\frac{5}{6})^2 = 0.278$$ $$Entropy = - \frac{1}{6}\log_{2}(\frac{1}{6}) - \frac{5}{6}\log_{2}(\frac{5}{6}) = 0.650$$ $$Error = 1 - max(\frac{1}{6}, \frac{5}{6}) = 0.167$$

| Node N3 | Count |
|---------|-------|
| Class=0 | 3     |
| Class=1 | 3     |

$$Gini = 1 - (\frac{3}{6})^2 - (\frac{3}{6})^2 = 0.5$$ $$Entropy = - \frac{3}{6}\log_{2}(\frac{3}{6}) - \frac{3}{6}\log_{2}(\frac{3}{6}) = 1$$ $$Error = 1 - max(\frac{3}{6}, \frac{3}{6}) = 0.5$$

Now to determine how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes. The larger their difference, the better the test condition. The gain, $\Delta$, is a criterion that can be used to determine the goodness of the split:

$$\Delta = I(parent) - \sum_{j=1}^{k} \frac{N(v_{j})}{N}I(v_{j})$$

where $I(\cdot)$ is the impurity measure of a given node, $N$ is the total number of records at the parent node, $k$ is the number of attribute values, and $N(v_{j})$ is the number of records associated with the child node $v_{j}$. Decision tree induction algorithms often choose a test condition that maximizes the gain $\Delta$. When entropy is used as the impurity measure in the above equation, the difference in entropy is known as the information gain ($\Delta_{info}$).
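The gain can be computed directly from per-class record counts. A minimal sketch, assuming the gini index is used for $I$:

```python
def gini_from_counts(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j), with I = gini."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini_from_counts(child)
                   for child in children_counts)
    return gini_from_counts(parent_counts) - weighted
```

For a parent with class counts (6, 6) split into children with counts (4, 3) and (2, 3), the weighted child impurity is 0.486 and the gain is about 0.014.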

Consider the following examples of splits in various attribute types.

  • Binary attributes:- Consider the following diagram. Suppose there are two ways to split the data into smaller subsets. Before splitting, the gini index is 0.5 since both classes contain an equal number of records. If attribute A is chosen for the split, the gini index of node N1 is $$1 - (\frac{4}{7})^2 - (\frac{3}{7})^2 = 0.4898$$ and the gini index of node N2 is $$1 - (\frac{2}{5})^2 - (\frac{3}{5})^2 = 0.48$$ The weighted average of the gini index is then $$\frac{7}{12} \times 0.4898 + \frac{5}{12} \times 0.48 = 0.486$$ Similarly, the weighted gini index for attribute B can be calculated as 0.375. Since the subsets for attribute B have a smaller gini index, attribute B is preferred over attribute A.
  • Nominal attributes:- As previously discussed, a nominal attribute can produce either a binary or a multiway split. Consider the following example. The gini index of the child nodes can be calculated in a similar way as shown previously, and we find that the multiway split produces the lowest gini index, so it is preferred in this example.
  • Continuous attributes:- Consider the following example. Using the same technique we can find the appropriate split condition here as well, which is the split at $v = 97$, since it yields the lowest gini index.
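The continuous-attribute case can be sketched as an exhaustive scan over candidate split positions, taking the midpoint between each pair of adjacent distinct values as a candidate $v$. The data in the test below is hypothetical, not the example from the text:

```python
def best_gini_split(values, labels):
    """Scan candidate split positions v (midpoints of adjacent distinct
    values) and return the v whose binary split A < v vs. A >= v gives
    the lowest weighted gini index, together with that index."""
    def gini(group):
        n = len(group)
        if n == 0:
            return 0.0
        return 1 - sum((group.count(c) / n) ** 2 for c in set(group))

    distinct = sorted(set(values))
    best_v, best_score = None, float("inf")
    for a, b in zip(distinct, distinct[1:]):
        v = (a + b) / 2
        left = [y for x, y in zip(values, labels) if x < v]
        right = [y for x, y in zip(values, labels) if x >= v]
        n = len(labels)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score
```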

Gain ratio

Gain ratio is a splitting criterion used to determine the goodness of a split; it adjusts the information gain to penalize test conditions that produce a large number of small partitions. The criterion is defined as $$Gain\ Ratio = \frac{\Delta_{info}}{Split\ Info}$$ where $Split\ Info = - \sum_{i=1}^{k}P(v_{i})\log_{2}P(v_{i})$ and $k$ is the total number of splits.
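A sketch of the split-info and gain-ratio computation, assuming $P(v_{i})$ is the fraction of records routed to the $i$-th child:

```python
import math

def split_info(child_sizes):
    """Split Info = -sum_i P(v_i) log2 P(v_i), where P(v_i) is the
    fraction of records sent to the i-th child node."""
    n = sum(child_sizes)
    return -sum(s / n * math.log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(delta_info, child_sizes):
    return delta_info / split_info(child_sizes)
```

A split into many equally sized children has a large split info (e.g. 2.0 for four equal children versus 1.0 for two), so its information gain is discounted more heavily.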

General Algorithm for decision tree induction

A skeleton decision tree induction algorithm called TreeGrowth is shown below.

TreeGrowth(E, F):
  if stopping_condition(E, F):
    leaf = createNode()
    leaf.label = Classify(E)
    return leaf
  else:
    root = createNode()
    root.test_cond = find_best_split(E, F)
    let V = {v | v is a possible outcome of root.test_cond}
    for each v in V:
      Ev = {e | root.test_cond(e) = v and e in E}
      child = TreeGrowth(Ev, F)
      add child as descendant of root and label the edge (root --> child) as v
    end for
  end if
  return root
  • The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
  • The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records, using measures such as the gini index or entropy.
  • The Classify() function determines the class label to be assigned to a leaf node.
  • The stopping_condition() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values.
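Under the assumptions that records are `(attribute_dict, label)` pairs, splits are multiway on nominal attributes, and the gini index drives `find_best_split`, the skeleton above can be turned into a small runnable sketch (an illustration, not a reference implementation):

```python
from collections import Counter

def stopping_condition(E, F):
    # stop when all records share one class label, or no attributes remain
    labels = {y for _, y in E}
    return len(labels) <= 1 or not F

def classify(E):
    # majority class label among the records reaching this leaf
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    # pick the attribute whose multiway split minimizes weighted gini
    def gini(records):
        n = len(records)
        counts = Counter(y for _, y in records)
        return 1 - sum((c / n) ** 2 for c in counts.values())
    def weighted_gini(attr):
        parts = {}
        for x, y in E:
            parts.setdefault(x[attr], []).append((x, y))
        return sum(len(p) / len(E) * gini(p) for p in parts.values())
    return min(F, key=weighted_gini)

def tree_growth(E, F):
    if stopping_condition(E, F):
        return {"label": classify(E)}
    attr = find_best_split(E, F)
    root = {"test_cond": attr, "children": {}}
    partitions = {}
    for x, y in E:
        partitions.setdefault(x[attr], []).append((x, y))
    for v, Ev in partitions.items():
        root["children"][v] = tree_growth(Ev, [f for f in F if f != attr])
    return root
```

On a toy dataset the tree simply picks whichever attribute yields the purest children at each level; real implementations add the refinements discussed above (binary splits, continuous attributes, early stopping, pruning).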

After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting.