Question:

Exercise 7.10 In choosing which feature to split on in decision-tree search, an alternative heuristic to the max information split of Section 7.3.1 is to use the Gini index.

The Gini index of a set of examples (with respect to target feature Y) is a measure of the impurity of the examples:

$$\mathit{gini}_Y(\mathit{Examples}) = 1 - \sum_{\mathit{Val}} \left( \frac{|\{e \in \mathit{Examples} : \mathit{val}(e, Y) = \mathit{Val}\}|}{|\mathit{Examples}|} \right)^2$$

where |{e ∈ Examples : val(e, Y) = Val}| is the number of examples with value Val of feature Y, and |Examples| is the total number of examples. The Gini index is always non-negative and has value zero only if all of the examples have the same value on the feature. The Gini index reaches its maximum value when the examples are evenly distributed among the values of the features.
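The definition can be checked numerically with a minimal Python sketch. It assumes, purely for illustration, that each example is a dict mapping feature names to values; the function name `gini` and the choice to treat an empty set as pure are assumptions, not part of the exercise.

```python
from collections import Counter


def gini(examples, target):
    """Gini index of a list of dict-valued examples w.r.t. the target feature."""
    if not examples:
        return 0.0  # assumption: an empty set is treated as pure
    n = len(examples)
    # Count how many examples take each value of the target feature.
    counts = Counter(e[target] for e in examples)
    # 1 minus the sum of squared value proportions.
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical toy data: one pure set and one evenly split set.
pure = [{"Y": "a"}, {"Y": "a"}]
even = [{"Y": "a"}, {"Y": "b"}]
print(gini(pure, "Y"), gini(even, "Y"))
```

For a pure set the single proportion is 1, so the index is 0; for two equally frequent values it is 1 − (1/4 + 1/4) = 1/2.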

One heuristic for choosing which property to split on is to choose the split that minimizes the total impurity of the training examples on the target feature, summed over all of the leaves.
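One way to read this heuristic is sketched below in Python. It assumes each leaf's Gini index is weighted by the fraction of examples that reach it, which is a common convention but only one interpretation of "summed over all of the leaves"; an unweighted sum is another. The helper names `split_impurity` and `best_split`, and the dict-based example format, are hypothetical.

```python
from collections import Counter, defaultdict


def gini(examples, target):
    """Gini index of a list of dict-valued examples w.r.t. the target feature."""
    n = len(examples)
    counts = Counter(e[target] for e in examples)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(examples, feature, target):
    """Size-weighted Gini impurity of the leaves produced by splitting on feature."""
    groups = defaultdict(list)
    for e in examples:
        groups[e[feature]].append(e)
    n = len(examples)
    # Assumption: each leaf is weighted by the fraction of examples it holds.
    return sum(len(g) / n * gini(g, target) for g in groups.values())


def best_split(examples, features, target):
    """Return the feature whose split gives the lowest weighted impurity."""
    return min(features, key=lambda f: split_impurity(examples, f, target))


# Hypothetical toy data: A determines Y exactly, B carries no information.
data = [
    {"A": 0, "B": 0, "Y": "no"},
    {"A": 0, "B": 1, "Y": "no"},
    {"A": 1, "B": 0, "Y": "yes"},
    {"A": 1, "B": 1, "Y": "yes"},
]
print(best_split(data, ["A", "B"], "Y"))
```

Here splitting on A yields two pure leaves, while splitting on B leaves both leaves evenly mixed, so the heuristic prefers A. A full decision-tree learner would apply this choice recursively to each leaf until a stopping criterion is met.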

(a) Implement a decision-tree search algorithm that uses the Gini index.

(b) Try both the Gini index algorithm and the maximum information split algorithm on some databases and see which results in better performance.

(c) Find an example database where the Gini index finds a different tree than the maximum information gain heuristic. Which heuristic seems to be better for this example? Consider which heuristic seems more sensible for the data at hand.

(d) Try to find an example database where the maximum information split seems more sensible than the Gini index, and try to find another example for which the Gini index seems better. [Hint: Try extreme distributions.]
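As a possible starting point for parts (c) and (d), the following Python sketch tabulates the Gini index next to the entropy for a few class distributions, including skewed ones. It assumes the maximum information split of Section 7.3.1 is based on entropy measured in bits; the function names and the particular distributions are illustrative choices only.

```python
import math


def gini_from_probs(probs):
    """Gini index of a class distribution given as a list of proportions."""
    return 1.0 - sum(p * p for p in probs)


def entropy_from_probs(probs):
    """Entropy in bits; zero-probability values contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical distributions, from even to extreme.
distributions = [
    [0.5, 0.5],
    [0.9, 0.1],
    [0.99, 0.01],
    [0.999, 0.001],
]

for probs in distributions:
    g = gini_from_probs(probs)
    h = entropy_from_probs(probs)
    print(f"{probs}: gini={g:.4f} entropy={h:.4f}")
```

Comparing how quickly each measure approaches zero as the distribution becomes more extreme may suggest where to look for databases on which the two heuristics choose different splits.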
