Question:

The file P17_26.xlsx contains Gender, Age, Education, and Success (Yes/No) data for 1000 people. The purpose is to see how a classification tree method can use the first three variables to classify Success. You start with 564 Yes values and 436 No values. This is quite diverse (close to 50-50), and as explained in the file, it has a diversity index of 0.9836, where the highest possible value is 1. The question you are asked to explore is which splits you should make to reduce this diversity index, that is, to make the subsets purer. Directions are given in the file. (Note that the method suggested is only one variation of splitting and measuring diversity in classification trees. When a Microsoft Data Mining add-in (not discussed here) is used on this data set, it finds an extremely simple rule: Classify as Yes if Education is UG or G, and classify as No if Education is HS. This is slightly different from what your method will find.)
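
The starting diversity value can be checked with a short sketch. The exact formula is defined in the file and is not quoted in the question, so the one used here is an assumption: a scaled Gini index, 2 * (1 - p_yes^2 - p_no^2), which equals 1 for a 50-50 split and reproduces the stated 0.9836 for 564 Yes and 436 No. The function names `diversity` and `split_diversity` are hypothetical, and scoring a split by the size-weighted average of its subsets' diversity is likewise an assumption, not necessarily the procedure given in the file.

```python
def diversity(yes, no):
    """Assumed diversity index: 2 * (1 - p_yes**2 - p_no**2).

    Returns 0 for a pure subset and 1 for an even 50-50 subset.
    """
    total = yes + no
    if total == 0:
        return 0.0  # treat an empty subset as contributing no diversity
    p_yes = yes / total
    p_no = no / total
    return 2 * (1 - p_yes**2 - p_no**2)


def split_diversity(subsets):
    """Size-weighted average diversity over a list of (yes, no) subsets.

    Assumed way to score a candidate split; lower means purer subsets.
    """
    total = sum(yes + no for yes, no in subsets)
    return sum((yes + no) / total * diversity(yes, no) for yes, no in subsets)


# Root node from the question: 564 Yes and 436 No.
print(round(diversity(564, 436), 4))  # 0.9836

# Limiting case: a split that separates Yes from No perfectly scores 0.
print(split_diversity([(564, 0), (0, 436)]))  # 0.0
```

To compare candidate splits on Gender, Age, or Education, one would tally the Yes/No counts in each resulting subset from the file and pass them to `split_diversity`; the split with the lowest value reduces diversity the most under these assumptions.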