
Question

1. Decision trees

As part of this question you will implement and compare the Information Gain, Gini Index and CART evaluation measures for splits in decision tree construction.

Let $D = (X, y)$, $|D| = n$, be a dataset with $n$ samples. The entropy of the dataset is defined as

$$H(D) = -\sum_{i=1}^{2} P(c_i \mid D) \log_2 P(c_i \mid D),$$

where $P(c_i \mid D)$ is the fraction of samples in class $i$. A split on an attribute of the form $X < c$ partitions the dataset into two subsets $D_Y$ and $D_N$ based on whether samples satisfy the split predicate or not, respectively. The split entropy is the weighted average entropy of the resulting datasets $D_Y$ and $D_N$:

$$H(D_Y, D_N) = \frac{n_Y}{n} H(D_Y) + \frac{n_N}{n} H(D_N),$$

where $n_Y$ is the number of samples in $D_Y$ and $n_N$ is the number of samples in $D_N$. The Information Gain (IG) of a split is defined as the difference between the entropy and the split entropy:

$$IG(D, D_Y, D_N) = H(D) - H(D_Y, D_N). \tag{1}$$

The higher the information gain the better.

The Gini index of a dataset is defined as

$$G(D) = 1 - \sum_{i=1}^{2} P(c_i \mid D)^2,$$

and the Gini index of a split is defined as the weighted average of the Gini indices of the resulting partitions:

$$G(D_Y, D_N) = \frac{n_Y}{n} G(D_Y) + \frac{n_N}{n} G(D_N). \tag{2}$$

The lower the Gini index the better.

Finally, the CART measure of a split is defined as:

$$CART(D_Y, D_N) = 2 \, \frac{n_Y}{n} \, \frac{n_N}{n} \sum_{i=1}^{2} \left| P(c_i \mid D_Y) - P(c_i \mid D_N) \right|. \tag{3}$$

The higher the CART the better.

You will need to fill in the implementation of the three measures in the provided Python code as part of the homework. Note: You are not allowed to use existing implementations of the measures.

The homework includes two data files, train.txt and test.txt. The first consists of 100 observations to use to train your classifiers; the second has 10 to test. Each file is comma-separated, and each row contains 11 values: the first 10 are attributes (a mix of numeric and categorical translated to numeric, e.g. {T,F} = {0,1}), and the final value is the true class of that observation. You will need to separate attributes and class in your load(filename) function.

(a) [10 pts.] Implement the IG(D, index, value) function according to equation 1, where D is a dataset, index is the index of an attribute and value is the split value such that the split is of the form X < value. The function should return the value of the information gain.

(b) [10 pts.] Implement the G(D, index, value) function according to equation 2, where D is a dataset, index is the index of an attribute and value is the split value such that the split is of the form X < value. The function should return the value of the Gini index.
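The definitions above can be sketched in plain Python as follows. This is only one possible reading of the assignment, not the provided starter code: it assumes D is a tuple (X, y) of attribute rows and class labels, that the split predicate is the strict comparison X < value from part (b), and that CART takes the same (D, index, value) arguments as the other two measures. The helper names class_probs, entropy, gini and split_labels are hypothetical.

```python
import math
from collections import Counter


def load(filename):
    """Read a comma-separated file; the last value on each row is the class."""
    X, y = [], []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            values = [float(v) for v in line.split(",")]
            X.append(values[:-1])      # attributes
            y.append(int(values[-1]))  # class label
    return X, y


def class_probs(labels, classes):
    """P(c_i | D) for each class in `classes` (hypothetical helper)."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n if n else 0.0 for c in classes]


def entropy(labels):
    """H(D); terms with zero probability contribute nothing."""
    probs = class_probs(labels, set(labels))
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gini(labels):
    """G(D) = 1 - sum of squared class fractions."""
    return 1.0 - sum(p * p for p in class_probs(labels, set(labels)))


def split_labels(D, index, value):
    """Labels of D_Y (predicate holds) and D_N (it does not).

    Assumption: the predicate is X[index] < value, as written in part (b).
    """
    X, y = D
    y_yes = [c for row, c in zip(X, y) if row[index] < value]
    y_no = [c for row, c in zip(X, y) if not row[index] < value]
    return y_yes, y_no


def IG(D, index, value):
    """Information gain, equation (1)."""
    y_yes, y_no = split_labels(D, index, value)
    n = len(D[1])
    split_entropy = (len(y_yes) / n) * entropy(y_yes) + (len(y_no) / n) * entropy(y_no)
    return entropy(D[1]) - split_entropy


def G(D, index, value):
    """Gini index of the split, equation (2)."""
    y_yes, y_no = split_labels(D, index, value)
    n = len(D[1])
    return (len(y_yes) / n) * gini(y_yes) + (len(y_no) / n) * gini(y_no)


def CART(D, index, value):
    """CART measure of the split, equation (3)."""
    y_yes, y_no = split_labels(D, index, value)
    n = len(D[1])
    classes = sorted(set(D[1]))  # same class order for both partitions
    p_yes = class_probs(y_yes, classes)
    p_no = class_probs(y_no, classes)
    diff = sum(abs(a - b) for a, b in zip(p_yes, p_no))
    return 2 * (len(y_yes) / n) * (len(y_no) / n) * diff
```

For a toy dataset such as D = ([[1], [2], [3], [4]], [0, 0, 1, 1]), the split at index 0 with value 3 separates the two classes perfectly, so this sketch gives an information gain of 1.0, a split Gini index of 0.0 and a CART value of 1.0. An empty partition is weighted by zero, so it does not affect any of the three measures.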

Step by Step Solution


Step: 1

Importing libraries: import numpy as nm; import matplotlib.pyplot as mtp; import pandas as pd; importing ...
