
Question


In this problem, we consider splitting when building a regression tree in the CART algorithm. For simplicity, we assume that there is a single feature $X$, a dependent variable $Y$, and a training dataset $(x_1, y_1), \dots, (x_n, y_n)$. We also assume, again for simplicity, that we are considering the initial split at the top (root node) of the tree. An arbitrary split simply divides the training dataset into a partition of size two. By appropriately reshuffling the data, we can represent this partition (again for simplicity) via two sub-datasets $(x_1, y_1), \dots, (x_N, y_N)$ and $(x_{N+1}, y_{N+1}), \dots, (x_n, y_n)$, where $N$ is the index of the last observation included in the first set. Assume throughout that our impurity function is the RSS error, the standard choice for a regression tree. Please answer the following:

a) What is the total impurity value before the split? (This is the total impurity of the "null tree" or the "baseline model".)

b) What is the total impurity value after the split? (This is the total impurity of the tree with the split as defined above.)

c) Show that the total impurity value after the split is always less than or equal to the total impurity value before the split, i.e., splitting never increases the total impurity cost function. (Hint: you can use the fact that, given a sequence of real numbers $z_1, z_2, \dots, z_n$, the mean $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$ is the minimizer of the function $\mathrm{RSS}(z) = \sum_{i=1}^{n} (z_i - z)^2$.)
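To make parts (a)-(c) concrete, here is a minimal Python sketch, not part of the original problem, that computes the root-node RSS and the post-split RSS for every split index N on toy data and checks the inequality of part (c); the helper names rss and split_rss are illustrative.

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its own mean."""
    return float(np.sum((y - y.mean()) ** 2))

def split_rss(y, N):
    """Total impurity after splitting the (reshuffled) data into y[:N] and y[N:]."""
    return rss(y[:N]) + rss(y[N:])

# Toy responses; the inequality below holds for any y and any split index N.
rng = np.random.default_rng(0)
y = rng.normal(size=20)

before = rss(y)                       # part (a): impurity of the null tree
for N in range(1, len(y)):            # every possible two-way partition
    after = split_rss(y, N)           # part (b): impurity after the split at N
    assert after <= before + 1e-12    # part (c): splitting never increases RSS
```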

Step by Step Solution

There are 3 steps involved in this solution.

Step: 1
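A sketch of what part (a) amounts to, using the standard RSS impurity: the null tree predicts the overall mean $\bar{y}$ for every observation, so the total impurity before the split is

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \mathrm{RSS}_{\text{before}} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 .$$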


Step: 2
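A sketch of part (b), assuming each child node predicts the mean of its own subset: with left-node mean $\bar{y}_L$ and right-node mean $\bar{y}_R$, the total impurity after the split is the sum of the two within-node RSS terms,

$$\bar{y}_L = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \bar{y}_R = \frac{1}{n-N}\sum_{i=N+1}^{n} y_i, \qquad \mathrm{RSS}_{\text{after}} = \sum_{i=1}^{N} \left( y_i - \bar{y}_L \right)^2 + \sum_{i=N+1}^{n} \left( y_i - \bar{y}_R \right)^2 .$$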


Step: 3
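A sketch of the argument for part (c), using the hint that the mean of a set of numbers minimizes the RSS over all constant predictions: applying this fact separately to each child node and then summing the two bounds,

$$\sum_{i=1}^{N} \left( y_i - \bar{y}_L \right)^2 \le \sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2, \qquad \sum_{i=N+1}^{n} \left( y_i - \bar{y}_R \right)^2 \le \sum_{i=N+1}^{n} \left( y_i - \bar{y} \right)^2,$$

$$\text{so} \quad \mathrm{RSS}_{\text{after}} \le \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = \mathrm{RSS}_{\text{before}} .$$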

