Question
So I currently have this problem that I need to solve, but I am very stuck on the actual solution. It revolves around the algorithm for building a CART model in the case of regression.
I've added the images for the problem here in the question:
Problem 1: (30 points) Consider the algorithm for building a CART model in the case of regression. Following and expanding on the notation from class, suppose that our current tree, denoted by $T_{\text{old}}$, has $|T_{\text{old}}| = M$ terminal nodes/buckets. For each bucket $m = 1, \ldots, M$, let:

1. $N_m$ denote the number of observations in bucket $m$,
2. $Q_m(T_{\text{old}})$ denote the value of the impurity function at bucket $m$, and
3. $R_m$ denote the region in the feature space corresponding to bucket $m$.

Also let $N$ be the overall total number of observations. Recall that, in the case of regression, we have that:

$$Q_m(T_{\text{old}}) = \frac{1}{N_m} \sum_{i:\, x_i \in R_m} (y_i - \bar{y}_m)^2,$$

where $\bar{y}_m = \frac{1}{N_m} \sum_{i:\, x_i \in R_m} y_i$ is the mean response in bucket $m$. Then the total impurity cost of the tree $T_{\text{old}}$ is defined as:

$$C_{\text{imp}}(T_{\text{old}}) = \sum_{m=1}^{M} N_m Q_m(T_{\text{old}}).$$

Consider a potential split at the final bucket $M$ (we're using $M$ just for ease of notation), which results in a new tree $T_{\text{new}}$. This new tree has $|T_{\text{new}}| = M + 1$ terminal nodes/buckets, and for this new tree we let:

1. $N_m$ denote the number of observations in bucket $m$,
2. $Q_m(T_{\text{new}})$ denote the value of the impurity function at bucket $m$, and
3. $R_m$ denote the region in the feature space corresponding to bucket $m$.

The total impurity cost of the tree $T_{\text{new}}$ is defined analogously as:

$$C_{\text{imp}}(T_{\text{new}}) = \sum_{m=1}^{M+1} N_m Q_m(T_{\text{new}}).$$

Please answer the following:

a) (10 points) Let $\Delta = C_{\text{imp}}(T_{\text{old}}) - C_{\text{imp}}(T_{\text{new}})$ be the absolute decrease in total impurity resulting from the split. Derive a formula for $\Delta$ that can be computed locally at the bucket $M$; in other words, it should only depend on the data points that fall in region $R_M$ in the original tree $T_{\text{old}}$. (Hint: we've discussed this concept in class; this question is asking for a more formal argument. You may assume that the two new buckets in $T_{\text{new}}$ resulting from the split are labeled as buckets $M$ and $M + 1$ in $T_{\text{new}}$.)

b) (10 points) Show that $\Delta \geq 0$, hence splitting always reduces the total impurity cost. (Hint: you can use the fact that, given a sequence of real numbers $z_1, z_2, \ldots, z_n$, the mean $\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$ is the minimizer of the function $\mathrm{RSS}(z) = \sum_{i=1}^{n} (z_i - z)^2$.)

c) (10 points) Let $R^2_{\text{old}}$ be the training set $R^2$ value for the model defined by $T_{\text{old}}$, and likewise let $R^2_{\text{new}}$ be the training set $R^2$ value for the model defined by $T_{\text{new}}$. Let $\mathrm{SST} = \sum_{i=1}^{N} (y_i - \bar{y})^2$ be the total sum of squared errors, where $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ is the overall mean. For a given value of the complexity parameter (cp) $\alpha \geq 0$, recall the modified cost function that is relevant in the pruning step:

$$C_\alpha(T) = C_{\text{imp}}(T) + \alpha \cdot \mathrm{SST} \cdot |T|.$$

Show that $C_\alpha(T_{\text{new}}) \leq C_\alpha(T_{\text{old}})$ if and only if $R^2_{\text{new}} - R^2_{\text{old}} \geq \alpha$. (Hence the choice of retaining a split if the increase in $R^2$ is at least $\alpha$ is equivalent to retaining a split if the modified cost function is smaller after the split.)
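A quick numerical sanity check can help before writing the formal argument. The sketch below is not a proof and is not taken from the course material. It uses made-up data and hypothetical names (`sse`, `local_decrease`, `keep_by_cost`, `keep_by_r2`), and it assumes the split of bucket $M$ is a simple threshold on one feature. Because $N_m Q_m$ is just the sum of squared deviations from the bucket mean, the untouched buckets cancel in $\Delta$, which suggests that $\Delta$ equals the parent's sum of squared deviations minus the sums for the two children. The code checks that this local quantity matches the global difference, that it is non-negative, and that the cost-based and $R^2$-based retention rules agree for a few values of $\alpha$.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean, i.e. N_m * Q_m for one bucket."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def local_decrease(parent, left, right):
    """Candidate formula for Delta using only responses from bucket M of the old tree."""
    return sse(parent) - sse(left) - sse(right)

random.seed(0)

# Hypothetical data: two buckets that the split leaves untouched ...
untouched = [
    [random.gauss(0.0, 1.0) for _ in range(5)],
    [random.gauss(3.0, 1.0) for _ in range(7)],
]
# ... and bucket M, stored as (feature, response) pairs and split at x < 0.5.
bucket_M = [(random.random(), random.gauss(1.0, 1.0)) for _ in range(10)]
parent = [y for _, y in bucket_M]
left = [y for x, y in bucket_M if x < 0.5]
right = [y for x, y in bucket_M if x >= 0.5]

# Total impurity cost of the old tree (3 leaves) and the new tree (4 leaves).
c_old = sum(sse(b) for b in untouched) + sse(parent)
c_new = sum(sse(b) for b in untouched) + sse(left) + sse(right)
delta = local_decrease(parent, left, right)

# Training-set R^2 = 1 - C_imp / SST, since each leaf predicts its own mean.
all_y = [y for b in untouched for y in b] + parent
sst = sse(all_y)
r2_old = 1.0 - c_old / sst
r2_new = 1.0 - c_new / sst

def keep_by_cost(alpha, n_leaves_old=3):
    """Retain the split if the modified cost C_alpha does not increase."""
    cost_old = c_old + alpha * sst * n_leaves_old
    cost_new = c_new + alpha * sst * (n_leaves_old + 1)
    return cost_new <= cost_old

def keep_by_r2(alpha):
    """Retain the split if the gain in R^2 is at least alpha."""
    return r2_new - r2_old >= alpha

print("global decrease:", c_old - c_new)
print("local decrease: ", delta)
print("R^2 gain:", r2_new - r2_old, "delta / SST:", delta / sst)
for alpha in (0.001, 0.3, 0.9):
    print(alpha, keep_by_cost(alpha), keep_by_r2(alpha))
```

The printed $R^2$ gain matching `delta / sst` hints at the algebra behind part c): dividing the inequality $C_\alpha(T_{\text{new}}) \leq C_\alpha(T_{\text{old}})$ by $\mathrm{SST}$ leaves $\Delta / \mathrm{SST} \geq \alpha$.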