Question
First, we split the data set into a training set and a test set using the following commands.
library(ISLR)
data("Credit")
set.seed(15)
Credit <- Credit[,-1] # remove ID column
train <- sample(nrow(Credit), 300)
Credit.train <- Credit[train, ]
Credit.test <- Credit[-train, ]
(a) Fit a tree to the training data, with Balance as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training MSE? How many terminal nodes does the tree have?
(b) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
(c) Create a plot of the tree, and interpret the results.
(d) Predict the response on the test data. What is the test MSE?
(e) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
(f) Produce a plot with tree size on the x-axis and cross-validated error on the y-axis.
(g) Which tree size corresponds to the lowest cross-validated error?
(h) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
(i) Compare the training MSEs between the pruned and unpruned trees. Which is higher?
(j) Compare the test MSEs between the pruned and unpruned trees. Which is higher?
(k) Fit a bagging model to the training set with Balance as the response and the other variables as predictors. Use 1,000 trees (ntree = 1000). Use the importance() function to determine which variables are most important.
(l) Use the bagging model to predict the response on the test data. Compute the test MSE.
(m) Fit a random forest model to the training set with Balance as the response and the other variables as predictors. Use 1,000 trees (ntree = 1000). Use the importance() function to determine which variables are most important.
(n) Use the random forest to predict the response on the test data. Compute the test MSE.
(o) Fit a boosting model to the training set with Balance as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01 (λ = 0.01). Which predictors appear to be the most important?
(p) Use the boosting model to predict the response on the test data. Compute the test MSE.
(q) Fit a GAM to the training set with Balance as the response and the other variables as predictors, and use the GAM to predict the response on the test data. Compute the test MSE.
(r) Compare the test MSEs between the unpruned trees, pruned trees, bagging, random forest, boosting, and GAM. Which performs the best?
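The parts above can be worked through with a script along the following lines. This is a minimal sketch, not a reference solution. It assumes the tree, randomForest, gbm, and gam packages are installed; the helper mse() and all object names other than Credit, Credit.train, and Credit.test are hypothetical; and the GAM formula (smoothing splines on four numeric predictors, linear terms elsewhere) is only one possible specification, since the question does not fix one. No numeric results are shown because they depend on the package and R versions in use.

```r
library(ISLR)          # Credit data
library(tree)          # tree(), cv.tree(), prune.tree()
library(randomForest)  # bagging and random forests
library(gbm)           # boosting
library(gam)           # generalized additive models

data("Credit")
set.seed(15)
Credit <- Credit[, -1]                     # remove ID column
train <- sample(nrow(Credit), 300)
Credit.train <- Credit[train, ]
Credit.test <- Credit[-train, ]

# Hypothetical helper: mean squared error of predictions
mse <- function(pred, obs) mean((pred - obs)^2)
y.test <- Credit.test$Balance

# (a)-(d) Unpruned regression tree
tree.fit <- tree(Balance ~ ., data = Credit.train)
print(summary(tree.fit))                   # residual mean deviance, number of terminal nodes
print(tree.fit)                            # detailed text output, one line per node
plot(tree.fit); text(tree.fit, pretty = 0)
mse.tree.train <- mse(predict(tree.fit), Credit.train$Balance)
mse.tree.test <- mse(predict(tree.fit, newdata = Credit.test), y.test)

# (e)-(h) Cross-validation and pruning
cv.fit <- cv.tree(tree.fit)
plot(cv.fit$size, cv.fit$dev, type = "b",
     xlab = "Tree size", ylab = "Cross-validated deviance")
n.leaves <- sum(tree.fit$frame$var == "<leaf>")
best.size <- cv.fit$size[which.min(cv.fit$dev)]
if (best.size >= n.leaves) best.size <- 5  # fallback required by part (h)
prune.fit <- prune.tree(tree.fit, best = best.size)

# (i)-(j) Training and test MSE of the pruned tree
mse.prune.train <- mse(predict(prune.fit), Credit.train$Balance)
mse.prune.test <- mse(predict(prune.fit, newdata = Credit.test), y.test)

# (k)-(l) Bagging: a random forest that considers all predictors at each split
p <- ncol(Credit.train) - 1
bag.fit <- randomForest(Balance ~ ., data = Credit.train,
                        mtry = p, ntree = 1000, importance = TRUE)
print(importance(bag.fit))
mse.bag <- mse(predict(bag.fit, newdata = Credit.test), y.test)

# (m)-(n) Random forest with the default mtry for regression
rf.fit <- randomForest(Balance ~ ., data = Credit.train,
                       ntree = 1000, importance = TRUE)
print(importance(rf.fit))
mse.rf <- mse(predict(rf.fit, newdata = Credit.test), y.test)

# (o)-(p) Boosting with 1,000 trees and shrinkage 0.01
boost.fit <- gbm(Balance ~ ., data = Credit.train, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.01)
print(summary(boost.fit, plotit = FALSE))  # relative influence of each predictor
mse.boost <- mse(predict(boost.fit, newdata = Credit.test, n.trees = 1000), y.test)

# (q) GAM: an assumed specification, not the only reasonable one
gam.fit <- gam(Balance ~ s(Income) + s(Limit) + s(Rating) + s(Age) + Cards +
                 Education + Gender + Student + Married + Ethnicity,
               data = Credit.train)
mse.gam <- mse(predict(gam.fit, newdata = Credit.test), y.test)

# (r) Test MSEs side by side, smallest first
test.mse <- c(unpruned = mse.tree.test, pruned = mse.prune.test, bagging = mse.bag,
              random.forest = mse.rf, boosting = mse.boost, gam = mse.gam)
print(sort(test.mse))
```

Because pruning removes splits from a tree grown on the training data, the pruned tree's training MSE cannot be lower than that of the unpruned tree; the test MSE comparison has no such guarantee and has to be read off the output.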