Question
Please provide an R script from RStudio.
In Homework 4, you will use machine learning models such as LASSO and a regression tree to predict the number of applications a large college can expect to receive. You will assess the models by comparing the accuracy of LASSO and the regression tree, and select the model that gives the best accuracy on the test data set.
Work cooperatively on the following tasks in your groups, making sure to divide the work evenly and document who contributed what.
Tasks
Download Data Set
I have attached the College data set for this assignment. See below for description of College data set available at ISLR Library:
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
A data frame with 777 observations on the following 18 variables.
Private: A factor with levels No and Yes indicating private or public university
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Pct. new students from top 10% of H.S. class
Top25perc: Pct. new students from top 25% of H.S. class
F.Undergrad: Number of fulltime undergraduates
P.Undergrad: Number of parttime undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Pct. of faculty with Ph.D.s
Terminal: Pct. of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
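As a quick check on the description above, the data set can be loaded and inspected before any modeling. This is a minimal sketch that assumes the College data comes from the ISLR package rather than from the attached file; if the attachment is a CSV, it would need to be read in instead.

```r
# Assumption: the ISLR package is installed and supplies the College data set.
library(ISLR)

data(College)

# Confirm the dimensions stated in the description: 777 rows, 18 columns.
dim(College)

# Inspect variable types; Private is a factor, the rest are numeric.
str(College)

# Check for missing values before fitting any model.
sum(is.na(College))
```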
We will predict the number of applications received (Apps) from all other variables in the College data set, fit LASSO and regression tree models, and compare their performance by test MSE.
Part 1: LASSO (50 points)
Predict the number of applications received (Apps) from all other variables in the College data set, using a LASSO model for variable selection:
(a) Split the data set randomly into a training and a test data set. (5 points)
(b) Fit a LASSO model using the glmnet() function on the training data set. (5 points)
(c) Perform cross-validation on the training data set to choose the best lambda. (5 points)
(d) Estimate the predicted values on the test data using the best lambda obtained in part (c) (using the predict() function) and compute the test MSE. (10 points)
(e) Compare the LASSO test MSE with the test MSE of the null model (lambda = infinity) and of the least squares regression model (lambda = 0). (10 points)
(f) Now fit the LASSO model on the entire data set, obtain the LASSO coefficients using the best lambda obtained in part (c), and report the number of non-zero coefficient estimates. (7 points)
(g) Now use the predictors selected by LASSO in part (f) to fit a linear regression model and report the summary of the linear model. (8 points)
Hint: You can refer to the program for "P2_LASSO_HittersData_OVERVIEW" as a guideline for the assignment Part 1.
Note: Your grade will be based on the accuracy of the code. You do not have to provide any written explanation for Part 1 (a) - (g) in the code.
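The steps in Part 1 could be sketched roughly as follows. This is one possible outline, not the course's reference solution: it assumes the ISLR and glmnet packages are installed, uses an arbitrary seed and a 50/50 split, and fits the lambda = 0 comparison with lm() rather than with glmnet(). All object names (train, grid, best.lambda, and so on) are illustrative.

```r
# Assumptions: ISLR and glmnet are installed; seed and split ratio are arbitrary.
library(ISLR)    # College data set
library(glmnet)  # glmnet() and cv.glmnet()
data(College)

# Part 1(a): random train/test split
set.seed(1)
n <- nrow(College)
train <- sample(n, floor(n / 2))
test <- setdiff(seq_len(n), train)

# glmnet needs a numeric matrix; drop the intercept column it adds itself
x <- model.matrix(Apps ~ ., data = College)[, -1]
y <- College$Apps

# Part 1(b): LASSO (alpha = 1) on the training data over a grid of lambdas
grid <- 10^seq(10, -2, length = 100)
lasso.fit <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

# Part 1(c): cross-validation on the training data to choose lambda
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
best.lambda <- cv.out$lambda.min

# Part 1(d): test predictions and test MSE at the chosen lambda
lasso.pred <- predict(lasso.fit, s = best.lambda, newx = x[test, ])
mse.lasso <- mean((lasso.pred - y[test])^2)

# Part 1(e): null model (predict the training mean) and least squares via lm()
mse.null <- mean((mean(y[train]) - y[test])^2)
ols.fit <- lm(Apps ~ ., data = College, subset = train)
mse.ols <- mean((predict(ols.fit, newdata = College[test, ]) - y[test])^2)
print(c(lasso = mse.lasso, null = mse.null, ols = mse.ols))

# Part 1(f): refit on the full data and count non-zero coefficients
lasso.full <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(lasso.full, type = "coefficients", s = best.lambda)[, 1]
n.nonzero <- sum(lasso.coef[-1] != 0)  # excludes the intercept
print(n.nonzero)

# Part 1(g): linear regression on the LASSO-selected predictors
selected <- names(lasso.coef)[-1][lasso.coef[-1] != 0]
selected <- sub("^PrivateYes$", "Private", selected)  # dummy name back to factor
lm.fit <- lm(reformulate(selected, response = "Apps"), data = College)
summary(lm.fit)
```

The exact MSE values and the number of non-zero coefficients depend on the seed and the split, so they will differ between runs with different settings.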
Part 2: Regression Tree (50 points)
Predict the number of applications received (Apps) from all other variables in the College data set using a regression tree.
Perform the following tasks, using the training and test data sets that you created in Part 1(a):
(a) Fit a regression tree to the training data, with Apps as the response and all other variables as predictors. Use the summary() function to produce summary statistics about the tree. Note how many terminal nodes the tree has. (5 points)
(b) Type the name of the tree object to get a detailed text output. (5 points)
(c) Create a plot of the tree. (Hint: use the plot() and text() functions) (5 points)
(d) Now apply the cross-validation function cv.tree() to the training data set to see whether pruning the tree will improve performance (that is, to determine the optimal tree size). (5 points)
(e) Produce a plot with tree size on the x-axis and cross-validated error (the deviance, for a regression tree) on the y-axis. (Hint: use the plot() function) (5 points)
(f) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation in parts (d) and (e). If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with eight terminal nodes. (5 points)
(g) Compute and compare the test MSE of the pruned and unpruned trees. (10 points)
(h) Compare the two test MSEs from part (g) (pruned and unpruned trees) with the LASSO test MSE obtained in Part 1(d). (10 points)
Note: Part 2(h) requires a written explanation. Provide your answer in the R script at the end of the program. Parts 2(a) - 2(g) do not require any explanation. Your grade will be based on the execution of the code.
Hint: You can refer to the program for P2_RegressionTree_HittersData.R as a guideline for the assignment Part 2.
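Parts 2(a) through 2(g) might be outlined as below. This is a hedged sketch rather than the expected solution: it assumes the ISLR and tree packages are installed, recreates a 50/50 split with the same arbitrary seed so the block runs on its own, and uses illustrative object names. Part 2(h) then amounts to comparing the two tree MSEs with the LASSO test MSE and writing the explanation as comments.

```r
# Assumptions: ISLR and tree are installed; seed and split ratio are arbitrary
# and should match whatever was used in Part 1(a).
library(ISLR)  # College data set
library(tree)  # tree(), cv.tree(), prune.tree()
data(College)

set.seed(1)
n <- nrow(College)
train <- sample(n, floor(n / 2))
college.test <- College[-train, ]

# Part 2(a): regression tree on the training data
tree.college <- tree(Apps ~ ., data = College, subset = train)
summary(tree.college)  # reports the number of terminal nodes

# Part 2(b): detailed text output
tree.college

# Part 2(c): plot the tree with split labels
plot(tree.college)
text(tree.college, pretty = 0)

# Part 2(d): cross-validation over the cost-complexity pruning sequence
cv.college <- cv.tree(tree.college)

# Part 2(e): tree size against cross-validated deviance
plot(cv.college$size, cv.college$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV deviance")

# Part 2(f): prune to the CV-optimal size; if CV keeps the full tree,
# fall back to eight terminal nodes as the assignment instructs
n.leaves <- sum(tree.college$frame$var == "<leaf>")
best.size <- cv.college$size[which.min(cv.college$dev)]
if (best.size >= n.leaves) {
  best.size <- min(8, n.leaves)
}
prune.college <- prune.tree(tree.college, best = best.size)

# Part 2(g): test MSE for the unpruned and pruned trees
pred.unpruned <- predict(tree.college, newdata = college.test)
pred.pruned <- predict(prune.college, newdata = college.test)
mse.unpruned <- mean((pred.unpruned - college.test$Apps)^2)
mse.pruned <- mean((pred.pruned - college.test$Apps)^2)
print(c(unpruned = mse.unpruned, pruned = mse.pruned))
```

When run with Rscript instead of interactively in RStudio, the plots are written to a PDF file in the working directory instead of a plot pane.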
Deliverables
Submit one R program (one file) containing all parts of the assignment, with comments that clearly separate each part.
The R code should include comments stating which section of the assignment each block of code addresses. For example, use the comment symbol # to mark Part 1 and Part 2 and to label each subpart (a) - (h) clearly. Also indicate which team member contributed to which sections of the code.