
Question


Please provide Rscript from Rstudio.

In Homework 4, you will use machine learning models, namely LASSO and a regression tree, to predict the number of applications a college can expect to receive. You will assess the models by comparing the accuracy of LASSO and the regression tree, and select the model that is most accurate on the test data set.

Work cooperatively on the following tasks in your groups, making sure to divide the work evenly and document who contributed what.

Tasks

Download Data Set

I have attached the College data set for this assignment. The same data set is available in the ISLR library, which describes it as follows:

Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.

A data frame with 777 observations on the following 18 variables.

Private: A factor with levels No and Yes indicating private or public university
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Pct. new students from top 10% of H.S. class
Top25perc: Pct. new students from top 25% of H.S. class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Pct. of faculty with Ph.D.s
Terminal: Pct. of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate

We will predict the number of applications received (Apps) from all other variables in the College data set, fit LASSO and regression tree models, and compare their performance by test MSE.
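As a starting point, the data can be loaded and checked against the description above. This is a minimal sketch that assumes the ISLR package is installed and that the attached file matches the `College` data frame shipped with it; if you work from the attached file instead, read it in with `read.csv()` and adjust accordingly.

```r
# Assumption: the ISLR package is installed (install.packages("ISLR")).
library(ISLR)

data(College)          # load the College data frame into the workspace

dim(College)           # the description above states 777 rows and 18 columns
str(College)           # Private is a factor; all other variables are numeric
sum(is.na(College))    # count missing values before fitting any model
```

Apps is the response in both parts of the assignment, so it is worth confirming here that it is numeric and that no rows need to be dropped.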

Part 1: LASSO (50 points)

Predict the number of applications received (Apps) from all other variables in the College data set, using a LASSO model for variable selection:

(a) Split the data set randomly into a training and a test data set. (5 points)

(b) Fit a LASSO model to the training data set using the glmnet() function. (5 points)

(c) Perform cross-validation on the training data set to choose the best lambda. (5 points)

(d) Using the best lambda obtained in part (c), compute predicted values on the test data (using the predict() function) and compute the test MSE. (10 points)

(e) Compare the LASSO test MSE with the test MSE of the null model (lambda = infinity) and of the least squares regression model (lambda = 0). (10 points)

(f) Now fit the LASSO model to the entire data set, obtain the LASSO coefficients using the best lambda obtained in part (c), and report the number of non-zero coefficient estimates. (7 points)

(g) Now use the predictors selected by the LASSO in part (f) to fit a linear regression model and report the summary of the linear model. (8 points)

Hint: You can refer to the program for "P2_LASSO_HittersData_OVERVIEW" as a guideline for the assignment Part 1.

Note: Your grade will be based on the accuracy of the code. You do not have to provide any written explanation for Part 1 (a) - (g) in the code.
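The Part 1 steps can be sketched roughly as follows. This is one possible outline, not the course's reference solution: the 50/50 split, the seed, the use of `lambda.min` as the "best" lambda, and all object names are assumptions made for illustration.

```r
# Assumptions: ISLR and glmnet are installed; seed and split ratio are arbitrary.
library(ISLR)
library(glmnet)

data(College)
set.seed(1)

# (a) Random 50/50 split into training and test row indices
n <- nrow(College)
train <- sample(seq_len(n), size = floor(n / 2))
test <- setdiff(seq_len(n), train)

# glmnet() needs a numeric matrix: model.matrix() dummy-codes Private,
# and [, -1] drops the intercept column it adds
x <- model.matrix(Apps ~ ., data = College)[, -1]
y <- College$Apps

# (b) LASSO fit on the training data (alpha = 1 selects the LASSO penalty)
lasso.fit <- glmnet(x[train, ], y[train], alpha = 1)

# (c) Cross-validation on the training data to choose lambda
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam <- cv.out$lambda.min

# (d) Test MSE at the chosen lambda
lasso.pred <- predict(lasso.fit, s = bestlam, newx = x[test, ])
lasso.mse <- mean((lasso.pred - y[test])^2)

# (e) Null model (predict the training mean) and least squares for comparison
null.mse <- mean((mean(y[train]) - y[test])^2)
ols.fit <- lm(Apps ~ ., data = College[train, ])
ols.mse <- mean((predict(ols.fit, newdata = College[test, ]) - y[test])^2)
print(c(lasso = lasso.mse, null = null.mse, ols = ols.mse))

# (f) Refit on the full data and count non-zero coefficients at bestlam
full.fit <- glmnet(x, y, alpha = 1)
coef.vec <- as.matrix(predict(full.fit, type = "coefficients", s = bestlam))[, 1]
n.nonzero <- sum(coef.vec[-1] != 0)   # [-1] excludes the intercept
print(n.nonzero)

# (g) Linear model on the predictors the LASSO kept
selected <- names(coef.vec)[-1][coef.vec[-1] != 0]
lm.df <- data.frame(Apps = y, x[, selected, drop = FALSE])
post.fit <- lm(Apps ~ ., data = lm.df)
summary(post.fit)
```

The model matrix columns are reused in step (g) so that the dummy variable for Private (named `PrivateYes` by `model.matrix()`) can be passed to `lm()` directly if the LASSO keeps it.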

Part 2: Regression Tree (50 points)

Predict the number of applications received Apps using all other variables in the College data set based on a Regression Tree:

Perform the following tasks using the training and test data sets that you created in Part 1(a).

(a) Fit a regression tree to the training data, with Apps as the response and all other variables as predictors. Use the summary() function to produce summary statistics about the tree. Note how many terminal nodes the tree has. (5 points)

(b) Type in the name of the tree object to get a detailed text output. (5 points)

(c) Create a plot of the tree. (Hint: use the plot() and text() functions.) (5 points)

(d) Now apply the cross-validation function cv.tree() to the training data set to determine the optimal tree size and see whether pruning the tree will improve performance. (5 points)

(e) Produce a plot with tree size on the x-axis and cross-validated error (deviance) on the y-axis. (Hint: use the plot() function.) (5 points)

(f) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation in parts (d) and (e). If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with eight terminal nodes. (5 points)

(g) Compute the test MSE for both the pruned and the unpruned tree. (10 points)

(h) Compare the two test MSEs from part (g) (pruned and unpruned trees) with the LASSO test MSE obtained in Part 1(d). (10 points)

Note: Part 2(h) requires a written explanation. Provide your answer in the R script at the end of the program. Parts 2(a) - 2(g) do not require any explanation. Your grade will be based on the execution of the code.

Hint: You can refer to the program for P2_RegressionTree_HittersData.R as a guideline for the assignment Part 2.
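The Part 2 steps might be outlined as below. This is a hedged sketch that assumes the tree package is installed and repeats the same seed and 50/50 split used for illustration in Part 1, so that both models are evaluated on the same test rows; the object names are hypothetical.

```r
# Assumptions: ISLR and tree are installed; the split mirrors Part 1(a).
library(ISLR)
library(tree)

data(College)
set.seed(1)
n <- nrow(College)
train <- sample(seq_len(n), size = floor(n / 2))
test <- setdiff(seq_len(n), train)
y.test <- College$Apps[test]

# (a) Regression tree on the training rows, with summary statistics
tree.fit <- tree(Apps ~ ., data = College, subset = train)
summary(tree.fit)
n.leaves <- sum(tree.fit$frame$var == "<leaf>")   # number of terminal nodes

# (b) Detailed text output of the tree
print(tree.fit)

# (c) Plot of the tree
plot(tree.fit)
text(tree.fit, pretty = 0)

# (d) Cross-validation over the cost-complexity pruning sequence
cv.out <- cv.tree(tree.fit)

# (e) Tree size against cross-validated deviance
plot(cv.out$size, cv.out$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV deviance")

# (f) Prune to the size with the lowest CV deviance; if that is the
#     full tree, fall back to eight terminal nodes as the assignment says
best.size <- cv.out$size[which.min(cv.out$dev)]
if (best.size >= n.leaves) best.size <- 8
pruned.fit <- prune.tree(tree.fit, best = best.size)

# (g) Test MSE of the unpruned and pruned trees
unpruned.mse <- mean((predict(tree.fit, newdata = College[test, ]) - y.test)^2)
pruned.mse <- mean((predict(pruned.fit, newdata = College[test, ]) - y.test)^2)
print(c(unpruned = unpruned.mse, pruned = pruned.mse))

# (h) Compare these two values with the LASSO test MSE from Part 1(d)
#     and write the explanation as comments at the end of the script.
```

Because `cv.tree()` assigns folds at random, the selected size can change with the seed, so report the seed alongside the results.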

Deliverables

Submit one R program (one file) containing all parts of the assignment (mark/comment so that each part is separated clearly in the program).

The R code should include comments identifying which section of the assignment each block of code addresses. For example, use the comment symbol # to mark which of Parts 1 and 2 the code belongs to, and mark each subpart (a) - (h) clearly. Also indicate which team member contributed to which sections of the code.
