Question
In this homework we will consider the problem of measuring body fat to practice model selection and obtain predictive models. To measure body fat accurately, a measure of body density is typically used to estimate the proportion of fat in the body via the Brozek or Siri equations. However, determining body density in terms of weight and volume can be difficult.
To avoid this problem, a method that can estimate body fat from indirect measures (that are easy to obtain) is desired. To this end, researchers recorded age, weight, height and 10 body circumference measurements for 252 men. For each subject, an accurate estimate of body density was obtained using an underwater submersion method, and body fat was determined using the Brozek and Siri equations. The dataset with all these measurements is available in the file "fat.csv".
1. Read the data set in R and name it fat. Print the first 10 observations using the function head(.) (you can use the command help(head) to learn how to modify the number of observations printed in the output). The first two columns correspond to the values for the Brozek and Siri equations obtained using the measurement of density in the third column. All other columns are potential predictors for this problem.
(a) Use the command dim(data) to learn the dimensions of your data: the first value is the number of rows (number of observations) and the second value is the number of columns (number of variables, including responses and predictors).
(b) We will split the data into a "training set" and a "testing set" to further assess the performance of the models we will fit. To this end, we will remove every tenth observation by running the code below:
n <- nrow(fat)
remove.ind <- seq(10, n, by=10)
test <- fat[remove.ind, ]
train <- fat[-remove.ind, ]
We will use the training data in questions 2-4 and the testing data in question 5.
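As a quick sanity check on what the split does, here is a minimal sketch on a toy data frame with 252 rows (the same number of subjects as in the problem). The data frame `toy` and its columns `id` and `x` are placeholders made up for illustration, not the columns of fat.csv.

```r
# Toy stand-in with 252 rows; 'id' and 'x' are hypothetical columns.
set.seed(1)
toy <- data.frame(id = 1:252, x = rnorm(252))

n.toy <- nrow(toy)
remove.ind <- seq(10, n.toy, by = 10)   # rows 10, 20, ..., 250

toy.test  <- toy[remove.ind, ]    # every tenth row
toy.train <- toy[-remove.ind, ]   # everything else

nrow(toy.test)    # 25
nrow(toy.train)   # 227
```

With 252 rows, the split therefore leaves 227 observations for fitting and 25 for testing.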
2. Using the training data, fit a model called "full" using siri as the response variable and all other columns (with the exception of brozek and density) as predictors. Print the summary output of this model and comment on the results.
3. Install and load the package olsrr to run stepwise selection procedures.
(a) Run the command below to do forward selection
# Use help to learn more about the arguments in the function
forwardmod <- ols_step_forward_p(full, penter=0.05, progress=FALSE)
summary(forwardmod$model)
(b) Run the command below to do backward elimination
# Use help to learn more about the arguments in the function
backmod <- ols_step_backward_p(full, prem=0.05, progress=FALSE)
summary(backmod$model)
(c) Do the selected variables differ between the backward elimination and forward selection procedures? Type forwardmod and backmod to learn the number of steps taken by each approach. Explain the difference in the first step of each approach and how the first variable was entered or removed, depending on the case.
4. Install and load the package leaps for model selection criteria. Run the code below to answer the following questions:
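To see what a single first step looks like in each direction, the following is a rough base-R sketch on R's built-in mtcars data (not on fat.csv). It uses add1() and drop1() with partial F-tests. This is an illustration of the general p-value-based idea, assumed to be similar in spirit to what the olsrr functions do, and not a reproduction of their internals.

```r
# Illustration on the built-in mtcars data, not the body fat data.
null.fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only model
full.fit <- lm(mpg ~ ., data = mtcars)   # model with every predictor

# Forward direction: test each candidate for entry into the empty model.
# The candidate with the smallest p-value would be entered first.
fwd <- add1(null.fit, scope = formula(full.fit), test = "F")
first.in <- rownames(fwd)[which.min(fwd[["Pr(>F)"]])]

# Backward direction: test each term for removal from the full model.
# The term with the largest p-value would be removed first.
bwd <- drop1(full.fit, test = "F")
first.out <- rownames(bwd)[which.max(bwd[["Pr(>F)"]])]

first.in
first.out
```

The forward step judges each variable on its own, while the backward step judges each variable after adjusting for all the others, which is why the two procedures need not agree.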
p <- ncol(train) - 3  # response, brozek, density not predictors
all <- regsubsets(siri ~ . - brozek - density, data=train, nvmax=p)
(a) Compute the AIC values by running the command
calc.AIC <- n * log(summary(all)$rss / n) + 2*(2:(p+1))
Then, run the command summary(all)$which[which.min(calc.AIC),]. The output will list all the variables and will flag as TRUE those that are included in the model that yields the smallest AIC. Fit a model called "min.AIC" including these variables and print the summary output.
(b) Run the command summary(all)$which[which.min(summary(all)$bic),] to determine the variables in the model that yields the smallest BIC. Fit a model called "min.BIC" including these variables, and print the summary output.
(c) Run the command summary(all)$which[which.min(summary(all)$cp),] to determine the variables in the model that yields the smallest Cp. Fit a model called "min.cp" including these variables, and print the summary output.
(d) Do these criteria lead to the same models? If not, explain the differences that you observe.
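The AIC expression used in 4(a) has the form n * log(RSS / n) + 2k, where k counts the estimated coefficients (intercept plus predictors), which is why the penalty runs over 2:(p+1). As a small check of that formula, the sketch below computes it by hand for one model on the built-in mtcars data and compares it with base R's extractAIC(). The model and variable names are chosen only for illustration.

```r
# One illustrative model on built-in data (not the body fat data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n.obs <- nrow(mtcars)          # observations used in the fit
rss   <- sum(resid(fit)^2)     # residual sum of squares
k     <- length(coef(fit))     # intercept + 2 predictors

aic.by.hand <- n.obs * log(rss / n.obs) + 2 * k
aic.builtin <- extractAIC(fit)[2]   # second element is the AIC value

all.equal(aic.by.hand, aic.builtin)
```

Here n.obs is the number of observations used to fit the model. In the assignment's code, n was defined earlier as nrow(fat), while the subsets are fit on train; the problem statement does not say whether that is intended.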
5. Finally, we can compare the performance of these models in terms of prediction by using the testing set. Run the command truth <- test$siri to retrieve the observed values for the variable siri in the testing set.
Then, for each one of the 6 models fitted in this homework, run the command predict(modelname, newdata=test) to obtain the predicted values for the variable siri in terms of the predictor values in the testing set. For example, to do prediction based on the model "min.AIC" you will run the command pred.AIC <- predict(min.AIC, newdata=test).
(a) Obtain the root mean square error (RMSE) to measure the difference between the predicted values and the observed values in the testing set for each one of the models considered in this exercise. For example, to obtain the RMSE for the model "min.AIC" you can run the command rmse.AIC <- sqrt(mean((truth-pred.AIC)^2)).
(b) Based on the RMSE values obtained in part (a), which model(s) do you think do a better job in prediction? Explain.
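Since the same RMSE calculation is repeated for every model, one option is to wrap it in a small helper and apply it to a named list of prediction vectors. The sketch below uses made-up numbers; the function name rmse and the vectors toy.truth and toy.preds are hypothetical and stand in for truth and the six vectors returned by predict().

```r
# Hypothetical helper: root mean square error between two numeric vectors.
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Made-up observed values and two sets of made-up predictions.
toy.truth <- c(1, 2, 3)
toy.preds <- list(
  modelA = c(1, 2, 5),   # errors 0, 0, 2
  modelB = c(1, 2, 3)    # perfect predictions
)

# One RMSE per model, returned as a named numeric vector.
toy.rmse <- sapply(toy.preds, rmse, truth = toy.truth)
toy.rmse
```

A smaller RMSE means the predictions sit closer, on average, to the observed values in the testing set.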
Dataset provided in the link below:
https://drive.google.com/file/d/10CxO7vJhIEohF6xdV68XELyPWu-FnyBT/view?usp=sharing