Question

1 Approved Answer

Posted on May 19, 2024

--title: STAT270 - Project 3 - Modeling with Regression author: Antoin richmond date: April 4, 2020 output: html_document: default --```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)

--title: "STAT270 - Project 3 - Modeling with Regression" author: Antoin richmond date: "April 4, 2020" output: html_document: default --```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r packages, warning = FALSE, message = FALSE} # load the packages for graphing and data wrangling library(ggplot2) library(PASWR2) library(car) # install the car package first, if not already installed library(dplyr) library(lattice) library(boot) library(MASS) ``` Note: If you `Rmd` file submission knits you will receive total of **(5 points)** ## Case Study: Real Estate Data and ideas for this case study come from (*Militino et al., 2004*). **Problem 16, page 899** (**not entire problem**, only specified parts below) The goal of this case study is to walk the user through the creation of a parsimonious multiple linear regression model that can be used to predict the total price (totalprice) of apartments by their hedonic (structural) characteristics. The data frame `VIT2005` contains several variables, and further description of the data can be found in the help file (listed below). A data frame with 218 observations on the following 5 variables: * `totalprice` (the market total price (in Euros) of the apartment including garage(s) and storage room(s)) * `area` (the total living area of the apartment in square meters) * `zone` (a factor indicating the neighborhood where the apartment is located with levels `Z11, Z21, Z31, Z32, Z34, Z35, Z36, Z37, Z38, Z41, Z42, Z43, Z44, Z45, Z46, Z47, Z48, Z49, Z52, Z53, Z56, Z61, and Z62`) * `category` (a factor indicating the condition of the apartment with levels `2A, 2B, 3A, 3B, 4A, 4B`, and `5A` ordered so that `2A` is the best and `5A` is the worst) * `age` (age of the apartment in years) * `floor` (floor on which the apartment is located) * `rooms` (total number of rooms including bedrooms, dining room, and kitchen) * `out` (a factor indicating the percent of the apartment exposed to the elements: The levels `E100, E75, E50, and E25`, correspond to complete exposure, `75`% exposure, `50`% exposure, and `25`% exposure, respectively.) * `conservation` (is an ordered factor indicating the state of conservation of the apartment. The levels `1A, 2A, 2B, and 3A` are ordered from best to worst conservation.) * `toilets` (the number of bathrooms) * `garage` (the number of garages) * `elevator` (indicates the absence (0) or presence (1) of elevators.) * `streetcategory` (an ordered factor from best to worst indicating the category of the street with levels `S2, S3, S4, and S5`) * `heating` (a factor indicating the type of heating with levels `1A, 3A, 3B, and 4A` which correspond to: no heating, low- standard private heating, high-standard private heating, and central heating, respectively.) * `storage` (the number of storage rooms outside of the apartment) ### Complete the parts below **(10 pts)** **Quiz-Project 3 Pr.1** (a) Characterize the shape, center, and spread of the variable `totalprice`. Hint: Use `ggplot` function from `ggplot2` package to graph the `totalprice` density function. Use `median` and `IQR` to find the median and IQR for the `totalprice`. **Solution:** Hint: Use the template for the graph: `ggplot(data = YOUR DATA, aes(x = variable to plot)) + geom_density(fill = "your favorite color") + theme_bw()` YOUR CODE HERE: ```{r} # MD <- median(VIT2005$totalprice) # # iqr <- IQR(VIT2005$totalprice) # # c(MD, iqr) ``` **Observation:** The distribution of `totalprice` is ... with a median of `...` and an `IQR` of `...` ################################################################# ######################################## **(10 pts)** **Quiz-Project 3 Pr.2** (b) Use `scatterplotMatrix()` from `car` package or `pairs()` to explore the relationships between totalprice and the numerical explanatory variables `area, age, floor, rooms, toilets, garage, elevator`, and `storage`. Hint: To use `scatterplotMatrix` type `scatterplotMatrix( ~ totalprice + var1 + var2 + ... + var n, data = VIT2005)`, do not use more than 5 variables to produce input that fits the screen and can be reviewed. Use the command as many times as you need to review how `totalprice ` correlates with other variables in the data. **Solution:** YOUR CODE HERE: ```{r} # e.g. matrix can be produced with scatterplotMatrix( ~ totalprice + area + age + floor + rooms, data = VIT2005) # or use # pairs(~ totalprice + area + age + floor + rooms, data = VIT2005) # # pairs(~totalprice + toilets + garage + elevator + storage, data = VIT2005) ``` Observation: The variable `totalprice` appears to have a moderate linear relationship with `area`. (c) **Total of (55 pts)** **Quiz-Project 3** Pr.3 to 8, 10 pts for each correctly removed variable, 5 pts to find the correlation. Compute the correlation between `totalprice` and all of the other numerical variables. List the **three** variables in order along with their correlation coefficients that have the highest correlation with totalprice. #### **Model (A)**: Use backward elimination to develop a model that predicts totalprice using the data frame `VIT2005`. Use a of `5`%. Store the final model in the object `modelA`. **(5 pts)** **Quiz-Project 3 Pr.3** The correlation coefficients are: ```{r} NUM <- c("area", "age", "floor", "rooms", "toilets", "garage","elevator", "storage") COR <- cor(VIT2005[, "totalprice"], VIT2005[, NUM]) COR ``` **Observation:** The highest three correlations with `totalprice` occur with `area` (0.8092), `toilets` (0.6876), and `rooms`(0.5256). **Model (A)** The functions `drop1()` and `update()` are used to create a model using backward elimination. **Hint:** `drop1(model_.be_name, test = "F")` test for significance of all individual predictors given all others are already in the model. ```{r} # use a model where totalprice is regressed on all other variables model.be <- lm(totalprice ~ ., data = VIT2005) # drop1(model.be, test = "F") ``` **(10 pts)** (part c.1) **Quiz-Project 3 Pr.4** Which one appears most **insignificant** (biggest `P-value`)? Drop it from the model. E.g. If `age` is most insignificant, use the `update` function to update the model with the following code: `model.be <- update(model.be, .~. - age)` - the first dot means we use the same response variable, the second all previously included predictors minus `age`. The again use `drop1` and so on, until all remaining variables are of significance `0.05` as specified in the directions of the problem YOUR CODE HERE: ```{r} model.be <- update(model.be, .~. - age) drop1(model.be, test = "F") ``` **(10 pts)** (part c.2) **Quiz-Project 3 Pr.5** Which one is to be dropped next? Answer: ```{r} ``` **(10 pts)** (part c.3) **Quiz-Project 3 Pr.6** Which one is to be dropped next? Answer: ```{r} ``` **(10 pts)** (part c.4) **Quiz-Project 3 Pr.7** Which one is to be dropped next? Answer: ```{r} ``` **(10 pts)** (part c.5) **Quiz-Project 3 Pr.8** Which one is to be dropped next? Answer: ```{r} ``` If all variable are significant create the object holding the optimal model as per the description of the problem using the code: ```{r} formula(model.be) modelA <- lm(formula(model.be), data = VIT2005) ``` **Question 1: (5 pts)** **Quiz-Project 3 Pr.9** Which variable are left in the model, list them? Observation: Backward elimination suggests using the variables `area`, `zone`, `category`, `out`, `toilets`, `garage`, `elevator`, `streetcorner`, and `heating` to best predict `totalprice`. (i) **(5 pts)** **Quiz-Project 3 Pr.10** Compute $CV_n$, the leave-one-out cross validation error, for `modelA`. Set the seed to `5` and compute $CV_5$, the **five-fold** cross validation error, for `modelA`. The cross validation error for a generalized linear model can be computed using the `cv.glm()` from the `boot` package. Using the function `glm()` without passing a family argument is the same as using the function `lm()`. `R Code 12.3` provides a template of how to use the `cv.glm()` function. Note that $CV_n$ is returned with `cv.error$delta[1]`. To compute $CV_5$, pass the value `5` to the argument `K` inside the `cv.glm()` function. ```{r R Code 12.3} # # the code will not work, it is only template # mod.glm <- glm(y ~ x1 + x2, data = DF) # library(boot) # # cv.error <- cv.glm(data = DF, glmfit = mod.glm) # cv.error$delta[1] ``` (ii) **(5 pts)** **Quiz-Project 3 Pr.11** Compute $R^2$, $R^2_a$, the `AIC`, and the `BIC` for **Model (A)**. What is the proportion of total variability explained by **Model (A)**? **Your Solution:** (i) ```{r} modelAg <- glm(formula(model.be), data = VIT2005) # # Fro cv.glm - the default is to set K equal to the number of observations in data which gives the usual leave-one-out crossvalidation. # # cv.errorN <- cv.glm(VIT2005, modelAg) # # CVNa <- cv.errorN$delta[1] # CVNa ``` ```{r} # set.seed(5) # use for replication purposes # cv.error5 <- cv.glm(VIT2005, modelAg, K = 5) # CV5a <- cv.error5$delta[1] # CV5a ``` Observation: The $CV_n = ...$ for Model (A), and $CV_5 =...$ for Model (A). (ii) Since this problem and a few more will request many goodness of fit statistics, a function called `mgof()` is written to compute the requested values. Use it as shown below. ```{r} mgof <- function(model = model, data = DF, ...){ R2a <- summary(model)$adj.r.squared R2 <- summary(model)$r.squared aic <- AIC(model) bic <- AIC(model, k = log(nrow(data))) se <- summary(model)$sigma form <- formula(model) ANS <- c(R2 = R2, R2.adj = R2a, AIC = aic, BIC = bic, SE = se) ANS } MGOF <- mgof(model = modelA, data = VIT2005) MGOF ``` Observation: The total proportion of variability explained by modelA is `0.9138`. (d) **(10 pts)** **Quiz-Project 3 Pr.12** Explore the residuals of the Models (A) using the function `residualPlot()` or `residualPlots()` from the package `car`. Comment on the results. (**Diagnostics: Checking the model assumptions**) ```{r} # # uncomment to run the residual plot for the modelA # residualPlot(modelA, main = "Model A") ``` **(d) Question 2:** Is there curvature of the residuals on the above plot? Observation: The residuals versus the fitted values for Model (A) have a definite curvature indicating the model is not quite adequate. **Extra Credit** (10 pts) **Quiz-Project 3 Pr.13** **(e)** Use the function `boxCox()` from car to find a suitable transformation for totalprice. Build Model (E) Use backward elimination to develop a model that predicts `log(totalprice)` using the data frame `VIT2005`. Use a of 5%. Store the final model in the object `modelE`. For details on `boxCox()` see page 856 in text ```{r} boxCox(modelA, lambda = seq(-0.5, 0.5, length = 200)) ``` #### Observation: A `log` transformation is suggested for the response `totalprice` in **Model (A)**. **Extra Credit** (20 pts) **Quiz-Project 3 Pr.14** - 4 pts for excluding the correct variable at each step. #### **Model (E)**: Use backward elimination to develop a model that predicts `log(totalprice)` using the data frame `VIT2005`. Use a of `5`%. Store the final model in the object `modelE`. **Model (E)** The functions `drop1()` and `update()` are used to create a model using backward elimination. (as shown at the beginning) ```{r} # transform the response variable VIT2005$logtotalprice <- log(VIT2005$totalprice) # build the fitted full model model.be <- lm(logtotalprice ~ ., data = VIT2005[ ,-1]) # exclude the totalprice variable since it is no longer response variable drop1(model.be, test = "F") ``` and so on. Show less