1. Autocorrelation

In this problem, you will simulate two error distributions with the same mean and standard deviation. One will be pure white noise (a normal distribution) and the other will intentionally have a lot of autocorrelation. The R code to generate these variables is:

set.seed(2)
x1 <- rnorm(101)                          # default mean=0, sd=1
x2 <- x1
for(i in 2:101){x2[i] = x2[i-1] + x2[i]}  # what does this do?
x2 <- scale(x2)                           # makes x2 a z score
xx <- seq(0, 10, .1)                      # generates the sequence 0, .1, .2, ..., 10
y1 <- 5 - .5*xx + x1                      # regression line with errors = x1
y2 <- 5 - .5*xx + x2                      # regression line with errors = x2

Use lm() to fit regression lines to the models y1 ~ xx and y2 ~ xx:

fit1 <- lm(y1 ~ xx)
fit2 <- lm(y2 ~ xx)

Compare the fits using summary(). Which one is better? Compare the R^2 values and the residual standard errors.
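A minimal sketch of the comparison; r.squared and sigma are components of the object returned by summary():

summary(fit1)
summary(fit2)
summary(fit1)$r.squared   # R-squared for each fit
summary(fit2)$r.squared
summary(fit1)$sigma       # residual standard error for each fit
summary(fit2)$sigma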

Use dwtest(fit) (from library(lmtest)) to check for autocorrelation. Recall that the null hypothesis is no autocorrelation (d = 2). What did you find?

#install.packages("lmtest")  ## if needed
library(lmtest)

dwtest(fit1)
dwtest(fit2)

Look at the residual plots for both models [for example: plot(xx,fit1$residuals)]. Is something wrong?
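For example, a short sketch that shows the two residual plots side by side:

par(mfrow = c(1, 2))      # two plots in one row
plot(xx, fit1$residuals, main = "Errors from x1")
plot(xx, fit2$residuals, main = "Errors from x2")
par(mfrow = c(1, 1))      # reset the plotting layout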

A solution to this problem, available in Stata, is to use "robust standard errors." Use Google to find out what this means and an R function that computes them. Apply this method to y2 ~ xx. Does this change your answer to part (b)?
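One common remedy, sketched here on the assumption that the sandwich package is available, is to keep the least squares coefficients but recompute their standard errors with a heteroskedasticity- and autocorrelation-consistent (Newey-West) covariance matrix; coeftest() comes from the lmtest package loaded above:

#install.packages("sandwich")  ## if needed
library(sandwich)
coeftest(fit2, vcov = NeweyWest(fit2))  # HAC (Newey-West) standard errors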

2. Control variables

We looked at the Fair Plan data from Chicago briefly in class. The code below will load the file which contains some relevant data summarized by zip code. This data was used in litigation to evaluate a claim that insurance companies were discriminating based on the racial composition of neighborhoods. You will now evaluate the argument.

#install.packages("faraway")  ## if needed
library(faraway)
data("chicago")
str(chicago)

## 'data.frame': 47 obs. of 7 variables:
##  $ race    : num  10 22.2 19.6 17.3 24.5 54 4.9 7.1 5.3 21.5 ...
##  $ fire    : num  6.2 9.5 10.5 7.7 8.6 34.1 11 6.9 7.3 15.1 ...
##  $ theft   : num  29 44 36 37 53 68 75 18 31 25 ...
##  $ age     : num  60.4 76.5 73.5 66.9 81.4 52.6 42.6 78.5 90.1 89.8 ...
##  $ volact  : num  5.3 3.1 4.8 5.7 5.9 4 7.9 6.9 7.6 3.1 ...
##  $ involact: num  0 0.1 1.2 0.5 0.7 0.3 0 0 0.4 1.1 ...
##  $ income  : num  11744 9323 9948 10656 9730 ...

attach(chicago)

A Fair Plan is an assigned-risk program for property insurance created by state laws called "Fair Access to Insurance Requirements" (hence FAIR). It is intended to allow property owners to buy the coverage required by mortgage lenders on buildings (including homes) for which there is not a competitive insurance market. These may be older buildings, or buildings in places that are not served by nearby fire departments, for example.

In the chicago data, the variable involact measures the number of applications (per 100) that are placed in the Fair plan. The variable race measures the percentage of the zip code's population that is non-white. The argument is that race is correlated with involact. Use a simple regression model to evaluate this claim.
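A minimal sketch, using the attached chicago variables; fit1 is the name reused in the diagnostics below:

fit1 <- lm(involact ~ race)
summary(fit1)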

Before presenting this result in court, you will want to check the model assumptions. Are the residuals normally distributed? Are the variances homogeneous? What about autocorrelation? Report on your findings. You may find the function plot(fit1) [fit1 <- lm(involact ~ race)] helpful as a starting point.
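A sketch of the standard diagnostics, assuming fit1 from above (dwtest again requires the lmtest package loaded in Problem 1):

par(mfrow = c(2, 2))    # show all four diagnostic panels at once
plot(fit1)              # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
dwtest(fit1)            # Durbin-Watson test for autocorrelation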

The insurance industry's response was that the real explanation for Fair plan placements was underwriting standards based on fire and theft claims and the age of buildings (older buildings are more expensive to repair and more difficult to appraise in advance). These variables are measured by fire, theft and age. Evaluate the insurers' claim with a multiple regression model that includes these variables and race as predictors of involact. Is race still significant in this model?
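A minimal sketch of the multiple regression; fit2 is the name that the Cook's distance code later in this problem expects:

fit2 <- lm(involact ~ race + fire + theft + age)
summary(fit2)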

The variables fire, theft and age can be interpreted as statistical controls. We are still primarily interested in the relationship between race and involact (Fair plan placements), but now we have analyzed that relationship after including the effects of the control variables.

Look at the distribution of residuals to see if the required assumptions are met for interpreting these results.

The plots label outliers by case number (row index in the data frame). We have seen that least squares estimates can be very sensitive to outliers. The following code produces a half-normal plot of the Cook's distances, a measure of how much each individual observation influences the fitted regression (it combines the size of the observation's residual with its leverage).

halfnorm(cooks.distance(fit2),main="Cook Distances")

Note: the case numbers in the actual data may be different from this.

Two points stand out on this curve. We know that least squares estimates can be affected by outliers, so let's see if this changes anything here. You can accomplish this by adding the parameter subset = (1:47)[-c(a,b)] to lm(), where a and b are the case numbers you want to remove. For the example shown here, you would use subset = (1:47)[-c(1,2)]. Make the adjustment and comment on what you observe.
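A minimal sketch of the refit with the two example cases dropped; the name fit3 is just illustrative, and you should substitute the case numbers flagged on your own plot:

fit3 <- lm(involact ~ race + fire + theft + age, subset = (1:47)[-c(1, 2)])
summary(fit3)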

Does this evidence demonstrate that the insurance industry was discriminating based on race in Chicago? This behavior is called "redlining" (visualize a red line drawn around certain areas on a map where no risks are underwritten). The insurance laws in most states have since been changed to specifically prohibit underwriting based on race. It is still permitted, however, to underwrite property exposures based on risk factors.

3. Selecting variables

In this problem, you will use regsubsets to select the best model for predicting the crime rate in Boston. The data is in the MASS package. There are 13 variables to choose from. The following code will load the required libraries and data. After loading, you can use ?Boston to learn more about the data and ?regsubsets to learn more about the function that picks the best subset.

# install.packages(c("leaps", "MASS"))  ## if needed
library(leaps)
library(MASS)
data("Boston")

Begin by examining the data. You may use summary(Boston) to get information about the data and the code below to generate histograms for each of the variables:

title <- names(Boston)
for(i in 1:ncol(Boston)) { hist(Boston[, i], main = title[i]) }

Use the arrows in the RStudio plot window to scroll through the graphs. What do you observe?

The syntax of regsubsets() is similar to lm(). The default method is exhaustive ("best subsets") search. You will want to set the nvmax parameter to 13 so that it will consider all possible model sizes, including the one with all 13 predictors. How many models are possible?

Run regsubsets(), using crim ~ . as the model, and save the summary of the result as the variable best. This can be done using the code below, for example.

best <- summary(regsubsets(crim ~ ., data = Boston, nvmax = 13))

Evaluate what you have accomplished by plotting the adjusted R-squared and Mallows' Cp (an estimate of prediction error), using plot(best$adjr2) and plot(best$cp). What is the best number of variables to use?
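A sketch of the two plots; adjr2 and cp are components of the summary object best:

plot(best$adjr2, type = "b", xlab = "Number of variables", ylab = "Adjusted R-squared")
plot(best$cp, type = "b", xlab = "Number of variables", ylab = "Mallows' Cp")
which.max(best$adjr2)   # model size maximizing adjusted R-squared
which.min(best$cp)      # model size minimizing Cp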

You can extract the variables selected by using best$which[nv, ], where nv is the number of variables. This is a matrix of logical values, with TRUE in the position of each variable used in each model. Note that the intercept is included as a variable. Which variables are used in the model that optimizes Cp?
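A sketch of the extraction, with nv = 8 used purely as an illustrative size, not the answer:

nv <- 8                          # illustrative model size
best$which[nv, ]                 # logical vector, intercept in position 1
names(which(best$which[nv, ]))   # names of the selected variables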

Now run the best model in lm(). You can use the logical vector best$which[nv, ] to select the columns of Boston, for example lm(crim ~ ., data = Boston[, best$which[nv, ]]). (The TRUE in the intercept position conveniently keeps crim, the first column of Boston, so the response is still in the data frame.)

Examine the summary results and check the distribution of the residuals. Comment on what you found.

The issues are largely caused by the skewness of the crim variable. This can be substantially improved by using a log transformation. Rerun regsubsets to obtain the best set of predictors for log(crim). The code below will get you started:

bestlog <- summary(regsubsets(log(crim) ~ ., data = Boston, nvmax = 13))

Does this pick the same predictors for the optimum model using Cp? (Hint: compare bestlog$which to best$which.)

It would be tempting to compare the transformed model to the untransformed model. Discuss the merits (pro and con) of doing this. What units are the residuals measured in for each model? Is there a way to compare them?

4. Cross validation to estimate prediction accuracy

We will use a simple cross validation method to compare the two models from the previous question. (a) Begin by splitting the data into a training set and a validation set. There are 506 observations in the data frame, so we can split the data in half for this purpose. The function sample() accomplishes this using the following code:

train = sample(506, 253)

Then the two models are refit on the training data.

fitA <- lm(crim ~ ., data = Boston[, best$which[8, ]], subset = train)
fitB <- lm(log(crim) ~ ., data = Boston[, bestlog$which[9, ]], subset = train)

Now you can use predict() to compute predictions for each model on the entire data frame. Call these predictions predA and predB. Then calculate the vectors of prediction errors, errA and errB. Since predB predicts log(crim), you will need to convert it back to the original units before computing the error (hint: exp() is the inverse of log()). Finally, compute the mean squared errors (mean(err^2)) and compare the results. What did you find?
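A sketch of the computation, using the names predA, predB, errA and errB suggested above:

predA <- predict(fitA, newdata = Boston)   # predictions of crim for all 506 rows
predB <- predict(fitB, newdata = Boston)   # predictions of log(crim)
errA <- Boston$crim - predA
errB <- Boston$crim - exp(predB)           # back-transform to original units first
mean(errA^2)                               # mean squared prediction error, model A
mean(errB^2)                               # mean squared prediction error, model B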

(b) Try repeating the entire process in part (a) 10 times. Is one model always better than the other?
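One way to do this is a simple loop that reuses the code from part (a); a minimal sketch:

for (i in 1:10) {
  train <- sample(506, 253)                # new random split each pass
  fitA <- lm(crim ~ ., data = Boston[, best$which[8, ]], subset = train)
  fitB <- lm(log(crim) ~ ., data = Boston[, bestlog$which[9, ]], subset = train)
  errA <- Boston$crim - predict(fitA, newdata = Boston)
  errB <- Boston$crim - exp(predict(fitB, newdata = Boston))
  cat("run", i, "MSE A:", mean(errA^2), "MSE B:", mean(errB^2), "\n")
}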

We can be more systematic about this by using the cross validation function in the boot package. See the Lecture 13 example script for more details. The following code will get you started. The default method uses Leave One Out Cross Validation (LOOCV). It might be slow since it is running 506 regressions.

#install.packages("boot")  ## if needed
library(boot)

glm.fitA <- glm(crim ~ ., data = Boston[, best$which[8, ]])
cv.errA <- cv.glm(Boston, glm.fitA)
cv.errA$delta[1]

## [1] 42.96874

glm.fitB <- glm(log(crim) ~ ., data = Boston[, bestlog$which[9, ]])
cv.errB <- cv.glm(Boston, glm.fitB)
cv.errB$delta[1]

## [1] 0.616186

Something is obviously wrong here. The "B" model is reporting MSE in log units. We can fix this by adding a cost function that computes the required statistic for cross validation. The revised code for model B is:

cost <- function(x, pred){ mean((exp(x) - exp(pred))^2) }
cv.errB <- cv.glm(Boston, glm.fitB, cost)
cv.errB$delta[1]

Now that we are comparing the same units, is this result different from what you got with the simple cross validation?

You can speed this up by using K-fold cross validation. Run the same code with the parameter K=10 added to cv.glm.

cv.errA <- cv.glm(Boston, glm.fitA, K = 10)
cv.errA$delta[1]

cost <- function(x, pred){ mean((exp(x) - exp(pred))^2) }
cv.errB <- cv.glm(Boston, glm.fitB, cost, K = 10)
cv.errB$delta[1]

Try some other values of K. Did the log transformation improve the performance of the model?

**********************************************

You may submit your results in a Word document created using R Markdown as described below; that's how this document was created. This is a feature of R that is very useful for creating reports. If you do this, make sure you install the "rmarkdown" package first. Then you can create a new R Markdown file that will "knit" a Word document. The backtick "`" symbol needed to create code chunks is to the left of "1" on the keyboard; it will not work if you use the apostrophe "'" next to the Enter key. (It took me some time to figure this out.)

There are some strange things that happen during conversion to Word, so you will want to look at your output and fix things that don't look right. Since it's in Word, you can use the usual editing features.

Try it. If you don't like it (or can't get it to work), you may submit your answers in a Word file and your R code in a script file.
