Question

1 Approved Answer

Posted on Jul 04, 2024

For this question, we are going to create data, and then estimate models on this simulated data. This allows us to effectively know the population parameters that we are trying to estimate. Consequently, we can reason about how well our models are doing.

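# dplyr supplies %>% and mutate(); ggplot2 is needed for the plot further down
library(dplyr)
library(ggplot2)

create_homoskedastic_data <- function(n = 100) {
  d <- data.frame(id = 1:n) %>%
    mutate(
      x1 = runif(n = n, min = 0, max = 10),
      x2 = rnorm(n = n, mean = 10, sd = 2),
      x3 = rnorm(n = n, mean = 0, sd = 2),
      # x2 has a true coefficient of 0; x3 enters only through its square
      y  = .5 + 1*x1 + 0*x2 + .25*x3^2 + rnorm(n = n, mean = 0, sd = 1)
    )
  return(d)
}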

d <- create_homoskedastic_data(n=100)

Produce a plot of the distribution of the outcome data. This could be a histogram, a boxplot, a density plot, or whatever you think best communicates the distribution of the data. What do you note about this distribution?

outcome_histogram <- d %>%
  ggplot()
  # fill in the rest of this chunk to plot
  # you will need aes layers (to map data into the plot)
  # and geom_* layers to draw the plot. You can delete these
  # comments if you like.
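One possible way to fill in the plotting chunk is sketched below. It is only an illustration: the names `d_sketch` and `outcome_histogram_sketch`, the seed, and the choice of 20 bins are assumptions made for this example, and the data are regenerated in base R so the block runs on its own.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n <- 100

# Same data-generating process as create_homoskedastic_data(), in base R
x1 <- runif(n, min = 0, max = 10)
x2 <- rnorm(n, mean = 10, sd = 2)
x3 <- rnorm(n, mean = 0, sd = 2)
d_sketch <- data.frame(
  x1 = x1, x2 = x2, x3 = x3,
  y  = 0.5 + 1 * x1 + 0 * x2 + 0.25 * x3^2 + rnorm(n, mean = 0, sd = 1)
)

# aes() maps the outcome to the x axis; geom_histogram() draws the bars
outcome_histogram_sketch <- ggplot(d_sketch, aes(x = y)) +
  geom_histogram(bins = 20, fill = "grey70", color = "white") +
  labs(x = "y", y = "Count", title = "Distribution of the simulated outcome")

print(outcome_histogram_sketch)
```

A density plot (`geom_density()`) or a boxplot (`geom_boxplot()`) would satisfy the prompt equally well; the histogram is just one choice.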

"Fill in here: What do you notice about this distribution?"

Are the assumptions of the large-sample model met so that you can use an OLS regression to produce consistent estimates?

"Fill in here: Are the large-sample assumptions satisfied?"

Estimate four models, called model_1, model_2, model_3 and model_4 that have the following form:

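$$
\begin{aligned}
Y &= \beta_0 + \beta_1 x_1 + 0 x_2 + \beta_3 x_3 + \epsilon && (1) \\
Y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon && (2) \\
Y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3^2 + \epsilon && (3) \\
Y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_3^2 + \epsilon && (4)
\end{aligned}
$$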

# If you want to read about specifying statistical models, you can read
# here: https://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models
# note, using the I() function is preferred over using poly()
model_1 <- 'fill this in'
model_2 <- 'fill this in'
model_3 <- 'fill this in'
model_4 <- 'fill this in'
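As a rough illustration of how the four specifications might translate into `lm()` formulas, here is a sketch on independently simulated data. The names `d_sketch`, `sketch_1` through `sketch_4`, and `no_I` are hypothetical and chosen for this example only. It also shows why `I()` matters: inside a formula, `^` is a formula operator, so `x3^2` without `I()` reduces to plain `x3`.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n <- 100
d_sketch <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = rnorm(n, mean = 10, sd = 2),
  x3 = rnorm(n, mean = 0, sd = 2)
)
d_sketch$y <- 0.5 + 1 * d_sketch$x1 + 0.25 * d_sketch$x3^2 + rnorm(n)

# Equation (1): x2 is left out, which fixes its coefficient at 0
sketch_1 <- lm(y ~ x1 + x3, data = d_sketch)
# Equation (2): x2 added
sketch_2 <- lm(y ~ x1 + x2 + x3, data = d_sketch)
# Equation (3): x3 enters only as a square; I() protects the arithmetic
sketch_3 <- lm(y ~ x1 + x2 + I(x3^2), data = d_sketch)
# Equation (4): both the linear and the squared x3 terms
sketch_4 <- lm(y ~ x1 + x2 + x3 + I(x3^2), data = d_sketch)

# Without I(), the formula parser treats x3^2 as x3, so no square is fitted
no_I <- lm(y ~ x1 + x2 + x3^2, data = d_sketch)
names(coef(no_I))
names(coef(sketch_3))
```

Comparing the two sets of coefficient names makes the difference visible: only the `I(x3^2)` version contains a squared term.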

calculate_msr <- function(model) {
  # This function takes a model, and uses the `resid` function
  # together with the definition of the msr to produce
  # the MEAN of the squared residuals
  msr <- mean(resid(model)^2)
  return(msr)
}

model_1_msr <- 'fill this in'
model_2_msr <- 'fill this in'
model_3_msr <- 'fill this in'
model_4_msr <- 'fill this in'

Consider, for a moment, only the first model. Is it possible to select coefficients in this model that would produce a lower mean squared residual? Why or why not?
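One way to probe this question numerically is sketched below: fit the first specification, then evaluate the mean squared residual at many randomly perturbed coefficient vectors and compare against the OLS value. The names `d_sketch`, `fit`, `msr_at`, and `perturbed`, along with the seed and the perturbation size, are assumptions made for this illustration.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n <- 100
d_sketch <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = rnorm(n, mean = 10, sd = 2),
  x3 = rnorm(n, mean = 0, sd = 2)
)
d_sketch$y <- 0.5 + 1 * d_sketch$x1 + 0.25 * d_sketch$x3^2 + rnorm(n)

fit <- lm(y ~ x1 + x3, data = d_sketch)
X <- model.matrix(fit)  # columns: intercept, x1, x3

# Mean squared residual for an arbitrary coefficient vector beta
msr_at <- function(beta) mean((d_sketch$y - X %*% beta)^2)

msr_ols <- msr_at(coef(fit))

# Try 1000 nearby coefficient vectors and record the MSR of each
perturbed <- replicate(
  1000,
  msr_at(coef(fit) + rnorm(length(coef(fit)), mean = 0, sd = 0.1))
)

msr_ols
min(perturbed)
```

Because OLS chooses the coefficients that minimize the sum of squared residuals for a given specification, none of the perturbed vectors should beat `msr_ols`. This says nothing about whether a different specification (for example, one with a squared term) could do better.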

Which of these models does the best job, in terms of mean squared residuals, at estimating the population coefficients?

Is there any evidence that the additional parameter that you have estimated in model_2 makes this second model more fully represent the true population? Conduct an F-test with the null hypothesis that model_1 is the correct population model, and evaluate whether you should reject the null to instead conclude that model_2 is more appropriate.

## anova(model_2, model_1, test = 'F')

Explain why the p-values for the tests that you have conducted in parts (a) and (b) are the same. Are these tests merely different ways of asking the same question of a model?
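The sketch below checks the relationship numerically, assuming the two tests in question are the t-test on the `x2` coefficient in the larger model and the nested F-test comparing the two models, both with classical (homoskedastic) standard errors. The names `d_sketch`, `fit_small`, and `fit_big` are hypothetical.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n <- 100
d_sketch <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = rnorm(n, mean = 10, sd = 2),
  x3 = rnorm(n, mean = 0, sd = 2)
)
d_sketch$y <- 0.5 + 1 * d_sketch$x1 + 0.25 * d_sketch$x3^2 + rnorm(n)

fit_small <- lm(y ~ x1 + x3, data = d_sketch)       # restricted: x2 coefficient is 0
fit_big   <- lm(y ~ x1 + x2 + x3, data = d_sketch)  # unrestricted

# Nested F-test: one restriction, so one numerator degree of freedom
f_table <- anova(fit_small, fit_big, test = "F")
f_stat  <- f_table$F[2]
f_p     <- f_table$`Pr(>F)`[2]

# t-test on the single added coefficient
t_row  <- summary(fit_big)$coefficients["x2", ]
t_stat <- unname(t_row["t value"])
t_p    <- unname(t_row["Pr(>|t|)"])

all.equal(t_stat^2, f_stat)  # the F statistic is the square of the t statistic
all.equal(t_p, f_p)          # so the two p-values coincide
```

With a single restriction, the F statistic equals the squared t statistic and the two reference distributions agree, which is why both tests report the same p-value. The equality would not generally hold if the t-test used robust standard errors while the F-test did not.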