Question

Posted on Sep 22, 2024

ONLY QUESTION 1A

Task 0.A (2 points) In a code chunk, load the wooldridge, lmtest, sandwich, and AER packages. If you have not yet installed all of them, then do so. Remember, you never ever use install.packages inside a code chunk. You install only once directly in the console, then use library to load in your chunk. Make sure you have put your name in the header of your work, and check now to make sure you can knit to pdf without any problems.
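A setup chunk along these lines would cover Task 0.A. It is only a sketch and assumes all four packages were already installed once from the console.

```r
# One-time install, run in the console only (never inside a chunk):
# install.packages(c("wooldridge", "lmtest", "sandwich", "AER"))

library(wooldridge)  # textbook datasets
library(lmtest)      # coeftest(), bptest()
library(sandwich)    # vcovHC() robust covariance estimators
library(AER)         # provides the CASchools data
```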

Part 1 The goal of this problem is to get used to interpreting interaction terms, which we introduced in the second half of Multivariate Regression. You can start this problem set, and finish once you have covered Multivariate Regression units 02-12 to 02-15.

Task 1.A (4 points)

(a) (2 points) Use the command data(CASchools) to load the CASchools dataset. This creates a data.frame in your session of R called CASchools.

(b) (0 points) Typing (directly into the console) ?CASchools will bring up a help window that will tell you about each variable in the data. It won't appear in your Rmarkdown output. Do this to read the variable descriptions. We're going to work a lot with calworks, so take note of the units for that variable.

(c) (2 points) And finally, let's get a count of observations by county to see which counties have the most observations. You'll want to use the command table(...) on the column of CASchools that contains the county data.

Question 1.A: Load and explore data (5 points) Please answer the following questions. Create a header in your template using ## Question 1.A, and label each of your answers with the corresponding letter (a)-(d).

(a) (1 point) How many observations are in the data?

(b) (2 points) We are interested in test scores. Which variable(s) in CASchools would be our outcome of interest?
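The counting steps from Task 1.A can be sketched on a small made-up data frame. The name toy and its values are hypothetical stand-ins; with AER loaded, the same calls would be applied to CASchools.

```r
# Hypothetical stand-in for CASchools (the real data come from the AER package).
toy <- data.frame(
  county = c("Kern", "Sonoma", "Kern", "Los Angeles", "Kern", "Sonoma"),
  read   = c(640, 665, 652, 648, 655, 671)
)

n_obs <- nrow(toy)                  # number of observations (rows)
county_counts <- table(toy$county)  # observations per county

# Sort so the counties with the most observations come first.
sort(county_counts, decreasing = TRUE)
```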

Task 1.B: Data Cleaning (9 points) This task will create some useful variables and run our first regression. We get to rely on lm(...) for our regressions now, no more doing them by hand!

(a) (1 point) Many education experts think that the average student:teacher ratio is important in test scores. Create a variable called studentTeacherRatio in the CASchools data.frame. Always create the variables in the data.frame; do not leave them out in the environment.

(b) (2 points) Let's keep only the three counties with the most observations: Sonoma, Los Angeles, and Kern. Create a conditional called bigCounties that is TRUE if the variable county is any of these three counties. Remember that | is the "or" logical operator.

(c) (1 point) Subset CASchools so that it contains only the rows (observations) from those three counties. Remember that we subset using CASchools[rowIndex, colIndex] and we select all rows or columns by leaving the index blank.

(d) (2 points) Make two scatterplots using plot(...). Plot math scores (y-axis) against student:teacher ratios (the variable we created in the first part of this task) on the x-axis. Plot the points in green using col="green" in your plot. Plot reading scores against the same variable as well.

(e) (2 points) Make two more plots (for a total of four) with math and reading plotted against income. Plot these points in blue.

(f) (1 point) We'll make one last plot. This one will be of income against student:teacher ratios, so that we can see if higher-income schools have lower student:teacher ratios. Let's color-code each point by county. To do this, add col = as.factor(CASchools$county) in your call to plot(...). R will give each county's points a different color.
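The cleaning and plotting steps in Task 1.B might look like the sketch below. It runs on a hypothetical data frame called toy with made-up values. The column names students, teachers, income, math, and read mirror those in CASchools, and only three of the six requested plots are shown.

```r
# Hypothetical stand-in for CASchools; the column names mirror the real data.
toy <- data.frame(
  county   = c("Kern", "Sonoma", "Los Angeles", "Butte", "Kern", "Sonoma"),
  students = c(400, 900, 2500, 300, 650, 1200),
  teachers = c(20, 50, 120, 15, 30, 60),
  income   = c(11.2, 18.5, 15.1, 12.0, 10.4, 20.3),
  math     = c(640, 668, 651, 645, 638, 672),
  read     = c(642, 670, 650, 649, 641, 675)
)

# (a) Store the new variable inside the data.frame.
toy$studentTeacherRatio <- toy$students / toy$teachers

# (b) TRUE for rows in any of the three counties, built with the | operator.
toy$bigCounties <- toy$county == "Sonoma" |
  toy$county == "Los Angeles" |
  toy$county == "Kern"

# (c) Keep the matching rows; the blank column index keeps every column.
toy <- toy[toy$bigCounties, ]

# (d)-(f) Scatterplots take the x variable first, then the y variable.
plot(toy$studentTeacherRatio, toy$math, col = "green")
plot(toy$income, toy$read, col = "blue")
plot(toy$studentTeacherRatio, toy$income, col = as.factor(toy$county))
```

An equivalent way to write the conditional in (b) is toy$county %in% c("Sonoma", "Los Angeles", "Kern"), although the task asks for the | operator.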

Question 1.B (7 points)

(a) (2 points) Using your first two plots, does there appear to be a relationship between higher student:teacher ratios and math scores? What about reading scores?

(b) (2 points) Using your last two plots, does there appear to be a relationship between higher income and math scores? What about reading scores?

(c) (3 points) Using the final plot, does it appear that some counties have higher income or higher student:teacher ratios (or both)? Discuss.

Task 1.C (10 points) Let's run some regressions. Before we do that, remember that we almost always want to use heteroskedasticity-consistent errors. See lecture slides 02-07 for instructions. Someone has pointed out that we can combine the steps. coeftest lets us say what errors we want beforehand: coeftest(lm(Y ~ X1+X2, df), vcov = vcovHC, type = "HC1") This gives the same result as above, but in one line. The vcov and "HC1" are always the same. We think there might be a relationship between reading scores and the share of students receiving aid from calworks, the State of California's aid program for families with children.

(a) (10 points) Start by running a regression of read on calworks. That is, $read = \beta_0 + \beta_1 calworks$. Use an output that shows heteroskedasticity-robust standard errors.
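As an illustration of robust standard errors for a one-regressor model, the sketch below uses simulated data (the data frame toy and all of its numbers are made up) and names the type argument explicitly so that HC1 is the estimator actually used.

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# Simulated, hypothetical data whose error variance grows with calworks.
set.seed(1)
toy <- data.frame(calworks = runif(200, 0, 60))
toy$read <- 680 - 0.5 * toy$calworks + rnorm(200, sd = 2 + 0.2 * toy$calworks)

fit <- lm(read ~ calworks, data = toy)

# Coefficient table with HC1 heteroskedasticity-robust standard errors.
robust <- coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
robust
```

The point estimates match those from summary(fit). Only the standard errors, test statistics, and p-values change.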

Question 1.C (8 points)

(a) (3 points) What is the coefficient on calworks and what does it mean? Note that calworks is in percentage points (you can see the range using range(CASchools$calworks))

(b) (5 points) What potential omitted variables might bias this coefficient? That is, is there something unobserved correlated with calworks that might also be correlated with read?

Task 1.D (5 points) We might worry that there is something unobserved about the counties that is common in all of the districts within them. After all, Kern County is a rural, agricultural county, while Los Angeles is urban and highly populated.

(a) (5 points) Run a regression that controls for anything that is common within all observations in a county. Make sure you use heteroskedasticity-robust errors in your output.

Question 1.D (5 points)

(a) (2 points) What is the new coefficient on calworks? Is it larger or smaller?

(b) (1 point) What is the base county level?

(c) (2 points) What is the expected reading score for an observation in Sonoma County, at a school with a calworks value of 25%?
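County fixed effects, the base level, and a predicted value can be sketched as follows. The data are simulated and hypothetical, and the sketch assumes county is stored as a factor, so that lm creates one dummy per county except the first level.

```r
# Simulated, hypothetical data; county is stored as a factor.
set.seed(2)
toy <- data.frame(
  county   = factor(rep(c("Kern", "Los Angeles", "Sonoma"), each = 30)),
  calworks = runif(90, 0, 50)
)
county_shift <- c(-5, 0, 8)[as.integer(toy$county)]  # made-up county effects
toy$read <- 670 - 0.4 * toy$calworks + county_shift + rnorm(90, sd = 3)

# County dummies absorb anything common to all observations in a county.
fit_fe <- lm(read ~ calworks + county, data = toy)

# The omitted (base) level is the first factor level, alphabetical by default.
base_level <- levels(toy$county)[1]

# Expected score for a Sonoma observation with calworks = 25.
new_obs <- data.frame(
  calworks = 25,
  county   = factor("Sonoma", levels = levels(toy$county))
)
pred <- predict(fit_fe, newdata = new_obs)

# Same number by hand: intercept + slope * 25 + Sonoma dummy coefficient.
b <- coef(fit_fe)
by_hand <- b[["(Intercept)"]] + 25 * b[["calworks"]] + b[["countySonoma"]]
```

Robust standard errors can be layered on top with coeftest in the same way as for the simpler regression. They do not change the coefficients or the prediction.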

Task 1.E (5 points) About that assumption of heteroskedasticity... We might want to stop right here and ask, "do we really need the heteroskedasticity-consistent errors?" Let's implement the Breusch-Pagan Test for Heteroskedasticity (slide 108 of 02-Multivariate Regression).

(a) (2 points) Using the regression in Task 1.D, create a new column in CASchools called uhat that contains the residuals. Then, create a new column called uhat2 containing the square of the residual, $\hat{u}^2$. Note: the command residuals(myReg) will take the results of any regression using lm and return the residuals, $\hat{u}$, for the original data used in the regression.

(b) (3 points) Run the appropriate regression (see slide 108 of 02-Multivariate Regression) and show the results.
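A by-hand Breusch-Pagan test might be sketched as below. The slide itself is not reproduced here, so this assumes the textbook version: regress the squared residuals on the same regressors and look at the overall F-statistic. The data are simulated and hypothetical.

```r
# Simulated, hypothetical data whose error variance grows with calworks.
set.seed(3)
toy <- data.frame(
  county   = factor(rep(c("Kern", "Los Angeles", "Sonoma"), each = 40)),
  calworks = runif(120, 0, 50)
)
toy$read <- 670 - 0.4 * toy$calworks + rnorm(120, sd = 1 + 0.2 * toy$calworks)

fit_fe <- lm(read ~ calworks + county, data = toy)

# (a) Residuals and squared residuals, stored in the data.frame.
toy$uhat  <- residuals(fit_fe)
toy$uhat2 <- toy$uhat^2

# (b) Auxiliary regression of the squared residuals on the original regressors.
aux <- lm(uhat2 ~ calworks + county, data = toy)
aux_summary <- summary(aux)

# Overall F-statistic and its p-value; H0 is no heteroskedasticity.
f <- aux_summary$fstatistic  # named vector: value, numdf, dendf
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
```

A small p-value is evidence against H0, which would support using heteroskedasticity-consistent errors.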

Question 1.E (5 points)

(a) (2 points) What is the relevant output of the regression summary for our Breusch-Pagan Test, $H_0$: no heteroskedasticity present?

(b) (3 points) What is your interpretation of the results? Should we be using heteroskedastic errors?

Task 1.F (2 points) The lmtest package has its own function for a Breusch-Pagan test. It works much like ours, but yields slightly different answers. Use ?bptest to see how to use this function and run it on the regression from Task 1.D.
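One likely reason for a small difference is that bptest reports a chi-squared (LM) statistic instead of an F-statistic. The sketch below, on simulated hypothetical data, checks that with the default studentize = TRUE the statistic equals n times the R-squared of the auxiliary regression of squared residuals on the regressors.

```r
library(lmtest)  # bptest()

# Simulated, hypothetical data whose error variance grows with calworks.
set.seed(4)
toy <- data.frame(calworks = runif(150, 0, 50))
toy$read <- 670 - 0.4 * toy$calworks + rnorm(150, sd = 1 + 0.2 * toy$calworks)

fit <- lm(read ~ calworks, data = toy)

# Built-in test; the default studentize = TRUE gives Koenker's version.
bp <- bptest(fit)

# Manual LM statistic: n * R^2 from the auxiliary regression.
toy$uhat2 <- residuals(fit)^2
aux <- lm(uhat2 ~ calworks, data = toy)
lm_stat <- nobs(fit) * summary(aux)$r.squared

bp$statistic  # should agree with lm_stat
bp$p.value    # compared against a chi-squared distribution
```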

Question 1.F (2 points)

(a) (1 point) What is the interpretation of the result from this version of the Breusch-Pagan test?

(b) (1 point) Is it the same (or very close) to our results from Task 1.E?

Task 1.G (8 points) We might think that, within each county, schools with higher shares of students who speak English as a second language might have different reading scores. Furthermore, we might think that the effect of calworks on read is different for districts with higher shares of English as a second language.

(a) (8 points) Run a regression with this interaction between calworks and english. Include county fixed effects as well, but do not interact the county fixed effects with anything.
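A continuous-by-continuous interaction with un-interacted county fixed effects could be set up as in this sketch. The data are simulated and hypothetical, and calworks_effect is an illustrative helper, not part of any package.

```r
# Simulated, hypothetical data.
set.seed(5)
toy <- data.frame(
  county   = factor(rep(c("Kern", "Los Angeles", "Sonoma"), each = 40)),
  calworks = runif(120, 0, 50),
  english  = runif(120, 0, 60)
)
toy$read <- 680 - 0.3 * toy$calworks - 0.2 * toy$english -
  0.005 * toy$calworks * toy$english + rnorm(120, sd = 3)

# calworks * english expands to both main effects plus calworks:english.
# county enters additively, so the fixed effects are not interacted.
fit_int <- lm(read ~ calworks * english + county, data = toy)
b <- coef(fit_int)

# Hypothetical helper: change in read for a one-unit increase in calworks,
# evaluated at a given share of English learners.
calworks_effect <- function(english) {
  b[["calworks"]] + b[["calworks:english"]] * english
}

calworks_effect(0)   # equals the calworks main effect
calworks_effect(40)  # main effect plus 40 times the interaction coefficient
```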

Question 1.G (15 points)

(a) (4 points) What is the effect of an increase in calworks by one unit for a school with 0% english learners (english=0)? Remember to consider both the main effect of english and the interaction. See lecture notes on continuous x continuous interactions.

(b) (4 points) What is the effect of an increase in calworks by one unit for a school with 40% english learners?

(c) (7 points) What is the formula you used to determine $\frac{dRead}{dCalworks}$? Hint: it includes the variable english. Write it using LaTeX. One of the advantages of using LaTeX with Rmarkdown is using the math equations functionality of LaTeX. To help you write the answer to this question, here are some LaTeX methods:

First, to write in LaTeX notation, you wrap your math in $s (start and end), like this: $2+2$ which will produce in your output: 2 + 2

To write a Greek letter, you put a backslash before the name of the letter: $\beta + \gamma$, which results in β + γ

The * character doesn't display well in LaTeX, so we use a special character for multiplication, $\times$, which gives ×

To add a subscript, like on a coefficient, you use _{sub}: $\beta_{calworks} calworks + \beta_{Los Angeles} LosAngeles$ gives β_calworks calworks + β_LosAngeles LosAngeles. In subscripts, all spaces are removed.

To make a fraction, you use $\frac{numerator}{denominator}$, which gives the numerator stacked over the denominator.

To put it on one line as a standalone equation, you wrap it in double $$. So $$\frac{dRead}{dCalworks} = $$ will give you dRead over dCalworks, followed by an equals sign, on its own line.

Start from there and use proper notation to write your answer to (c)!
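As a worked illustration of the derivative, suppose the Task 1.G regression is written as in the first equation below. The coefficient labels $\beta_0$ through $\beta_3$ and the county term $\delta_{county}$ are naming assumptions made here for illustration. Differentiating with respect to calworks leaves the main effect plus the interaction coefficient times english.

```latex
$$read = \beta_0 + \beta_1 calworks + \beta_2 english + \beta_3 (calworks \times english) + \delta_{county} + u$$

$$\frac{dRead}{dCalworks} = \beta_1 + \beta_3 \times english$$
```

Setting english to 0 leaves only $\beta_1$, and setting it to 40 gives $\beta_1 + 40 \times \beta_3$.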