Question
# Problem 1
The divorce rate in the United States during the years 1920-1996
can be modeled with the quantitative variables listed below.
In Problem 1 you will examine an assumed linear model of
the divorce rate as a function of socio-economic characteristics (as predictors).
We will assume i.i.d. normal errors for the response
values, with unknown constant error variance. In the following
questions, use the data as-is; do not remove any outliers.
The datasets for the problems can be found here: https://drive.google.com/drive/folders/1CI6kWaDpYQBt3V22s7XD8T872lxOPl7I?usp=sharing
Please use R and explain along the way.
Read the data from `divusa.txt` into R using
`divusa <- read.table("divusa.txt", header = TRUE, sep = ",")`.
The data description is as follows:
- `divorce`: divorces per 1000 women aged 15 or more
- `unemployed`: unemployment rate
- `femlab`: percent female participation in labor force aged 16+
- `marriage`: marriages per 1000 unmarried women aged 16+
- `birth`: births per 1000 women aged 15-44
- `military`: military personnel per 1000 population
## Part (a)
Produce a numerical summary of the data, and use the function `pairs()` (in base R)
to show a graphical summary of the data. Do you see anything that looks promising for
modeling? Do you see anything that may alert you to potential problems? Limit
your answer to one or two sentences.
## Part (b)
Fit a linear model to predict the variable `divorce` from the
variable `femlab`.
## Part (c)
What *specific* hypothesis is being tested with the p-value
given for the slope coefficient in the output in part (b)? (State the null and
alternative hypotheses.) Do you accept or reject the null hypothesis, and
on what basis?
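
As a reference point, the slope p-value that `summary()` prints for an `lm` fit comes from a two-sided t-test. Written out for the model in part (b), with $\beta_0$, $\epsilon_i$, and $n$ as notation introduced here rather than taken from the problem statement, the setup is:

$$
\text{divorce}_i = \beta_0 + \beta_1\,\text{femlab}_i + \epsilon_i, \qquad
H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0, \qquad
t = \frac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)} \sim t_{n-2} \ \text{under } H_0 .
$$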
## Part (d)
What is the sample size?
## Part (e)
Does the intercept term have a useful interpretation, in terms
of the model? Explain in one or two sentences.
## Part (f)
What percentage of variation in the data is not explained by the
model?
## Part (g)
Plot the standardized residuals against the response variable
and the predictor variable, and produce a Q-Q plot of the standardized
residuals. What can we conclude about the normality of the errors, the
constancy of the error variance, and the relationship between the errors and
the variable?
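
One way to produce the three diagnostic plots is sketched below. The data frame here is a simulated stand-in with arbitrary numbers, not the real `divusa.txt`; only the column names `divorce` and `femlab` come from the data description, and the object names `fit` and `r_std` are illustrative.

```r
# Simulated stand-in for divusa.txt (arbitrary numbers, NOT the real data);
# replace these three lines with the read.table() call from the problem.
set.seed(1)
divusa <- data.frame(femlab = seq(22, 60, length.out = 77))
divusa$divorce <- -5 + 0.4 * divusa$femlab + rnorm(77, sd = 1.5)

fit <- lm(divorce ~ femlab, data = divusa)

# Standardized (internally studentized) residuals
r_std <- rstandard(fit)

op <- par(mfrow = c(1, 3))
plot(divusa$divorce, r_std,
     xlab = "divorce (response)", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
plot(divusa$femlab, r_std,
     xlab = "femlab (predictor)", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
qqnorm(r_std)   # normal Q-Q plot of the standardized residuals
qqline(r_std)   # reference line through the quartiles
par(op)
```

Curvature or a funnel shape in the first two panels would point to a misspecified mean or non-constant variance; systematic departure from the reference line in the Q-Q plot would point to non-normal errors.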
## Part (h)
What is the estimated mean divorce rate when `femlab = 38`?
## Part (i)
Construct a 97%
prediction interval around the mean response estimated in part (h).
## Part (j)
Construct a 97%
confidence interval for $\beta_1$, the slope coefficient.
## Part (k)
Suppose that the percent of female participation in the labor
force increased by 13 from one year to the next. What would be the
predicted change in the US divorce rate?
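
Parts (h) through (k) map onto a few base R calls, sketched here on a simulated stand-in for `divusa.txt`. The numbers are arbitrary and are not the real data, and the object names are illustrative. Since the wording of part (i) mixes "prediction interval" and "mean response", both interval types are shown.

```r
# Simulated stand-in for divusa.txt (arbitrary numbers, NOT the real data)
set.seed(1)
divusa <- data.frame(femlab = seq(22, 60, length.out = 77))
divusa$divorce <- -5 + 0.4 * divusa$femlab + rnorm(77, sd = 1.5)

fit <- lm(divorce ~ femlab, data = divusa)
new_obs <- data.frame(femlab = 38)

# Part (h): estimated mean response at femlab = 38
mean_hat <- predict(fit, newdata = new_obs)

# Part (i): 97% prediction interval for a new observation at femlab = 38,
# alongside the narrower 97% confidence interval for the mean response
pi97 <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.97)
ci97_mean <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.97)

# Part (j): 97% confidence interval for the slope
ci97_slope <- confint(fit, "femlab", level = 0.97)

# Part (k): a 13-unit increase in femlab changes the prediction by 13 * slope
delta13 <- 13 * coef(fit)[["femlab"]]
```

The prediction interval is always wider than the confidence interval at the same point, because it adds the variance of a single new error term to the uncertainty in the fitted mean.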
# Problem 2
Download the data set `Tree.txt`. Collected by Bruce and
Schumacher, this classic dataset measures the diameter (x, in inches) and
volume (y, in cubic feet) of shortleaf pines.
Load the data using `tree <- read.table("Tree.txt", header = TRUE)`.
## Part (a)
Fit a simple linear regression model for predictor diameter and
response volume.
## Part (b)
Assess the appropriateness of the model fit using model
diagnostics. Limit your response to two or three sentences.
## Part (c)
Fit a simple linear regression to the log-log transformed data
(take the natural logarithm of both the response and the predictor variable).
## Part (d)
Produce a Q-Q plot of the standardized residuals from the
transformed model in part (c), and plot the standardized residuals against the
response variable and the predictor variable. What can we conclude about the
normality of the errors, the constancy of the error variance, and the
relationship between the errors and the variable?
## Part (e)
What is the nature of the relationship between the diameter and
volume of shortleaf pines? Is there a significant association between the
diameter and volume?
## Part (f)
Interpret the coefficient $\beta_1$, in terms of the
model.
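
To see what $\beta_1$ means on the original scale, it can help to write the log-log model from part (c) both ways. Here $x_i$ is diameter and $y_i$ is volume, as in the problem statement; $\beta_0$ and $\epsilon_i$ are notation introduced for this sketch:

$$
\ln y_i = \beta_0 + \beta_1 \ln x_i + \epsilon_i
\quad\Longleftrightarrow\quad
y_i = e^{\beta_0}\, x_i^{\beta_1}\, e^{\epsilon_i} .
$$

Under this form, multiplying the diameter by a factor $c$ multiplies the typical volume by $c^{\beta_1}$, so a 1% increase in diameter corresponds to roughly a $\beta_1$% change in volume.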
## Part (g)
What is the expected volume of a tree with an eleven inch
diameter?
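
A minimal sketch of the back-transformation, assuming the log-log model from part (c) is the one used for prediction. The data are simulated with arbitrary numbers and are not the real `Tree.txt`; the column names `x` and `y` are an assumption based on the description above and may differ in the actual file.

```r
# Simulated stand-in for Tree.txt (arbitrary numbers, NOT the real data);
# column names x (diameter) and y (volume) are assumed.
set.seed(2)
tree <- data.frame(x = runif(70, min = 4, max = 24))
tree$y <- exp(-2.9 + 2.6 * log(tree$x) + rnorm(70, sd = 0.15))

fit_log <- lm(log(y) ~ log(x), data = tree)

# Predict on the log scale, then exponentiate to return to cubic feet
log_pred <- predict(fit_log, newdata = data.frame(x = 11))
vol_pred <- exp(log_pred)
vol_pred
```

Exponentiating the predicted log volume estimates the median volume when the errors are normal on the log scale; an estimate of the mean would need an additional correction such as multiplying by $e^{\hat\sigma^2/2}$.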
# Problem 3
Download the semiconductor photomask line-spacing data. The data
includes measurement errors for measurements taken at different line spacing.
It appears that the precision of the line-spacing measurements decreases as the
line spacing increases.
- `line_space`: The line spacing for the observation.
- `measurement_error`: The measurement error for that observation.
- `sd`: The standard deviation for the $Y_i$ of each observation.
Read in the data using
`photomask <- read.table("measurements.txt", header = TRUE)`.
## Part (a)
Why would the Weighted Least Squares model be appropriate in
this situation?
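
For orientation, the weighted least squares criterion can be written as follows, assuming the error standard deviation of observation $i$ is the `sd` value $\sigma_i$ (possibly up to a common scale factor). The symbols $x_i$, $w_i$, $\beta_0$, and $\beta_1$ are notation introduced here:

$$
(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} w_i \,\bigl(Y_i - \beta_0 - \beta_1 x_i\bigr)^2,
\qquad w_i = \frac{1}{\sigma_i^2} .
$$

Observations measured with less precision (larger $\sigma_i$) receive smaller weights, so they pull less on the fitted line than they would under ordinary least squares.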
## Part (b)
Fit a weighted least squares regression model to predict the measurement error for a
given line spacing by passing the weights directly to the `lm` function via the
`weights = ` argument.
## Part (c)
Is this model significant at $\alpha = 0.001$?
## Part (d)
Use your model from Part (b) to find a 95% prediction interval
for a new measurement taken at a line spacing of 1.99.
## Part (e)
Why is the prediction interval in part (d) untrustworthy? Is the
interval at this location going to be too small or too large?
## Part (f)
Build a new model that incorporates the weights into the
variables for a LS model.
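
The sketch below illustrates one reading of "incorporating the weights into the variables": divide the response, the intercept column, and the predictor by `sd`, then fit ordinary least squares with no automatic intercept. The data are simulated with arbitrary numbers and are not the real `measurements.txt`; the helper columns `y_s`, `x0_s`, and `x1_s` are hypothetical names.

```r
# Simulated stand-in for measurements.txt (arbitrary numbers, NOT the real data)
set.seed(3)
photomask <- data.frame(line_space = seq(0.5, 4, length.out = 40))
photomask$sd <- 0.005 * photomask$line_space
photomask$measurement_error <-
  0.01 + 0.02 * photomask$line_space + rnorm(40, sd = photomask$sd)

# Part (b): weights passed directly to lm(), w_i = 1 / sd_i^2
wls <- lm(measurement_error ~ line_space, data = photomask,
          weights = 1 / sd^2)

# Part (f): scale every term of the model by 1 / sd_i, then use plain OLS
photomask$y_s  <- photomask$measurement_error / photomask$sd
photomask$x0_s <- 1 / photomask$sd                     # scaled intercept column
photomask$x1_s <- photomask$line_space / photomask$sd  # scaled predictor
ols_scaled <- lm(y_s ~ 0 + x0_s + x1_s, data = photomask)

# The two fits give the same coefficient estimates
cbind(wls = coef(wls), ols_scaled = coef(ols_scaled))
```

The `0 +` in the formula matters: the intercept is now carried by `x0_s`, so letting `lm` add its own constant column would fit a different model.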
## Part (g)
Use your model from Part (f) to find a more accurate prediction
interval for the measurement error at `line_space = 1.99`. Use a standard deviation of
0.013.
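
A possible approach, assuming the scaled-variable model described in part (f) and the 95% level from part (d): scale the new point by its own standard deviation, predict on the scaled axis, then multiply the interval by that standard deviation to return to the original units. The data are simulated with arbitrary numbers and are not the real `measurements.txt`; `y_s`, `x0_s`, `x1_s`, and `sd_new` are hypothetical names.

```r
# Simulated stand-in for measurements.txt (arbitrary numbers, NOT the real data)
set.seed(3)
photomask <- data.frame(line_space = seq(0.5, 4, length.out = 40))
photomask$sd <- 0.005 * photomask$line_space
photomask$measurement_error <-
  0.01 + 0.02 * photomask$line_space + rnorm(40, sd = photomask$sd)

# Scaled-variable OLS model: every term divided by sd_i
photomask$y_s  <- photomask$measurement_error / photomask$sd
photomask$x0_s <- 1 / photomask$sd
photomask$x1_s <- photomask$line_space / photomask$sd
ols_scaled <- lm(y_s ~ 0 + x0_s + x1_s, data = photomask)

# New observation at line_space = 1.99 with standard deviation 0.013
sd_new <- 0.013
new_scaled <- data.frame(x0_s = 1 / sd_new, x1_s = 1.99 / sd_new)

# Interval on the scaled axis, then converted back to the original units
pi_scaled <- predict(ols_scaled, newdata = new_scaled,
                     interval = "prediction", level = 0.95)
pi_new <- pi_scaled * sd_new
pi_new
```

Because the new observation's own standard deviation enters both the scaling and the back-conversion, the interval width reflects the precision at this line spacing rather than a single constant error variance.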