Question
STAT 7000 Assignment 3 (Fall 2019)
Instructions:
The due date: November 15, 2019.
Solutions must be typed. Use R to answer the questions, and pick out the values you need to include in your solution writeup.
Please submit your homework as a PDF or DOCX file to Canvas by the due date.
Data are from McDonald and Schwing (1973), "Instabilities of Regression Estimates Relating Air
Pollution to Mortality," Technometrics, 15, 463-481. This data set consists of 15 independent
variables (see list below) and a measure of mortality on 60 US metropolitan areas in 1959-1961.
Description of Variables
Y Total Age Adjusted Mortality Rate
x1 Mean annual precipitation in inches
x2 Mean January temperature in degrees Fahrenheit
x3 Mean July temperature in degrees Fahrenheit
x4 Percent of 1960 SMSA population that is 65 years of age or over
x5 Population per household, 1960 SMSA (Standard Metropolitan Statistical Area)
x6 Median school years completed for those over 25 in 1960 SMSA
x7 Percent of housing units that are sound and with all facilities
x8 Population per square mile in urbanized area in 1960
x9 Percent of 1960 urbanized area population that is non-white
x10 Percent employment in white-collar occupations in 1960 urbanized area
x11 Percent of families with income under $3,000 in 1960 urbanized area
x12 Relative pollution potential of hydrocarbons, HC
x13 Relative pollution potential of oxides of nitrogen, NOx
x14 Relative pollution potential of sulfur dioxide, SO2
x15 Percent relative humidity, annual average at 1 p.m.
Data are given in the file "pollution.txt". We are interested in predicting the mortality rate and
describing its relationship with the independent variables.
The data file can be downloaded at the following link.
http://www.auburn.edu/~billone/datasets/stat7000/pollution.txt
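As a minimal sketch, the file could be read into R along the following lines. The helper name read_pollution is hypothetical, and the layout of the file (whitespace-delimited, with a header row naming the columns) is an assumption that has not been checked against the actual file; pass header = FALSE if the first line turns out to be data.

```r
# Hypothetical helper: read a whitespace-delimited data file into a data frame.
# ASSUMPTION: the first line of the file holds the column names.
read_pollution <- function(path, header = TRUE) {
  read.table(path, header = header)
}

# Usage (needs network access; the URL is the one given in the assignment):
# pollution <- read_pollution(
#   "http://www.auburn.edu/~billone/datasets/stat7000/pollution.txt")
# str(pollution)   # inspect column names and types before any modelling
```

Inspecting the result with str() before fitting anything is a cheap way to confirm that the columns match the variable list above.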
1. Create a scatter plot matrix and examine all the pairwise relationships. Comment on these.
2. Construct the ANOVA table and tell me what the variation measures tell you.
3. Write down the hypotheses for the significance of regression.
4. Is the regression significant? Why?
5. By examining the individual hypothesis testing for the fifteen regression coefficients, which
regression coefficients are statistically significant? List these.
6. Find the correlation matrix and tell me if there is a multicollinearity problem.
7. Perform the best subset selection method using the adjusted R², Cp, MSE, AIC and BIC criteria
(for the last four criteria, the smaller the value, the better the model!).
8. Perform the sequential procedures: forward, backward and stepwise variable selection.
9. Find the optimal model(s).
10. Check if there is a collinearity issue in the optimal model(s).
11. Check the adequacy of the optimal model(s) (that is, are the assumptions
satisfied?).
12. Write one paragraph summarizing your findings.
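The questions above map onto a handful of base R calls. The following is only a sketch under stated assumptions, not a worked solution: it runs on a synthetic stand-in data frame (dat, with response y and predictors x1 to x15) rather than the real pollution.txt values, and the helpers vif and criteria are hypothetical names introduced here. An exhaustive best subset search is not shown; it is commonly done with regsubsets() from the leaps package, which is left out to keep the sketch in base R.

```r
# Synthetic stand-in with the same shape as the assignment's data
# (60 rows, response y, predictors x1..x15). These are NOT the real values;
# replace `dat` with the data frame read from pollution.txt.
set.seed(1)
n <- 60
X <- as.data.frame(matrix(rnorm(n * 15), n, 15))
names(X) <- paste0("x", 1:15)
dat <- cbind(y = 940 + 20 * X$x1 + 15 * X$x9 + rnorm(n, sd = 10), X)

# Questions 1 and 6: pairwise plots and the correlation matrix.
# pairs(dat)                       # scatter plot matrix (opens a plot device)
R <- cor(dat)

# Questions 2-5: full model, overall F test, individual t tests.
full <- lm(y ~ ., data = dat)
null <- lm(y ~ 1, data = dat)
overall <- anova(null, full)       # splits variation and gives the overall F
coefs <- summary(full)$coefficients

# Questions 6 and 10: variance inflation factors. VIF_j is the j-th diagonal
# element of the inverse of the predictor correlation matrix.
vif <- function(fit) {
  Xm <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  setNames(diag(solve(cor(Xm))), colnames(Xm))
}

# Question 7: selection criteria for one candidate model.
# Mallows' Cp = SSE_p / MSE_full - (n - 2p), with p counting the intercept.
criteria <- function(fit, full) {
  s <- summary(fit)
  p <- length(coef(fit))
  sse <- sum(residuals(fit)^2)
  c(adj_r2 = s$adj.r.squared,
    cp     = sse / summary(full)$sigma^2 - (nobs(fit) - 2 * p),
    mse    = s$sigma^2,
    aic    = AIC(fit),
    bic    = BIC(fit))
}

# Question 8: sequential procedures with step(); the default k = 2 uses AIC,
# and k = log(n) uses BIC instead.
upper <- formula(full)
fwd  <- step(null, scope = list(lower = ~ 1, upper = upper),
             direction = "forward", trace = 0)
bwd  <- step(full, direction = "backward", trace = 0)
both <- step(null, scope = list(lower = ~ 1, upper = upper),
             direction = "both", trace = 0)
bwd_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Questions 9-11: compare candidates, then check collinearity and residuals.
cand <- rbind(forward = criteria(fwd, full), backward = criteria(bwd, full),
              stepwise = criteria(both, full))
vif_both <- vif(both)
sw <- shapiro.test(residuals(both))   # normality check on the residuals
# plot(both, which = 1:2)             # residuals vs fitted, normal Q-Q plot
```

Because the stand-in data are simulated, none of the numbers this produces say anything about the mortality data; the point is only which function addresses which question.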