Q9: Some of the typing is wrong; please refer to the attached file.
In a study of the relationship between an explanatory variable $Z$ and a response variable $Y$, $n$ independent observations are taken in the form of $(y_i, z_i)$, $i = 1, 2, \ldots, n$, for the variable pair $(Y, Z)$. Assume that $z_1, z_2, \ldots, z_n$ are nonrandom constants with $\sum_{i=1}^n z_i = 0$, and
$Y = \beta Z + \epsilon$, $\epsilon \sim N(0, \sigma^2)$,
for some unknown parameters $\beta$ and $\sigma$.
(a) Write down the design matrix $X$ for the above model.
(b) Show that $(X^\top X)^{-1} X^\top [y_1, \ldots, y_n]^\top = \dfrac{\sum_{j=1}^n z_j y_j}{\sum_{j=1}^n z_j^2}$. What is the use of the above formula?
(c) Let $\hat\beta$ be the least squares estimator of $\beta$. Show that $\hat\beta$ is normally distributed with mean $\beta$ and variance $\sigma^2 / \sum_{j=1}^n z_j^2$.
(d) Define the residual $\hat\epsilon_i$ for the $i$th observation.
(e) What is the distribution of the residual sum of squares $\mathrm{SSE} = \sum_{i=1}^n \hat\epsilon_i^2$? Is SSE independent of $\hat\beta$?
(f) Using (c) and (e), find the distribution of $T = \dfrac{\hat\beta \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}/(n-1)}}$ if $\beta = 0$.
(g) Suppose that each observation $y_i$ has been mis-recorded as $10 y_i$. Based on these misrecorded responses,
  (i) show that the least squares estimate of $\beta$ is $\hat\beta_1 = 10\hat\beta$, where $\hat\beta$ is defined as in (c);
  (ii) show that the residual sum of squares $\mathrm{SSE}_1$ is $100\,\mathrm{SSE}$, where SSE is defined as in (e);
  (iii) show that if $\beta = 0$, $\dfrac{\hat\beta_1 \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}_1/(n-1)}}$ has the same distribution as $T$ (defined in (f)).
(h) Suppose that each observation $y_i$ has been mis-recorded as $y_i + 10$. Based on these misrecorded responses,
  (i) show that the least squares estimate of $\beta$ is $\hat\beta_2 = \hat\beta$, where $\hat\beta$ is defined as in (c);
  (ii) show that the residual sum of squares $\mathrm{SSE}_2$ is $\mathrm{SSE} + 20 \sum_{i=1}^n y_i + 100n$, where SSE is defined as in (e);
  (iii) show that if $\beta = 0$, $\dfrac{\hat\beta_2 \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}_2/(n-1)}}$ does not have the same distribution as $T$ (defined in (f)).
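The identities in parts (b), (g) and (h) can be checked numerically before attempting the proofs. The following is a minimal sketch, not a solution: the data values and the function name `fit_through_origin` are hypothetical, and $T$ is read as $\hat\beta\sqrt{\sum_j z_j^2}/\sqrt{\mathrm{SSE}/(n-1)}$, which is an interpretation of the garbled expression in part (f).

```python
import math

def fit_through_origin(z, y):
    """Least squares fit of y = beta * z + error (no intercept).

    Returns (beta_hat, sse, t_stat), where t_stat follows the reading of
    T in part (f) assumed in the lead-in.
    """
    szz = sum(zi * zi for zi in z)
    beta_hat = sum(zi * yi for zi, yi in zip(z, y)) / szz  # formula in part (b)
    sse = sum((yi - beta_hat * zi) ** 2 for zi, yi in zip(z, y))
    t_stat = beta_hat * math.sqrt(szz) / math.sqrt(sse / (len(z) - 1))
    return beta_hat, sse, t_stat

# Hypothetical data; the z values sum to zero, as the question requires.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.3, 2.1, 2.9, 4.2, 4.8]
n = len(y)

beta, sse, t = fit_through_origin(z, y)
beta1, sse1, t1 = fit_through_origin(z, [10 * yi for yi in y])  # part (g)
beta2, sse2, t2 = fit_through_origin(z, [yi + 10 for yi in y])  # part (h)

# Part (g): ratios should be 10 and 100, and T should be unchanged.
print(beta1 / beta, sse1 / sse, t1 - t)
# Part (h): the first two differences should be about zero, the last should not.
print(beta2 - beta, sse2 - (sse + 20 * sum(y) + 100 * n), t2 - t)
```

Rescaling leaves $T$ unchanged because the factor 10 cancels between numerator and denominator, whereas the shift inflates the SSE without changing $\hat\beta$.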
SMSL/2015
THE UNIVERSITY OF HONG KONG
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE
STAT2301/STAT3600 LINEAR STATISTICAL ANALYSIS
Assignment 2

1. Let $[Y_1, Y_2]^\top$ be a bivariate random vector with
$E([Y_1, Y_2]^\top) = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $\mathrm{Var}([Y_1, Y_2]^\top) = \begin{bmatrix} 5 & 1 \\ 1 & 2 \end{bmatrix}$.
Define $Z = [Z_1, Z_2, Z_3]^\top = [\,Y_1 + 2Y_2 - 2,\ 2Y_1 - Y_2 + 1,\ Y_1 - 4Y_2 - 4\,]^\top$.
(a) Calculate $E[Z]$.
(b) Calculate $\mathrm{Var}(Z)$. Hence deduce the variances of $Z_1, Z_2, Z_3$ and the covariance between $Z_1$ and $Z_3$.

2. Let
$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} \sim N_3\!\left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 3 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix} \right)$.
(a) What is the distribution
  (i) of $Y_1 + Y_2 + Y_3$?
  (ii) of $\begin{bmatrix} Y_2 + Y_3 \\ Y_1 - Y_2 - Y_3 \end{bmatrix}$?
(b) Explain why $Y_3$ is independent of $[Y_1, Y_2]^\top$.

3. Suppose $Y \sim N_n(\mu, \Sigma)$ with $\Sigma$ nonsingular. Let $A$ be a $d \times n$ constant matrix of rank $d$.
(a) Determine the distribution of $A(Y - \mu)$.
(b) Determine the distribution of $(A \Sigma A^\top)^{-1/2} A (Y - \mu)$.
(c) Using (b) or otherwise, show that $(Y - \mu)^\top A^\top (A \Sigma A^\top)^{-1} A (Y - \mu) \sim \chi^2_d$.

4. Suppose $Y \sim N_n(\mu, \Sigma)$, and there exists an $m \times n$ matrix $Q$ such that $Q \Sigma Q^\top = I_m$ and $Q\mu = 0$. Define $Z = QY$.
(a) What is the distribution of $Z$?
(b) What is the distribution of $\|Z\|^2 = Z^\top Z$?

5. Suppose $Y \sim N_3(0, I_3)$. Let 1 0 A_1 = 2 1 1 3 5 and A_2 = 3 . 1 1
(a) Calculate $B_i = A_i (A_i^\top A_i)^{-1} A_i^\top$ for $i = 1, 2$.
(b) Using Cochran's Theorem or otherwise, deduce the distributions of $Y^\top B_1 Y$ and $Y^\top B_2 Y$. Are they independent?

6. Let $Y \sim N_3(0, \sigma^2 I_3)$ be a multivariate normal random vector. Define
$A = \dfrac{1}{3} \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix}$.
What is the distribution of $Y^\top A Y$? [Hint: apply Cochran's Theorem to matrices $A$ and $I_3 - A$.]

7. Let $X_1, \ldots, X_k$ be $n \times p_1, \ldots, n \times p_k$ matrices respectively, with $p_1 + \cdots + p_k$ [...] y), (iv) P(Z/ X t), (v) P(|Z| > t X).
(c) Calculate the 2.5% upper quantile of Z/ X. Show that its square equals y. Explain.

9. In a study of the relationship between an explanatory variable $Z$ and a response variable $Y$, $n$ independent observations are taken in the form of $(y_i, z_i)$, $i = 1, 2, \ldots, n$, for the variable pair $(Y, Z)$. Assume that $z_1, z_2, \ldots, z_n$ are nonrandom constants with $\sum_{i=1}^n z_i = 0$, and
$Y = \beta Z + \epsilon$, $\epsilon \sim N(0, \sigma^2)$,
for some unknown parameters $\beta$ and $\sigma$.
(a) Write down the design matrix $X$ for the above model.
(b) Show that $(X^\top X)^{-1} X^\top [y_1, \ldots, y_n]^\top = \dfrac{\sum_{j=1}^n z_j y_j}{\sum_{j=1}^n z_j^2}$. What is the use of the above formula?
(c) Let $\hat\beta$ be the least squares estimator of $\beta$. Show that $\hat\beta$ is normally distributed with mean $\beta$ and variance $\sigma^2 / \sum_{j=1}^n z_j^2$.
(d) Define the residual $\hat\epsilon_i$ for the $i$th observation.
(e) What is the distribution of the residual sum of squares $\mathrm{SSE} = \sum_{i=1}^n \hat\epsilon_i^2$? Is SSE independent of $\hat\beta$?
(f) Using (c) and (e), find the distribution of $T = \dfrac{\hat\beta \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}/(n-1)}}$ if $\beta = 0$.
(g) Suppose that each observation $y_i$ has been mis-recorded as $10 y_i$. Based on these misrecorded responses,
  (i) show that the least squares estimate of $\beta$ is $\hat\beta_1 = 10\hat\beta$, where $\hat\beta$ is defined as in (c);
  (ii) show that the residual sum of squares $\mathrm{SSE}_1$ is $100\,\mathrm{SSE}$, where SSE is defined as in (e);
  (iii) show that if $\beta = 0$, $\dfrac{\hat\beta_1 \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}_1/(n-1)}}$ has the same distribution as $T$ (defined in (f)).
(h) Suppose that each observation $y_i$ has been mis-recorded as $y_i + 10$. Based on these misrecorded responses,
  (i) show that the least squares estimate of $\beta$ is $\hat\beta_2 = \hat\beta$, where $\hat\beta$ is defined as in (c);
  (ii) show that the residual sum of squares $\mathrm{SSE}_2$ is $\mathrm{SSE} + 20 \sum_{i=1}^n y_i + 100n$, where SSE is defined as in (e);
  (iii) show that if $\beta = 0$, $\dfrac{\hat\beta_2 \sqrt{\sum_j z_j^2}}{\sqrt{\mathrm{SSE}_2/(n-1)}}$ does not have the same distribution as $T$ (defined in (f)).

10. The following dataset contains $n$ independent observations of the $d+1$ variables $(Y, X_1, \ldots, X_d)$.

  Y      X_1     X_2     ...   X_d
  Y_1    x_11    x_12    ...   x_1d
  Y_2    x_21    x_22    ...   x_2d
  ...    ...     ...     ...   ...
  Y_n    x_n1    x_n2    ...   x_nd

For model-fitting purposes, the variable $Y$ is treated as the response and $X_1, \ldots, X_d$ as explanatory variables. A general linear model is proposed for $Y$ such that
$Y_i = \beta_1 x_{i1} + \cdots + \beta_d x_{id} + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$.
(a) Define the design matrix $X$ for the model, and the corresponding parameter vector $\beta$.
(b) Find the distribution of the least squares estimate $\hat\beta = [\hat\beta_1, \ldots, \hat\beta_d]^\top$ of $\beta$.
(c) It is known that the residual sum of squares $\mathrm{SSE} = \sum_{i=1}^n (Y_i - \hat\beta_1 x_{i1} - \cdots - \hat\beta_d x_{id})^2$ follows the $\sigma^2 \chi^2_{n-d}$ distribution. Describe how you would construct a $100(1-\alpha)\%$ prediction interval for the difference between two individual responses, one observed at $(X_1, \ldots, X_d) = (b_1, \ldots, b_d)$ and the other at $(X_1, \ldots, X_d) = (c_1, \ldots, c_d)$.
(d) Suppose it is known further that $\beta$ satisfies q( [...] 0. Would you reject it at the 5% level?
(g) Calculate a 95% two-sided confidence interval for $\sigma^2$.
(h) Calculate a 95% one-sided confidence interval for $\sigma^2$ in the form of $[L, \infty)$.

12. Consider the multiple linear regression model
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, $i = 1, 2, \ldots, 6$.
In an experiment, the following data are observed:

  i       1    2    3    4    5    6
  x_i1    2    1    0    0    1    2
  x_i2    4    2    1    1    2    4
  Y_i     8    5    1    2    6    10

(a) Find the least-squares estimates of $\beta_0, \beta_1, \beta_2$.
(b) Compile an ANOVA table for the multiple linear regression, in which the total corrected sum of squares is partitioned into a regression sum of squares and a residual sum of squares.
(c) Calculate the coefficient of determination $R^2$. Comment on the fit of the model.
(d) Construct a 95% confidence interval for $\beta_2$.
(e) Construct a 95% confidence interval for $2\beta_0 + \beta_1 - \beta_2$.
(f) Construct simultaneous confidence intervals for $\beta_2$ and $2\beta_0 + \beta_1 - \beta_2$, so that they have a family confidence level at least 95%, using either Bonferroni's or Scheffé's methods, whichever is more precise.
(g) Construct a 95% prediction interval for a future response at $X_1 = 0$ and $X_2 = 2$.
(h) Construct a 95% prediction interval for the difference between two future responses, one taken at $X_1 = X_2 = 2$ and the other taken at $X_1 = 1$ and $X_2 = 3$.
(i) Suppose we wish to test the null hypothesis that $2\beta_0 + \beta_1 - \beta_2 = 2$ based on an $F$-test of size 5%. Calculate the p-value for the test. Do you have evidence against the null hypothesis?
(j) Using your answer to (e), suggest an alternative method for doing the test in (i).
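For the prediction intervals requested in Questions 10(c) and 12(h), the standard result for the difference of two future responses can be written out as below. This is a sketch under stated assumptions rather than a model answer: the two future errors are taken to be independent of each other and of the fitted data, $b$ and $c$ denote the two covariate vectors (with a leading 1 when the model has an intercept), and $Y_b$, $Y_c$, $p$ (the number of regression coefficients) and $t_{n-p,\,\alpha/2}$ (the upper $\alpha/2$ point of $t_{n-p}$) are notation introduced here.

```latex
\begin{align*}
Y_b - Y_c - (b - c)^\top \hat\beta
  &\sim N\!\left(0,\ \sigma^2\left[\,2 + (b - c)^\top (X^\top X)^{-1} (b - c)\right]\right), \\[4pt]
\frac{Y_b - Y_c - (b - c)^\top \hat\beta}
     {\sqrt{\dfrac{\mathrm{SSE}}{n - p}\left[\,2 + (b - c)^\top (X^\top X)^{-1} (b - c)\right]}}
  &\sim t_{n-p}, \\[4pt]
\text{interval:}\quad
(b - c)^\top \hat\beta
  &\pm\ t_{n-p,\,\alpha/2}
   \sqrt{\frac{\mathrm{SSE}}{n - p}\left[\,2 + (b - c)^\top (X^\top X)^{-1} (b - c)\right]}.
\end{align*}
```

The constant 2 comes from the two independent future errors, each contributing $\sigma^2$. In Question 10 the model has no intercept, so $p = d$; in Question 12, $p = 3$ and $n = 6$.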
*****************************************************
Note: In the following questions, random errors can be assumed to be i.i.d. under $N(0, \sigma^2)$.

13. A dataset records the per capita income of 20 European OECD countries for the year 1960, as well as the percentages of the labor force employed in agriculture, industry, and services for each country. There are 4 variables:
  PCINC = Per capita income, 1960 ($),
  AGR = Percent of labor force in agriculture, 1960,
  IND = Percent of labor force in industry, 1960,
  SER = Percent of labor force in service occupations, 1960.
An EXCEL spreadsheet containing the above data is available on Moodle.
(a) Draw scatterplots of PCINC against each of the other three variables. Describe any observable trends in your plots.
(b) A multiple linear regression model is fitted to the data by regressing PCINC on AGR, IND and SER.
  (i) Calculate the least squares estimates of the three regression coefficients. How are their signs related to your visual observations in (a)?
  (ii) Compile an ANOVA table for the multiple linear regression with three items: regression, residual and total corrected sums of squares.
  (iii) What is the p-value for the regression? What is your conclusion about the significance of the variables AGR, IND and SER?
(c) A reduced model is defined by deleting one of the three explanatory variables from the full multiple linear regression model. What is the p-value for the F test of this reduced model against the full model if
  (i) the deleted variable is AGR?
  (ii) the deleted variable is IND?
  (iii) the deleted variable is SER?
Do your conclusions contradict that given in (b)(iii)? Why?
(d) To assess the collective significance of IND and SER in the multiple linear regression model, one may compare the simple linear regression model which regresses PCINC on AGR with the full multiple linear regression model including all 3 explanatory variables.
  (i) Compile an ANOVA table relevant to the above comparison.
  (ii) Calculate the p-value for the F test of the simple linear regression model against the full multiple linear regression model. What is your conclusion about the significance of the variables IND and SER?
  (iii) Is the variable AGR significant in the simple linear regression model?
  (iv) Based on your conclusions in (d)(ii) and (d)(iii), decide on the optimal model which you think is most suited to the data.
  (v) Calculate the least squares estimate of, and construct a 95% confidence interval for, the regression coefficient associated with AGR based on the optimal model given in (d)(iv). Is your answer consistent with the trend observed in (a)?

14. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. The data set consists of 30 samples of mature cheddar cheese. Observations were made on 4 variables:
  Taste = subjective taste test score,
  Acetic = natural logarithm of concentration of acetic acid,
  H2S = natural logarithm of concentration of hydrogen sulfide,
  Lactic = concentration of lactic acid.
An EXCEL spreadsheet containing the above data is available on Moodle.
It is well accepted that the chemicals 'H2S' and 'lactic acid' contribute significantly to the good taste of cheddar cheese. To investigate whether 'acetic acid' also affects the taste of cheddar cheese, two multiple linear regression models were fitted to the 'taste' data, yielding the following results:

                             Reduced model    Full model
  Explanatory variables      H2S, lactic      acetic, H2S, lactic
  Residual sum of squares    2668.97          2668.41

The total corrected sum of squares was calculated to be 7662.89.
(a) Fill in the following ANOVA table, which has columns s.s., d.f., m.s. and F-ratio, and whose blank entries are marked "?":

  Source
  Regression fitting reduced model
  Extra
  Residual fitting full model
  Total
(b) Calculate the p-value associated with the significance of 'acetic acid'.
(c) Do you think 'acetic acid' should be included in the multiple linear regression model? Justify your answer by testing for the significance of 'acetic acid'
  (i) at the 5% level,
  (ii) at the 10% level.
(d) Regressing 'taste' on 'acetic acid' alone yields a residual sum of squares equal to 5348.75. Test at the 5% level for the significance of 'acetic acid' under this simple linear regression model. Does your conclusion contradict that given in (c)? Comment.

15. Researchers interviewed 9 subjects to investigate the relationship among blood pressure, diet, and fitness. Four variables were measured for each subject:
  BP = Number of points above normal diastolic blood pressure,
  Over Wgt = Number of kilograms overweight,
  Fats = Average number of grams saturated fatty foods consumed per day,
  Exercise = Average number of minutes of exercise per day.
An EXCEL spreadsheet containing the above data is available on Moodle.
(a) Formulate a multiple linear regression model for the dataset, using BP as the response and the remaining three factors as regressors.
(b) Obtain scatterplots of blood pressure against each of the three factors.
(c) Calculate the $R^2$ statistic for the model. Do you think the model is adequate to explain the variation of blood pressure among the subjects under study?
(d) Calculate lse's for the regression coefficients and their respective standard errors.
(e) Construct 95% confidence intervals for the regression coefficients. Specify precisely your assumptions which guarantee accuracy of these intervals.
(f) Construct a 95% prediction interval for the difference in blood pressure between two individuals, one being 1 kg overweight, consuming 9 g of saturated fatty foods and exercising for 5 minutes on average per day, while the other being 5 kg overweight, consuming 1 g of saturated fatty foods and exercising for 1 hour on average per day.
What will be your advice for an overweight person based on the above interval?
(g) Test at 5% level for significance of the three factors collectively. What would your answer be if you choose to test at 1% level instead of 5%?

16. This question refers to a dataset containing the price, area in square feet, acres, numbers of rooms and baths of 150 randomly-selected houses. An EXCEL spreadsheet containing the above data is available on Moodle.
(a) Produce scatterplots of price against each of the four factors. What do you observe?
(b) A multiple linear regression model is proposed as a full model for the data, using "price" as the response and the other factors as the explanatory variables. Calculate the $F$-ratio for testing a reduced model which is obtained by removing the variable "number of rooms" from the full model. Does the number of rooms have a significant effect on house price?
(c) It is argued that when fitting a multiple linear regression model to the data, using "price" as the response and the other factors as the explanatory variables, the intercept term $\beta_0$ should be set to zero. Is this argument reasonable? Why?
Based on the no-intercept model described in (c), answer questions (d)-(i) below.
(d) Calculate the lse's of the regression coefficients.
(e) Construct a 95% prediction interval for the price of a house of 2,000 square feet in area, 0.5 acres, 8 rooms and 2 baths.
(f) Construct simultaneous confidence intervals for the four regression coefficients, of family confidence level at least 95%. Explain your choice of method for constructing these intervals.
(g) Calculate the $F$-ratio for testing a reduced model which is obtained by removing the variable "number of rooms" from the full no-intercept model. Does the number of rooms have a significant effect on house price? Why is your answer not consistent with that obtained in (b)?
(h) A reduced model is proposed in which house price depends only on the total area (defined as AREA + 43562 × ACRES since 1 acre = 43562 square feet), as well as on its total number of rooms and baths (defined as BATHS + ROOMS). Describe how you can represent the above reduced model by imposing two extra constraints on the regression coefficients of the no-intercept model.
(i) Conduct a hypothesis test to test if the reduced model proposed in (h) is adequate to explain the variation in house prices. You should specify clearly the null and alternative hypotheses, and calculate the p-value for the test.

17. The 1991 National Hospital Discharge Survey reports the numbers (in thousands) of heart surgeries, including bypass and angioplasty, for the years 1980 to 1991 as follows:

  Year    Surgeries (in thousands)
  1980    196
  1981    217
  1982    243
  1983    275
  1984    314
  1985    379
  1986    490
  1987    588
  1988    674
  1989    719
  1990    781
  1991    839

(a) Sketch a scatterplot of the "number of surgeries" (in thousands) against "year".
(b) A cubic polynomial regression model is suggested for the observed data, treating "number of surgeries" (in thousands) as the response and "year" as the explanatory variable.
  (i) Formulate the model and state clearly the assumptions involved.
  (ii) Describe the design matrix for the cubic model.
  (iii) Suppose that the residual and total corrected sums of squares are found to be SSE = 2574.661 and $S_{yy}$ = 611610.250 respectively when fitting the cubic model to the data. Calculate the coefficient of determination $R^2$ for the cubic model. Does it provide a good fit?
(c) It is hypothesized that a simple linear regression model may be adequate to explain the observed data.
  (i) Formulate the simple linear regression model.
  (ii) Fitting the simple linear regression model to the data yields a residual sum of squares of SSE = 19142.976. Calculate the coefficient of determination $R^2$ for this model. Does it provide a good fit?
(d) Would you adopt the simple linear regression model in place of the cubic polynomial regression model? Justify your answer by means of a 5% significance test.

18. Data are collected on the average January minimum temperature (JanTemp) in degrees Fahrenheit with the latitude (Lat) and longitude (Long) of 56 U.S. cities. The longitude is measured in degrees west of the prime meridian, and the latitude is measured in degrees north of the equator. An EXCEL spreadsheet containing the above data is available on Moodle.
A multiple linear regression model is proposed for the response variable JanTemp as follows:
$\text{JanTemp} = \beta_0 + \beta_1 \text{Long} + \beta_2 \text{Long}^2 + \beta_3 \text{Long}^3 + \beta_4 \text{Long}^4 + \beta_5 \text{Lat} + \epsilon$, $\epsilon \sim N(0, \sigma^2)$,
for unknown parameters $\beta_0, \beta_1, \ldots, \beta_5, \sigma$.
(a) With reference to the scatterplots below, describe how JanTemp is related to Lat and Long.
[Figure: two scatterplots, JanTemp against Lat and JanTemp against Long]
(b) Calculate the residual sum of squares fitting the model to the data, and determine its associated degrees of freedom.
(c) Calculate the regression sum of squares fitting the model to the data, and determine its associated degrees of freedom.
(d) Calculate the coefficient of determination $R^2$. Do you think that the model is adequate to explain the variability in these data?
(e) Reduced models are set up by deleting powers of Long successively. Fill in the residual sums of squares fitting the respective reduced models below:

  Variables deleted                     Residual sum of squares
  Long^4                                ?
  Long^4, Long^3                        ?
  Long^4, Long^3, Long^2                ?
  Long^4, Long^3, Long^2, Long          ?

  (i) Test at the 5% level whether $\text{Long}^4$ is significant in the model.
  (ii) Test at the 5% level whether $\text{Long}^4$ and $\text{Long}^3$ are collectively significant in the model.
  (iii) Deduce from your answers to (i) and (ii) above a reduced model which is adequate to explain the variability in these data.
(f) Calculate the least squares estimates of $\beta_0, \beta_1, \ldots, \beta_5$ under the model.
(g) Calculate a 95% prediction interval for the average January minimum temperature of New York in the future based on the model.

19. A dataset concerns the price per capita of beef annually from 1925 to 1941 together with other variables relevant to an economic analysis of the price of beef. It contains the following variables:
  YEAR = Year to which the data refer;
  PFO = Retail food price index;
  DINC = Disposable income per capita index;
  CFO = Food consumption per capita index;
  RDINC = Index of real disposable income per capita;
  RFP = Retail food price index adjusted by the CPI;
  PBE = Price of beef (cents/lb).
An EXCEL spreadsheet containing the above data is available on Moodle.
A multiple linear regression model is proposed to describe the relationship between the response variable PBE and the other 6 explanatory variables (YEAR, PFO, DINC, CFO, RDINC, RFP). An agriculturalist believes, however, that the variation in PBE can be adequately explained by the variable CFO alone, and hence proposes a simple linear regression model for the data.
(a) Specify the multiple and simple linear regression models, and state the model assumptions clearly.
(b) Calculate the residual sums of squares fitting the multiple and simple linear regression models respectively.
(c) Calculate the least squares estimates of the regression coefficients under the model.
(d) Suppose that we predict the explanatory variables to have the following figures in the year 2015:

  PFO      DINC     CFO      RDINC    RFP
  200.0    200.0    200.0    220.0    2000.0

Calculate a predicted value of PBE in 2015 based on the multiple linear regression model. Comment on the reliability of your answer.
(e) Compile an ANOVA table for a hypothesis test of the simple linear regression model against the multiple linear regression model.
(f) Test at the 5% significance level whether the agriculturalist's belief is correct.
Define models 1, ..., 5 as follows:

  Model    Expression for E[PBE]
  1        $\beta_0 + \beta_1\,\text{CFO} + \beta_2\,\text{YEAR} + \beta_3\,\text{CFO} \times \text{YEAR}$
  2        $\beta_0 + \beta_1\,\text{CFO} + \beta_2\,\text{PFO} + \beta_3\,\text{CFO} \times \text{PFO}$
  3        $\beta_0 + \beta_1\,\text{CFO} + \beta_2\,\text{DINC} + \beta_3\,\text{CFO} \times \text{DINC}$
  4        $\beta_0 + \beta_1\,\text{CFO} + \beta_2\,\text{RDINC} + \beta_3\,\text{CFO} \times \text{RDINC}$
  5        $\beta_0 + \beta_1\,\text{CFO} + \beta_2\,\text{RFP} + \beta_3\,\text{CFO} \times \text{RFP}$

Five hypothesis tests are carried out to test the simple linear regression model against model $i$, for $i = 1, \ldots, 5$, respectively.
(g) The above five hypothesis tests are useful for detecting certain effects concerning the explanatory variables. What are these effects?
(h) Calculate the $F$-ratios and their corresponding p-values for the five tests, respectively.
(i) After studying the p-values obtained in (h), it is recommended that among the five choices, models 1, ..., 5, one should select model 1 for subsequent analysis. Give a reason for this choice of model.
(j) Based on the fitted model 1, plot four fitted regression lines on the same diagram to display the relationships between PBE and CFO in the years 1925, 1930, 1935 and 1940, respectively. Comment on the changes in the relationships between PBE and CFO across the period 1925-1940, i.e. the years leading to the Second World War.
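Several questions above (14, 17 and 19, for instance) compare a reduced model with a full model through their residual sums of squares. The arithmetic behind the coefficient of determination and the extra-sum-of-squares $F$-ratio can be sketched as follows, using only figures quoted in Questions 14 and 17. The function names are hypothetical, and the degrees of freedom assume that every model includes an intercept, which the assignment does not state explicitly for Question 14.

```python
def r_squared(sse, syy):
    """Coefficient of determination: 1 - SSE / S_yy."""
    return 1.0 - sse / syy

def extra_ss_f_ratio(sse_reduced, sse_full, df_extra, df_resid_full):
    """F-ratio for testing a reduced model against a full model.

    Numerator: extra sum of squares per extra parameter.
    Denominator: residual mean square of the full model.
    """
    extra_ms = (sse_reduced - sse_full) / df_extra
    resid_ms = sse_full / df_resid_full
    return extra_ms / resid_ms

# Question 17: 12 yearly observations; the cubic model has 4 coefficients
# and the simple linear model has 2 (assumption: both include an intercept).
n17 = 12
r2_cubic = r_squared(2574.661, 611610.250)
r2_linear = r_squared(19142.976, 611610.250)
f17 = extra_ss_f_ratio(19142.976, 2574.661, df_extra=4 - 2, df_resid_full=n17 - 4)

# Question 14: 30 samples; the full model has 3 regressors plus an intercept
# (assumption), and dropping 'acetic' removes one parameter.
n14 = 30
f14 = extra_ss_f_ratio(2668.97, 2668.41, df_extra=1, df_resid_full=n14 - 4)

print(round(r2_cubic, 4), round(r2_linear, 4), round(f17, 2), round(f14, 4))
```

Each $F$-ratio is then referred to the $F$ distribution with `df_extra` and `df_resid_full` degrees of freedom to obtain a p-value or to compare against a tabulated critical value.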