Question

Posted on Jul 08, 2024

Homework 5 (Due Date: Sunday Oct 17, 2021 11:59 PM)

Make sure you always include your "R-work" (the R code you used) and/or "hand work" (calculations made by hand) and NOT just your answers. Failing to provide your work might result in you losing points.

On April 20th your supervisor comes to you and suggests that one can predict a student's productivity in the next month from his/her productivity in the month prior to the current month. You suggest that you can test this idea using simple linear regression. To do this, you pull productivity data on a student from January and see if it can predict the March productivity.

1. Using "Jan Mar.csv" provided on Brightspace, consider each row as a different worker's productivity data. The first column is labelled January and holds the January productivity data. The other 26 columns are each labelled with a letter, corresponding to the first letter of a first name. You only need to pick the column that has the same letter as your first name; this is your "y" or response data for the first part of this homework. For example, a student named Mary would choose column 'M'. So, you should have 2 columns (January and 'M') in your data frame. Rename your first-name-initial column "March".
   a) Plot March vs. January using a scatterplot with a regression line (you can use Microsoft Word to estimate the line). Does a linear relationship appear to be appropriate in this case?
   b) Run a simple linear regression of March ("y") vs. January ("predictor"), and store the residuals as 'Resid.1'.
   c) Plot the diagnostics for the linear regression model and plot the residuals vs. observation order. Using those plots, indicate whether you believe this regression satisfies the assumptions regarding (a) normality of residuals, (b) homoscedasticity, (c) linearity, and (d) independence of residuals. Be sure to indicate, for each of (a) through (d), which plot you are using, what you are looking for, and what you see in that plot.
      (Do NOT simply write that the assumptions appear satisfied or do not appear satisfied; tell us how you came to that conclusion.)
   d) Use R to get a summary of your linear regression and append that output to your homework.
      i. Is the regression significant? Why or why not?
      ii. Is β0 significant? Why or why not?
      iii. Is β1 significant? Why or why not?
      iv. Interpret the R² value.
      v. In your own words, explain what the regression model tells you about whether one can predict the productivity of a student based on past performance.
      vi. Explain any unusual observations flagged by R. Why does R consider them unusual? Should something be done about them?
      vii. Identify what your model suggests the March productivity would be for a student who had a January productivity of 70.

2. Your supervisor then suggests that student productivity could instead be related to the number of months on the job. In the "MonthsonJob.csv" file, you will find a column of data titled "MonthsonJob".
   a) Plot MonthsonJob vs. March using a scatterplot with a regression line. Indicate whether a linear relationship appears to be appropriate in this case.
   b) Run a simple linear regression of MonthsonJob vs. March and store the residuals as 'Resid.2'. Plot the diagnostics for the linear regression model.
   c) Plot the residuals of the simple linear regression model against observation order. Using the above plots, indicate whether the assumptions of the regression appear appropriate.
   d) Now take the square root of MonthsonJob using R and name it Sqrt_Months_on_the_Job. Using Sqrt_Months_on_the_Job vs. March, re-run the regression and redo parts a-c. Indicate whether the transformed model is better in some way than the non-transformed model, and identify why you think it is better (or not).

3. Lastly, the workers' union representative suggests that worker on-the-job injuries might be predicted by productivity. That is, if workers work too hard (high productivity), they tend to get injured.
   In the "Injuries.csv" file, you will find a column with the initial letter of your name as its header. That data is binary, Yes (1) or No (0), indicating whether the worker had an on-the-job injury in March. Select the column with your initial at the top, add it to your March and January data, and name that column "Injured".
   a) Plot Injured vs. March using a scatterplot. Indicate whether a binary logistic relationship appears to be appropriate in this case. If so, indicate at what level of productivity the likelihood of injury increases.
   b) Run a binary logistic regression of Injured vs. March. Indicate whether the regression is significant or not, and explain why you think that is the case.
   d) Regardless of your answer to 3c, what does your regression suggest is the chance of injury if the productivity is 100? Be sure to include a confidence interval on your estimate. (Hint: Use the prediction on the binary logistic regression.)

4. The regression equation is Y = 26.8 + 1.48x

   Predictor   Coef      SE Coef   P
   Constant    26.753    2.373
   X           1.4756    0.1063

   S = 2.70040   R-Sq = 93.7%   R-Sq(adj) = 93.2%

   Analysis of Variance
   Source            DF    SS      MS
   Regression
   Residual Error          94.8    7.3
   Total             15    1500

   a) Fill in the missing information. You may use bounds for the P-values.
   b) Can you conclude that the model defines a useful linear relationship?
   c) What is your estimate of σ²?

5. An article in the Journal of the American Statistical Association ["Markov Chain Monte Carlo Methods for Computing Bayes Factors: A Comparative Review" (2001, Vol. 96, pp. 1122-1132)] analysed the tabulated data on compressive strength parallel to the grain versus resin-adjusted density for specimens of radiata pine.
   a) Fit a regression model relating compressive strength to density.
   b) Test for significance of regression with α = 0.05.
   c) Estimate σ² for this model.
   d) Calculate R² for this model. Provide an interpretation of this quantity.
   e) Find 95% confidence intervals for the slope, the intercept, and the prediction when density = 40.
   f) Prepare a normal probability plot of the residuals and interpret this display.
   g) Plot the residuals versus ŷ and versus x. Does the assumption of constant variance seem to be satisfied?

6. EXTRA CREDIT (1 point) Use R's nonlinear regression function to determine if there is a nonlinear model that is better than the linear model. To do so, select four (4) different nonlinear models and test them using the nonlinear regression function. (Select four that seem reasonable, either manually entered or chosen from the catalog.) Compute the R-Sq value of each model. (You'll have to do this by hand using the definition of R-sq.) Report on the R-Sq values of the nonlinear models as compared to the linear model, and identify the model with the best (highest) R-Sq value.

   Compressive              Compressive
   Strength      Density    Strength      Density
   3040          29.2       3840          30.7
   2470          24.7       3800          32.7
   3610          32.3       4600          32.6
   3480          31.3       1900          22.1
   3810          31.5       2530          25.3
   2330          24.5       2920          30.8
   1800          19.9       3990          38.9
   3110          27.3       1670          22.1
   3160          27.1       3310          29.2
   2310          24.0       3450          30.1
   4360          33.8       3600          31.4
   1880          21.5       2850          26.7
   3670          32.2       1590          22.1
   1740          22.5       3770          30.3
   2250          27.5       3850          32.0
   2650          25.6       2480          23.2
   4970          34.5       3570          30.3
   2620          26.2       2620          29.9
   2900          26.7       1890          20.8
   1670          21.1       3030          33.2
   2540          24.1       3030          28.2
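
The steps in Question 5 and the extra-credit comparison can be sketched in base R as follows. This is only a minimal illustration, not a worked solution: it uses just ten strength/density pairs taken from the data listing above, so its numbers will differ from those for the full data set. The names `pine`, `fit`, `nl_fit`, and `plot_residual_checks` are hypothetical, and the power-law form `a * density^b` is simply one assumed candidate for a nonlinear model.

```r
# Ten (strength, density) pairs copied from the listing above.
# Assumption: the remaining specimens would be entered the same way.
pine <- data.frame(
  strength = c(2470, 3610, 3480, 3810, 2330, 1800, 3110, 3160, 2310, 4360),
  density  = c(24.7, 32.3, 31.3, 31.5, 24.5, 19.9, 27.3, 27.1, 24.0, 33.8)
)

# 5a) Simple linear regression of strength on density.
fit <- lm(strength ~ density, data = pine)
fit_summary <- summary(fit)

# 5b) F test for significance of regression (compare p_value with 0.05).
f_stat  <- fit_summary$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

# 5c) Estimate of sigma^2: residual mean square, SSE / (n - 2).
sse        <- sum(residuals(fit)^2)
sigma2_hat <- sse / df.residual(fit)

# 5d) R-squared from its definition, 1 - SSE / SST.
sst       <- sum((pine$strength - mean(pine$strength))^2)
r_squared <- 1 - sse / sst

# 5e) 95% intervals for the coefficients and for a prediction at density = 40.
coef_ci <- confint(fit, level = 0.95)
pred_40 <- predict(fit, newdata = data.frame(density = 40),
                   interval = "prediction", level = 0.95)

# 5f, 5g) Normal probability plot, then residuals vs. fitted values and vs. x.
plot_residual_checks <- function(model, x) {
  res <- residuals(model)
  qqnorm(res); qqline(res)
  plot(fitted(model), res, xlab = "Fitted strength", ylab = "Residual")
  abline(h = 0, lty = 2)
  plot(x, res, xlab = "Density", ylab = "Residual")
  abline(h = 0, lty = 2)
}
if (interactive()) plot_residual_checks(fit, pine$density)

# 6) One assumed nonlinear candidate, strength = a * density^b, fitted with nls().
# Starting values come from a log-log linear fit so the iterations begin
# close to a sensible solution.
start_fit <- lm(log(strength) ~ log(density), data = pine)
nl_fit <- nls(strength ~ a * density^b, data = pine,
              start = list(a = exp(coef(start_fit)[[1]]),
                           b = coef(start_fit)[[2]]))

# R-squared "by hand" for the nonlinear fit, using the same definition as 5d.
r_squared_nl <- 1 - sum(residuals(nl_fit)^2) / sst
```

Printing `fit_summary`, `coef_ci`, and `pred_40` gives the quantities asked for in parts a) through e). Comparing `r_squared` with `r_squared_nl` shows how one candidate model could be ranked against the linear fit, and the same pattern would be repeated for each of the other candidate models.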