Question

1 Approved Answer

Posted on Oct 13, 2024

STAT*2040 Statistics I
G. Umphrey
Graded Assignment #3
Fall 2016

This assignment will demonstrate how R can be used to very efficiently obtain analyses for certain statistical methods that we cover near the end of STAT*2040, namely one-way ANOVA and simple linear regression analysis. This assignment has been designed to deliver high educational value in a reasonably short time. Instructions on what to submit via Crowdmark will be sent to your University of Guelph email account. All of the data sets are stored as csv (comma separated values) in Excel and are ready to use after you download them. The data sets come from Hand et al.'s A Handbook of Small Data Sets.

PART A: A Bit on t Tests

One-sample, two-sample and paired t tests are not that difficult to conduct by hand (i.e. with a calculator) for small data sets, but using a computer reduces the chance of making an error, calculates p-values, is certainly easier for larger data sets, and provides very accessible graphical methods. A nice summary of the t tests in R is provided by Robert I. Kabacoff on his excellent Quick-R website; you can find this readily by googling "Quick-R t tests" or through the URL http://www.statmethods.net/stats/ttest.html

The t-test for two independent samples can analyze data that is structured in either "unstacked" or "stacked" formats. In the "unstacked" data format the samples are side-by-side; this is the way you most commonly see a data table in a book. In the "stacked" data format each observation is on a separate line so that the response variable values are in a single column, and another column codes which group each observation belongs to. The paired t-test only takes data in the unstacked format.

An example of a paired data experiment is introduced by Hand et al. as follows: "People have higher levels of beta-endorphin in the blood under conditions of emotional stress.
For 19 patients scheduled to undergo surgery, blood samples were taken (a) 12-14 hours before surgery and (b) 10 minutes before surgery. Beta-endorphin levels were measured in fmol/ml (a femto-mole is 10^-15 grams times the molecular weight of the subject.)"

Load the "Presurgical Stress" data set into R and conduct the required paired t-test to test if people do indeed produce higher levels of beta-endorphin in the blood just before surgery. To load the data set into R from your computer use the command:

> Blood <- read.csv(file.choose())

Of course you can use a data frame name other than "Blood" if you wish. You can check the first six lines with head(Blood), the last six lines with tail(Blood), or the entire data set by simply entering "Blood". I recommend you attach your data frame with attach(Blood); remember to detach when you are finished with it if you want to move on to another analysis with a different data frame in the same R session.

Now run the paired t-test analysis in R. Obtain the value of the t test statistic and the p-value, exactly as they are reported on the R output. Be careful in your analysis; note that this is a one-sided test. What do you conclude at the 5% level of significance? Also obtain a (two-sided) 95% confidence interval for the (true) mean difference.

PART B: One-Way ANOVA

The purpose of a one-way ANOVA is to test a null hypothesis that the population means of two or more groups are equal (groups are often called "treatments" when the groups are defined by the application of different treatments). The test statistic to conduct the test has an F distribution if the null hypothesis is true, but the value of the F test statistic tends to be inflated if the alternative hypothesis is true. The F test statistic is calculated via an ANOVA table. If the null hypothesis is rejected at the specified significance level and there are three or more groups being compared, we typically want to know where the differences are.
The Tukey HSD (Honestly Significant Difference) procedure is one procedure for doing so, and the easiest one to run in R.

Hand et al. introduce the data as follows: "The data are steady-state haemoglobin levels for patients with different types of sickle cell disease, these being HB SS, HB S/β-thalassaemia and HB SC. One question of interest is whether the steady state haemoglobin levels differ significantly between patients with different types."

The data is provided in both unstacked and stacked formats on our Courselink website. The unstacked version is nice to look at but the stacked version is required for the analysis. Read the data set into R:

> Sickle <- read.csv(file.choose())

Remember to use the stacked version. Check out the first six and last six observations with the head(Sickle) and tail(Sickle) commands. Now attach the data frame:

> attach(Sickle)

B1. Use R to obtain the ANOVA table for testing the null hypothesis that the populations of the three sickle cell disease types have equal mean steady-state haemoglobin levels. The key commands are:

> SickleMod1 <- aov(Haemoglobin ~ SickleCellType)
> anova(SickleMod1)

B2. What is the value of the Total SS?

B3. What does the p-value for the F test statistic allow you to conclude at the 5% level of significance?

B4. If appropriate (and it will be here), use R to conduct Tukey's HSD test at the 5% level of significance. What conclusions can you make based on this analysis? The key command is:

> TukeyHSD(SickleMod1, ordered = T)

The "ordered = T" is an option that makes the output a bit easier to interpret. You will want to get the SickleCellType means. Two ways to do this are:

> tapply(Haemoglobin, SickleCellType, mean)

or

> model.tables(SickleMod1, "means")

B5. Obtain side-by-side boxplots with

> boxplot(Haemoglobin ~ SickleCellType)

You can enhance the boxplots with options, if you wish.

B6.
On Graded Assignment #1, Part B, you made side-by-side boxplots on the "Popes, presidents and monarchs" data set. Repeat the analyses of B1 to B5 on this data set. Report the value of the F test statistic and the associated degrees of freedom and p-value for testing the equality of population means for the three "head of state" types. What overall conclusions can you make after conducting (if necessary) Tukey's HSD test? (Be concise; two to four sentences is sufficient to answer all parts of this question.)

PART C: Simple Linear Regression Analysis

The last topic we cover in lectures is simple linear regression and correlation analysis. Simple linear regression analysis obtains a "best fitting" straight line to a set of bivariate data, where one of the variables is treated as the independent (or predictor) variable, the other variable is treated as the dependent (or response) variable, and the criterion of least squares defines "best fit". The independent variable is commonly designated X and the dependent variable Y, reflecting the fact that the (x, y) data are plotted with the x values on the horizontal (X) axis and the y values on the vertical (Y) axis. A graph of the (x, y) observations in a data set is called a scatterplot; on this it is common to plot a fitted regression model. We restrict our fitted models in this course to straight lines, but you should be aware that many nonlinear models can also be fit (a topic for STAT*2050, STAT*3240 and beyond).

In this exercise we will fit a couple of simple linear regression equations in R; this will provide an example for fitting other simple linear regression models. The data set we will use is "Hand_048 Aflatoxin in peanuts". Hand et al. introduce the data set as follows: "The data give, for 34 batches of peanuts, the average level of aflatoxin (parts per billion) in a mini-lot sample of 120 pounds of peanuts (X) and the percentage of noncontaminated peanuts in the batch (Y).
The aim is to investigate the relationship between the two variables, and to predict Y from X."

The variables X and Y have been relabelled "Aflatoxin" and "NonContaminated" on the data set provided; you can rename these if you wish. Open R and read the data frame into R; I would suggest

> Peanuts <- read.csv(file.choose())

To check your data set enter:

> head(Peanuts)

and

> tail(Peanuts)

Note that the data frame contains 34 observations, with each observation on a separate line. An observation consists of a pair of (Aflatoxin, NonContaminated) values, corresponding in this case to (X, Y). To use these variable names without specifying the dataframe each time we need to "attach" the dataframe:

> attach(Peanuts)

We can make a scatterplot of the data at this point. But if we also want to fit the least squares regression line, we first need to run a "linear model" analysis. Let's regress NonContaminated on Aflatoxin:

> PeanutsMod1 <- lm(NonContaminated ~ Aflatoxin)

If we had not attached the dataframe, we would have typed

> PeanutsMod1 <- lm(NonContaminated ~ Aflatoxin, data = Peanuts)

so that the lm() function would know where to find the variables Aflatoxin and NonContaminated.
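The "data = Peanuts" form of the fit can be tried without attaching anything. The following is a minimal sketch, assuming the 34 (Aflatoxin, NonContaminated) pairs are typed in directly from the attached data set rather than read with read.csv(); the object name pred80 and the choice to print coefficients, confidence intervals and a prediction at Aflatoxin = 80 are illustrative assumptions, not steps prescribed by the assignment text above.

```r
# Aflatoxin in peanuts data, typed in from the attachment (34 batches).
Peanuts <- data.frame(
  Aflatoxin = c(3, 4.7, 8.3, 9.3, 9.9, 11, 12.3, 12.5, 12.6, 15.9,
                16.7, 18.8, 18.8, 18.9, 21.7, 21.9, 22.8, 24.2, 25.8, 30.6,
                36.2, 39.8, 44.3, 46.8, 46.8, 58.1, 62.3, 70.6, 71.1, 71.3,
                83.2, 83.6, 99.5, 111.2),
  NonContaminated = c(99.971, 99.979, 99.982, 99.971, 99.957, 99.961, 99.956,
                      99.972, 99.889, 99.961, 99.982, 99.975, 99.942, 99.932,
                      99.908, 99.97, 99.985, 99.933, 99.858, 99.987, 99.958,
                      99.909, 99.859, 99.863, 99.811, 99.877, 99.798, 99.855,
                      99.788, 99.821, 99.83, 99.718, 99.642, 99.658)
)

# Fit the least squares line; data = Peanuts tells lm() where the variables live,
# so attach(Peanuts) is not needed.
PeanutsMod1 <- lm(NonContaminated ~ Aflatoxin, data = Peanuts)

# Intercept and slope of the fitted line.
print(coef(PeanutsMod1))

# 95% confidence intervals for the intercept and the slope.
print(confint(PeanutsMod1, level = 0.95))

# Predicted NonContaminated at Aflatoxin = 80 (pred80 is a hypothetical name).
pred80 <- predict(PeanutsMod1, newdata = data.frame(Aflatoxin = 80))
print(pred80)
```

Passing the data frame explicitly avoids the name clashes that can occur when several data frames are attached in the same R session.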
To get a bare-bones scatterplot, type:

> plot(NonContaminated ~ Aflatoxin)

or

> plot(Aflatoxin, NonContaminated)

Now superimpose the least squares regression line by typing:

> abline(PeanutsMod1)

Enhancements can be made to the graph, for example we can add a title and better label the axes by adding options to the plot() statement:

> plot(NonContaminated ~ Aflatoxin, main = "Aflatoxin in Peanuts [...]

Attachment: Aflatoxin in peanuts

Aflatoxin  NonContaminated
3          99.971
4.7        99.979
8.3        99.982
9.3        99.971
9.9        99.957
11         99.961
12.3       99.956
12.5       99.972
12.6       99.889
15.9       99.961
16.7       99.982
18.8       99.975
18.8       99.942
18.9       99.932
21.7       99.908
21.9       99.97
22.8       99.985
24.2       99.933
25.8       99.858
30.6       99.987
36.2       99.958
39.8       99.909
44.3       99.859
46.8       99.863
46.8       99.811
58.1       99.877
62.3       99.798
70.6       99.855
71.1       99.788
71.3       99.821
83.2       99.83
83.6       99.718
99.5       99.642
111.2      99.658

Attachment: Sickle cell data (stacked)

Haemoglobin  SickleCellType
7.2          HB.SS
7.7          HB.SS
8            HB.SS
8.1          HB.SS
8.3          HB.SS
8.4          HB.SS
8.4          HB.SS
8.5          HB.SS
8.6          HB.SS
8.7          HB.SS
9.1          HB.SS
9.1          HB.SS
9.1          HB.SS
9.8          HB.SS
10.1         HB.SS
10.3         HB.SS
8.1          HB.S-th
9.2          HB.S-th
10           HB.S-th
10.4         HB.S-th
10.6         HB.S-th
10.9         HB.S-th
11.1         HB.S-th
11.9         HB.S-th
12           HB.S-th
12.1         HB.S-th
10.7         HB.SC
11.3         HB.SC
11.5         HB.SC
11.6         HB.SC
11.7         HB.SC
11.8         HB.SC
12           HB.SC
12.1         HB.SC
12.3         HB.SC
12.6         HB.SC
12.6         HB.SC
13.3         HB.SC
13.3         HB.SC
13.8         HB.SC
13.9         HB.SC

Attachment: Sickle cell data (unstacked)

HB.SS  HB.S-th  HB.SC
7.2    8.1      10.7
7.7    9.2      11.3
8      10       11.5
8.1    10.4     11.6
8.3    10.6     11.7
8.4    10.9     11.8
8.4    11.1     12
8.5    11.9     12.1
8.6    12       12.3
8.7    12.1     12.6
9.1             12.6
9.1             13.3
9.1             13.3
9.8             13.8
10.1            13.9
10.3

Attachment: Type, Calories, Sodium

Type     Calories  Sodium
Beef     186       495
Beef     181       477
Beef     176       425
Beef     149       322
Beef     184       482
Beef     190       587
Beef     158       370
Beef     139       322
Beef     175       479
Beef     148       375
Beef     152       330
Beef     111       300
Beef     141       386
Beef     153       401
Beef     190       645
Beef     157       440
Beef     131       317
Beef     149       319
Beef     135       298
Beef     132       253
Meat     173       458
Meat     191       506
Meat     182       473
Meat     190       545
Meat     172       496
Meat     147       360
Meat     146       387
Meat     139       386
Meat     175       507
Meat     136       393
Meat     179       405
Meat     153       372
Meat     107       144
Meat     195       511
Meat     135       405
Meat     140       428
Meat     138       339
Poultry  129       430
Poultry  132       375
Poultry  102       396
Poultry  106       383
Poultry  94        387
Poultry  102       542
Poultry  87        359
Poultry  99        357
Poultry  107       528
Poultry  113       513
Poultry  135       426
Poultry  142       513
Poultry  86        358
Poultry  143       581
Poultry  152       588
Poultry  146       522
Poultry  144       545

Summary Sheet for Part A

The value of the test statistic is ______________________
Associated degrees of freedom: __________________
The p-value is ___________________
What can you conclude at the 5% level of significance?
The 95% confidence interval is _______________________________

Summary Sheet for Part B

B2. The value of the Total SS is ___________________

B3. The p-value is ____________________
What can you conclude at the 5% level of significance?

B4. What conclusions can be made based on the Tukey's HSD test?

B6. For the "Popes, presidents and monarchs" data set:
The value of the F test statistic is ________________
The degrees of freedom associated with the F test statistic are ____________________
The p-value for the value of the F test statistic is ___________________
What conclusions can be made based on the Tukey's HSD test?

Summary Sheet for Part C

C1. What is the least squares regression equation?

C2. Give the confidence intervals for the intercept and the slope. Show brief work.

C3. The predicted value of NonContaminated for Aflatoxin = 80 is _____________________

C4. The value of the residual for the last observation in the data set is ____________________

C5. The value of the correlation coefficient is ___________________

C6. The value from the ANOVA table that would be obtained is the _______________________

C7.
There are ____________ outliers
What evidence of skewness in the residuals is there and what does it mean?

C8. The first t-statistic is __________________
What does this value allow us to conclude at the 5% level of significance?
The second t-statistic is __________________
What does this value allow us to conclude at the 5% level of significance?

C9. Briefly explain how one of the t-tests in C8 is equivalent to the F test performed here.

C10.
a) How does the estimated regression equation change?
b) How does the correlation coefficient change?
c) Are the hypothesis tests for testing for a statistically significant regression affected?
d) Do you prefer to use NonContaminated or Contaminated as the response variable? Why?

The above two are for 1st model i.e. for NonContaminated

Attachment: Popes, presidents and monarchs (stacked)

Type       Years
President  10
President  29
President  26
President  28
President  15
President  23
President  17
President  25
President  0
President  20
President  4
President  1
President  24
President  16
President  12
President  4
President  10
President  17
President  16
President  0
President  7
President  24
President  12
President  4
President  18
President  21
President  11
President  2
President  9
President  36
President  12
President  28
President  3
President  16
President  9
Pope       2
Pope       9
Pope       21
Pope       3
Pope       6
Pope       10
Pope       18
Pope       11
Pope       6
Pope       25
Pope       23
Pope       6
Pope       2
Pope       15
Pope       32
Pope       25
Pope       11
Pope       8
Pope       17
Pope       19
Pope       5
Pope       15
Pope       0
Monarch    17
Monarch    6
Monarch    13
Monarch    12
Monarch    13
Monarch    33
Monarch    59
Monarch    10
Monarch    7
Monarch    63
Monarch    9
Monarch    25
Monarch    36
Monarch    15

Attachment: Popes, presidents and monarchs (unstacked)

Presidents  Popes  Monarchs
10          2      17
29          9      6
26          21     13
28          3      12
15          6      13
23          10     33
17          18     59
25          11     10
0           6      7
20          25     63
4           23     9
1           6      25
24          2      36
16          15     15
12          32
4           25
10          11
17          8
16          17
0           19
7           5
24          15
12          0
4
18
21
11
2
9
36
12
28
3
16
9
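For B6, the Part B commands carry over to the stacked "Popes, presidents and monarchs" data listed above. What follows is a minimal sketch, assuming the 72 values are typed in from the attachment instead of being read with read.csv(); the names Heads, HeadsMod1, HeadsTab and TotalSS are hypothetical, chosen here to mirror the Sickle example, and the Total SS is obtained by adding the two Sum Sq entries of the ANOVA table.

```r
# Popes, presidents and monarchs data (stacked), typed in from the attachment.
Heads <- data.frame(
  Type = factor(rep(c("President", "Pope", "Monarch"), times = c(35, 23, 14))),
  Years = c(
    # Presidents (35)
    10, 29, 26, 28, 15, 23, 17, 25, 0, 20, 4, 1, 24, 16, 12, 4, 10, 17,
    16, 0, 7, 24, 12, 4, 18, 21, 11, 2, 9, 36, 12, 28, 3, 16, 9,
    # Popes (23)
    2, 9, 21, 3, 6, 10, 18, 11, 6, 25, 23, 6, 2, 15, 32, 25, 11,
    8, 17, 19, 5, 15, 0,
    # Monarchs (14)
    17, 6, 13, 12, 13, 33, 59, 10, 7, 63, 9, 25, 36, 15
  )
)

# B1: one-way ANOVA table (data = Heads replaces attach()).
HeadsMod1 <- aov(Years ~ Type, data = Heads)
HeadsTab <- anova(HeadsMod1)
print(HeadsTab)

# B2: Total SS = treatment SS + residual SS.
TotalSS <- sum(HeadsTab[["Sum Sq"]])
print(TotalSS)

# B4: Tukey's HSD at the 5% level, plus the group means.
print(TukeyHSD(HeadsMod1, ordered = TRUE, conf.level = 0.95))
print(tapply(Heads$Years, Heads$Type, mean))
```

Whether Tukey's HSD is needed depends on the F test in the printed ANOVA table, as the assignment notes.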
Attachment: Presurgical Stress

Patient  A     B
1        10    6.5
2        6.5   14
3        8     13.5
4        12    18
5        5     14.5
6        11.5  9
7        5     18
8        3.5   42
9        7.5   7.5
10       5.8   6
11       4.7   25
12       8     12
13       7     52
14       17    20
15       8.8   16
16       17    15
17       15    11.5
18       4.4   2.5
19       2     2
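The paired analysis described in Part A can be sketched with the Presurgical Stress values above. This is a minimal sketch under two assumptions: that column A holds the samples taken 12-14 hours before surgery and column B those taken 10 minutes before surgery (matching the (a) and (b) labels in the quoted description), and that the data are typed in directly rather than read with read.csv(). The names one_sided and two_sided are hypothetical.

```r
# Presurgical Stress data, typed in from the attachment (fmol/ml).
# Assumption: A = 12-14 hours before surgery, B = 10 minutes before surgery.
Blood <- data.frame(
  Patient = 1:19,
  A = c(10, 6.5, 8, 12, 5, 11.5, 5, 3.5, 7.5, 5.8,
        4.7, 8, 7, 17, 8.8, 17, 15, 4.4, 2),
  B = c(6.5, 14, 13.5, 18, 14.5, 9, 18, 42, 7.5, 6,
        25, 12, 52, 20, 16, 15, 11.5, 2.5, 2)
)

# One-sided paired t test: is the mean of (B - A) greater than zero?
one_sided <- t.test(Blood$B, Blood$A, paired = TRUE, alternative = "greater")
print(one_sided)

# Two-sided version, used only for the 95% confidence interval
# for the true mean difference (B - A).
two_sided <- t.test(Blood$B, Blood$A, paired = TRUE, conf.level = 0.95)
print(two_sided$conf.int)
```

The order of the first two arguments sets the direction of the difference, so swapping them would require alternative = "less" to test the same hypothesis.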