CHAPTER 23  Two Categorical Variables: The Chi-Square Test

APPLY YOUR KNOWLEDGE

23.10 Cell-only versus landline users. We suspect that people who rely entirely on cell phones will as a group be younger than those who have a landline telephone. Do data confirm this guess? Here is a two-way table that breaks down both of Pew's samples (see Example 23.6) by age group (data set CELLAGE):

    Age (years)    Landline sample    Cell-only sample
    18-29                104                 96
    30-49                265                 70
    50-64                204                 26
    65 or older          179                  8
    Total                752                200

Do a complete analysis of these data, following the four-step process as illustrated in Example 23.6.

THE CHI-SQUARE DISTRIBUTIONS

Software usually finds P-values for us. The P-value for a chi-square test comes from comparing the value of the chi-square statistic with critical values for a chi-square distribution.

THE CHI-SQUARE DISTRIBUTIONS
The chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by giving its degrees of freedom.
The chi-square test for a two-way table with r rows and c columns uses critical values from the chi-square distribution with (r − 1)(c − 1) degrees of freedom. The P-value is the area under the density curve of this chi-square distribution to the right of the value of the test statistic.

Figure 23.7 shows the density curves for three members of the chi-square family of distributions. As the degrees of freedom increase, the density curves become less skewed and larger values become more probable. Table D in the back of the book gives critical values for chi-square distributions. You can use Table D if you do not have software that gives you P-values for a chi-square test.
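The two facts in the box can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names degrees_of_freedom and chi2_upper_tail are our own (hypothetical), and the tail area is computed from the standard closed-form expressions for the chi-square upper tail rather than from Table D or a statistics package.

```python
import math


def degrees_of_freedom(rows, cols):
    """Degrees of freedom for a two-way table with `rows` rows and `cols` columns."""
    return (rows - 1) * (cols - 1)


def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-square density with df degrees of freedom.

    Hypothetical helper for illustration: it uses the finite-sum formulas for
    the upper tail (one for even df, one for odd df), not a table lookup.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0  # chi-square distributions take only positive values
    h = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-h) * sum_{i=0}^{m-1} h^i / i!
        total = sum(h ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-h) * total
    # Odd df = 2m + 1: erfc(sqrt(h)) + exp(-h) * sum_{i=1}^{m} h^(i - 1/2) / Gamma(i + 1/2)
    m = (df - 1) // 2
    total = sum(h ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, m + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * total


# A table with 4 rows and 2 columns, the shape of the table in Exercise 23.10.
df = degrees_of_freedom(4, 2)
# 7.815 is an illustrative statistic value, not a result computed from any table here.
p_value = chi2_upper_tail(7.815, df)
print(df, round(p_value, 4))
```

The 4 × 2 shape gives (4 − 1)(2 − 1) = 3 degrees of freedom. A larger statistic with the same degrees of freedom always gives a smaller area to its right, which is why large values of chi-square count as evidence against the null hypothesis.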
STATISTICS IN SUMMARY

Here are the most important skills you should have acquired from reading this chapter.

A. Two-Way Tables
1. Understand that the data for a chi-square test must be presented as a two-way table of counts of outcomes.
2. Use percents to describe the relationship between any two categorical variables, starting from the counts in a two-way table.

B. Interpreting Chi-Square Tests
1. Locate the chi-square statistic, its P-value, and other useful facts (row or column percents, expected counts, terms of chi-square) in output from your software or calculator.
2. Use the expected counts to check whether you can safely use the chi-square test.
3. Explain what null hypothesis the chi-square statistic tests in a specific two-way table.
4. If the test is significant, compare percents, compare observed with expected cell counts, or look for the largest terms of the chi-square statistic to see what deviations from the null hypothesis are most important.

C. Doing Chi-Square Tests by Hand
1. Calculate the expected count for any cell from the observed counts in a two-way table. Check whether you can safely use the chi-square test.
2. Calculate the term of the chi-square statistic for any cell, as well as the overall statistic.
3. Give the degrees of freedom of a chi-square statistic. Make a quick assessment of the significance of the statistic by comparing the observed value with the degrees of freedom.
4. Use the chi-square critical values in Table D to approximate the P-value of a chi-square test.

LINK IT

Part IV of the text studies relationships between variables. Relationships between two quantitative variables were introduced in Chapters 4 and 5, and these will be described in greater detail in the next chapter.
In this chapter, the case of two categorical variables is considered, and a formal test for answering the question "Is there a relationship between the two categorical variables?" is developed. As with procedures described in earlier chapters, we must first consider how the data were produced, as this plays an important role in the conclusions we can reach. Were the data produced by an experiment or an observational study? If it is an observational study, are there lurking variables that can explain the observed relationship? In addition, we should begin with data analysis rather than a formal test. In the case of two-way tables, this typically involves looking at conditional distributions, both numerically and graphically, in order to first understand the nature of the relationship. When considering the relationship between the age of young adults and their living arrangements in Example 23.1, we can see from our data analysis that as young adults age from 19 to 22, the percent living with their parents drops as the percent living in their own place rises.

Even though there appears to be a clear relationship between age and living arrangement in Example 23.1, we must still determine whether the observed differences are large enough to be statistically significant. The chi-square test can be used for this, but it is an approximate procedure, and the conditions for cell sizes need to be checked before applying the test. If the differences are statistically significant, the chi-square test, unlike some of the simpler procedures in earlier chapters, tells us only that there is evidence of a relationship, not the nature of the relationship.
Although there are formal statistical procedures to further investigate the nature of the relationship, at this point we need to be satisfied with describing the relationship between the two categorical variables using our data analysis tools.

CHECK YOUR SKILLS

Resistance training is a popular form of conditioning aimed at enhancing sports performance and is widely used among high school, college, and professional athletes, although its use for younger athletes is controversial. A random sample of 4111 patients between the ages of 8 and 30 admitted to U.S. emergency rooms with the injury code "weightlifting" was obtained. These injuries were classified as "accidental" if caused by dropped weight or improper equipment use. The patients were also classified into the four age categories "8-13," "14-18," "19-22," and "23-30." Here is a two-way table of the results (data set WEIGHTLIFTING):¹⁰

    Age      Accidental    Not accidental
    8-13         295             102
    14-18        655             916
    19-22        239             533
    23-30        363            1008

23.18 The number of "accidental" injuries in the sample is
(a) 1552. (b) 2559. (c) 4111.

23.19 The percent of the 14- to 18-year-olds in the sample whose injuries were classified as "accidental" is about
(a) 42.2%. (b) 41.7%. (c) 74.3%.

23.20 The percent of the 14- to 18-year-olds in the sample whose injuries were classified as "accidental" is
(a) higher than the percent for 23- to 30-year-olds.
(b) about the same as the percent for 23- to 30-year-olds.
(c) lower than the percent for 23- to 30-year-olds.

23.21 The expected count of 14- to 18-year-olds whose injuries were classified as "accidental" is about
(a) 593.09. (b) 655. (c) 977.91.

23.22 The term in the chi-square statistic for the cell of 14- to 18-year-olds whose injuries were classified as "accidental" is about
(a) 593.09. (b) 3.919. (c) 6.463.

23.23 The degrees of freedom for the chi-square test for this two-way table are
(a) 3. (b) 4. (c) 8.
23.24 The null hypothesis for the chi-square test for this two-way table is
(a) The proportions of "Accidental" and "Not accidental" injuries are the same.
(b) There is no difference in the probabilities of an "accidental" injury for each of the four age groups.
(c) "Accidental" injuries are more likely for the younger age groups.

23.25 The alternative hypothesis for the chi-square test for this two-way table is
(a) The proportions of "Accidental" and "Not accidental" injuries are different.
(b) The probabilities of an "accidental" injury for each of the four age groups are not the same.
(c) "Accidental" injuries are more likely for the younger age groups.

23.26 Software gives chi-square statistic χ² = 325.459 for this table. From the table of critical values, we can say that the P-value is
(a) between 0.0025 and 0.001.
(b) between 0.001 and 0.0005.
(c) less than 0.0005.

23.27 The most important fact that allows us to trust the results of the chi-square test is that
(a) the sample is large, 4111 weight-lifting injuries in all.
(b) the sample is close to an SRS of all weight-lifting injuries.
(c) all the cell counts are greater than 100.

CHAPTER 23 EXERCISES

If you have access to software or a graphing calculator, use it to speed your analysis of the data in these exercises. Exercises 23.28 to 23.33 are suitable for hand calculation if necessary.

23.28 Smoking cessation. A large randomized trial was conducted to assess the efficacy of Chantix for smoking cessation compared with bupropion (more commonly known as Wellbutrin or Zyban) and a placebo. Chantix is different from most other quit-smoking products in that it targets nicotine receptors in the brain, attaches to them, and blocks nicotine from reaching them, while bupropion is an antidepressant often used to help people stop smoking.
Generally healthy smokers who smoked at least 10 cigarettes per day were assigned at random to take Chantix (n = 352), bupropion (n = 329), or a placebo (n = 344). The study was double-blind, with the response measure being continuous cessation from smoking for Weeks 9 through 12 of the study. Here is a two-way table of the results (data set SMOKECESS):¹¹

                                      Treatment
                                Chantix    Bupropion    Placebo
    No smoking in Weeks 9-12      155          97          61
    Smoked in Weeks 9-12          197         232         283

(a) Give a 95% confidence interval for the difference between the proportions of smokers in the bupropion and placebo groups who did not smoke in Weeks 9 through 12 of the study.
(b) What proportion of each of the three groups in the sample did not smoke in Weeks 9 through 12 of the study? Are there statistically significant differences among these proportions? State hypotheses and give a test statistic and its P-value.
(c) Is this an observational study or an experiment? Why does this make a difference in the type of conclusion we can draw?

23.29 Attitudes toward recycled products. Some people think recycled products are lower in quality than other products, a fact that makes recycling less practical. Here are data on attitudes toward coffee filters made of recycled paper (data set RECYCLING).¹²

                 Think the quality of the recycled product is
                 Higher    Same    Lower
    Buyers         20        7        9
    Nonbuyers      29       25       43

(a) Find the conditional distributions of opinions on the quality of recycled products for buyers and nonbuyers. Make a graph that compares the two conditional distributions. Use your work to describe the overall relationship between people who have and haven't bought recycled filters and their opinions on the quality of recycled products.
(b) Do buyers and nonbuyers of recycled filters differ significantly in their opinions on the quality of recycled products? State hypotheses, give the chi-square statistic and its P-value, and state your conclusion.
(c) Association does not prove causation.
Explain how buying recycled filters might improve a person's opinion of their quality. Then explain how the opinion a person holds might influence his or her decision to buy or not. You see that the cause-and-effect relationship might go in either direction.

23.30 Do you use cocaine? Sample surveys on sensitive issues can give different results depending on how the question is asked. A University of Wisconsin study divided 2400 respondents into three groups at random. All were asked if they had ever used cocaine. One group of 800 was interviewed by phone; 21% said they had used cocaine. Another 800 people were asked the question in a one-on-one personal interview; 25% said "Yes." The remaining 800 were allowed to make an anonymous written response; 28% said "Yes."¹³ Are there statistically significant differences among these proportions? State the hypotheses, convert the information given into a two-way table of counts, give the test statistic and its P-value, and state your conclusions.

23.31 Did the randomization work? After randomly assigning subjects to treatments in a randomized comparative experiment, we can compare the treatment groups to see how well the randomization worked. We hope to find no significant differences among the groups. A study of how to provide premature infants with a substance essential to their development assigned infants at random to receive one of four types of supplement, called PBM, NLCP, PL-LCP, and TG-LCP.¹⁴
(a) The subjects were 77 premature infants. Outline the design of the experiment if 20 are assigned to the PBM group and 19 to each of the other treatments.
(b) The random assignment resulted in 9 females in the TG-LCP group and 11 females in each of the other groups. Make a two-way table of group by gender and do a chi-square test to see if there are significant differences among the groups.
What do you find?

23.32 More on video-gaming. The data for comparing two sample proportions can be presented in a two-way table containing the counts of successes and failures in both samples, with two rows and two columns. In Exercise 23.2, a survey of the consequences of video-gaming on 14- to 18-year-olds is described. Another question from the survey was about aggressive behavior as evidenced by getting into serious fights, and the comparison was between girls that have and have not played video games. Here are the data:

                            Serious Fights
                            Yes       No
    Played games             36      578
    Never played games       55     1436

(a) Is there evidence that the proportions of all 14- to 18-year-old girls who played or have never played video games and have gotten into serious fights differ? Find the two sample proportions, the z statistic, and its P-value.
(b) Is there evidence that the proportions of 14- to 18-year-old girls who have or have not gotten into serious fights differ between those who have played or have never played video games? Find the chi-square statistic χ² and its P-value.
(c) Show that (up to roundoff error) your χ² is the same as z². The two P-values are also the same. These facts are always true, so you will often see chi-square for 2 × 2 tables used to compare two proportions.
(d) Suppose that we are interested in finding out if the data give good evidence that video-gaming is associated with increased aggression in girls as evidenced by getting into serious fights. Can we use the z test for this hypothesis? What about the χ² test? What is the important difference between these two procedures?

23.33 Unhappy rats and tumors. Some people think that the attitude of cancer patients can influence the progress of their disease. We can't experiment with humans, but here is a rat experiment on this theme. Inject 60 rats with tumor cells and then divide them at random into two groups of 30.
All the rats receive electric shocks, but rats in Group 1 can end the shock by pressing a lever. (Rats learn this sort of thing quickly.) The rats in Group 2 cannot control the shocks, which presumably makes them feel helpless and unhappy. We suspect that the rats in Group 1 will develop fewer tumors. The results: 11 of the Group 1 rats and 22 of the Group 2 rats developed tumors.¹⁵
(a) Make a two-way table of tumors by group. State the null and alternative hypotheses for this investigation.
(b) Although we have a two-way table, the chi-square test can't test a one-sided alternative. Carry out the z test and report your conclusion.

23.34 I think I'll be rich by age 30. A sample survey asked young adults (aged 19 to 25), "What do you think are the chances you will have much more than a middle-class income at age 30?" The Minitab output in Figure 23.8 shows the two-way table and related information, omitting a few subjects who refused to respond or who said they were already rich.¹⁶ Use the output as the basis for a discussion of the differences between young men and young women in assessing their chances of being rich by age 30 (data set RICHBY30).

23.35 Sexy magazine ads? Look at full-page ads in magazines with a young adult readership. Classify ads that show a model as "not sexual" or "sexual" depending on how the model is dressed (or not dressed). Here are data on 1509 ads in magazines aimed at young men only, at young women only, or at young adults in general (data set SEXYADS):¹⁷

                         Readers
    Ad type       Men    Women    General
    Sexual        105     225        66
    Not sexual    514     351       248

Figure 23.9 displays Minitab chi-square output. Use the information in the output to describe the relationship between the target audience and the sexual content of ads in magazines for young adults.

FIGURE 23.8 Minitab output for the sample survey responses of Exercise 23.34.

                                          Female       Male
    A: Almost no chance                       96         98
                                            95.2       98.8
                                          0.0076     0.0073

    B: Some chance but probably not          426        286
                                           349.2      362.8
                                         16.8842    16.2525

    C: A 50-50 chance                        696        720
                                           694.5      721.5
                                          0.0032     0.0031

    D: A good chance                         663        758
                                           697.0      724.0
                                          1.6543     1.5924

    E: Almost certain                        486        597
                                           531.2      551.8
                                          3.8424     3.6986

    Cell Contents:  Count
                    Expected count
                    Contribution to Chi-square

    Pearson Chi-Square = 43.946, DF = 4, P-Value = 0.000

FIGURE 23.9 Minitab output for a study of ads in magazines, for Exercise 23.35.

                      Men     Women   General       All
    Sexual            105       225        66       396
                    16.96     39.06     21.02     26.24
                    162.4     151.2      82.4     396.0
                   20.312    36.074     3.265         *

    Not sexual        514       351       248      1113
                    83.04     60.94     78.98     73.76
                    456.6     424.8     231.6    1113.0
                    7.227    12.835     1.162         *

    All               619       576       314      1509
                   100.00    100.00    100.00    100.00
                    619.0     576.0     314.0    1509.0
                        *         *         *         *

    Cell Contents:  Count
                    % of Column
                    Expected count
                    Contribution to Chi-square

    Pearson Chi-Square = 80.874, DF = 2, P-Value = 0.00

Mistakes in using the chi-square test are unusually common. Exercises 23.36 to 23.39 illustrate several kinds of mistake.

23.36 Sorry, no chi-square. An experimenter hid a toy from a dog behind either Screen A or Screen B. In the first phase the toy was always hidden behind Screen A, while in the second phase the toy was always hidden behind Screen B. Will the dog continue to look behind Screen A in the second phase? This was tried under three conditions. In the Social-Communicative condition the experimenter communicated with the dog by establishing eye contact and addressing the dog while hiding the toy; in the Noncommunicative condition the toy was hidden without communication; and in the Nonsocial condition the toy was dragged by a string so that it could be hidden without any interaction from the experimenter.
There were 12 dogs assigned at random to each condition, and each dog had up to three trials to find the toy hidden behind Screen B in Phase 2. An error occurred if the dog continued to search behind Screen A. The number of errors ranged from 0 if the dog found the toy behind Screen B on the initial trial up to 3 if the dog never correctly chose Screen B. Here are the data (data set HIDDENTOY):¹⁸

                               Number of Errors
    Condition                0     1     2     3
    Social-Communicative     0     3     3     6
    Noncommunicative         5     3     1     3
    Nonsocial                8     2     2     0

(a) The data do show a difference in the number of errors for the different conditions. Show this by comparing suitable percents.
(b) The researchers used a more complicated but exact procedure rather than chi-square to assess significance for these data. Why can't the chi-square test be trusted in this case?
(c) If you use software, does the chi-square output for these data warn you against using the test?

23.37 Sorry, no chi-square. How do U.S. residents who travel overseas for leisure differ from those who travel for business? Here is the breakdown by occupation:¹⁹

    Occupation                Leisure travelers    Business travelers
    Professional/technical          36%                   39%
    Manager/executive               23%                   48%
    Retired                         14%                    3%
    Student                          7%                    3%
    Other                           20%                    7%
    Total                          100%                  100%

Explain why we don't have enough information to use the chi-square test to learn whether these two distributions differ significantly.

23.38 Sorry, no chi-square. Here is more information about Internet use by students at Penn State, based on a random sample of 1852 undergraduates. Explain why it is not correct to use a chi-square test on this table to compare the University Park and commonwealth campuses. Note that in order to use the chi-square test in a two-way table, each individual must fall into one cell of the table.
    Internet use                                   University Park    Commonwealth
    Viewed a video on YouTube or similar site            875               700
    Legally purchased music or videos online             514               348
    Downloaded a podcast                                 235               145
    Participated in Internet gambling                    114                93

23.39 Sorry, no chi-square. Does eating chocolate trigger headaches? To find out, women with chronic headaches followed the same diet except for eating chocolate bars and carob bars that looked and tasted the same. Each subject ate both chocolate and carob bars in random order with at least three days between. Each woman then reported whether or not she had a headache within 12 hours of eating the bar. Here is a two-way table of the results for the 64 subjects:²⁰

    Bar                No headache    Headache
    Chocolate               53           11
    Carob (placebo)         38           26

The researchers carried out a chi-square test on this table to see if the two types of bar differ in triggering headaches. Explain why this test is incorrect. (Hint: There are 64 subjects. How many observations appear in the two-way table?)

The remaining exercises concern larger tables that require software for easy analysis. In many cases, you should follow the Plan, Solve, and Conclude steps of the four-step process in your answers.

23.40 Smokers rate their health. The University of Michigan Health and Retirement Study (HRS) surveys more than 22,000 Americans over the age of 50 every two years. A subsample of the HRS participated in the 2009 Internet-based survey that collected information on a number of topical areas, including health (physical and mental, health behaviors), psychosocial items, economics (income, assets, expectations, and consumption), and retirement.²¹ Two of the questions asked on the Internet survey were "Would you say your health is excellent, very good, good, fair or poor?" and "Do you smoke cigarettes now?" The two-way table (data set SMOKERRATING) summarizes the answers to these two questions.
                   Current Smoker
    Health         Yes       No
    Excellent       25      484
    Very good      115     1557
    Good           145     1309
    Fair            90      545
    Poor            29       11

(a) Regard the HRS Internet sample as approximately an SRS of Americans over the age of 50, and give a 99% confidence interval for the proportion of Americans over the age of 50 who are current smokers.
(b) Compare the conditional distributions of self-evaluation of health for current smokers and nonsmokers using both a table and a graph. What are the most important differences?
(c) Carry out the chi-square test for the hypothesis of no difference between the self-evaluation of health for current smokers and nonsmokers. What would be the mean of the test statistic if the null hypothesis were true? The value of the statistic is so far above this mean that you can see at once that it must be highly significant. What is the approximate P-value?
(d) Look at the terms of the chi-square statistic and compare observed and expected counts in the cells that contribute the most to chi-square. Based on this and your findings in part (b), write a short comparison of the differences in self-evaluation of health for current smokers and nonsmokers.

23.41 Who goes to religious services? The General Social Survey (GSS) asked this question: "Have you attended religious services in the last week?" Here are the responses for those whose highest degree was high school or above (data set SERVICES):

                                          Highest Degree Held
                               High school    Junior college    Bachelor's    Graduate
    Attended services              400              62              146           76
    Did not attend services        880             101              232          105

(a) Carry out the chi-square test for the hypothesis of no relationship between the highest degree attained and attendance at religious services in the last week. What do you conclude?
(b) Make a 2 × 3 table by omitting the column corresponding to those whose highest degree was high school. Carry out the chi-square test for the hypothesis of no relationship between the type of advanced degree attained and attendance at religious services in the last week. What do you conclude?
(c) Make a 2 × 2 table by combining the counts in the three columns that have a highest degree beyond high school, so that you are comparing adults whose highest degree was high school with those whose highest degree was beyond high school. Carry out the chi-square test for the hypothesis of no relationship between attaining a degree beyond high school and attendance at religious services for this 2 × 2 table. What do you conclude?
(d) Using the results from these three chi-square tests, write a short report explaining the relationship between attendance at religious services in the last week and the highest degree attained. As part of your report, you should give the percents who attended religious services for each of the four degrees.

23.42 Condom usage among high school students. The Centers for Disease Control developed the Youth Risk Behavior Surveillance System (YRBSS) to monitor six categories of priority health risk behaviors among youth: behaviors that contribute to unintentional injuries and violence; tobacco use; alcohol and other drug use; sexual behaviors that contribute to unintended pregnancy and sexually transmitted diseases; unhealthy dietary behaviors; and physical inactivity. A multistage sample design is used to produce representative samples of students in grades 9 to 12, who then fill out a questionnaire on these behaviors. The data below are for the question "Did Not Use a Condom during Last Sexual Intercourse?" The two-way table of grade and condom usage includes only students who were currently sexually active. Here are the results (data set CONDOM_USE):²²

               Condom Used
    Grade      Yes       No
    9th        300      532
    10th       350      736
    11th       601      956
    12th       873     1068

Describe the most important differences between condom usage and grade. Is there a significant overall difference between the proportions who used condoms in the different grades?
23.43 How are schools doing? The nonprofit group Public Agenda conducted telephone interviews with a stratified sample of parents of high school children. There were 202 black parents, 202 Hispanic parents, and 201 white parents. One question asked was "Are the high schools in your state doing an excellent, good, fair or poor job, or don't you know enough to say?" Here are the survey results (data set HIGHSCHOOLS):²³

    Opinion       Black parents    Hispanic parents    White parents
    Excellent          12                 34                 22
    Good               69                 55                 81
    Fair               75                 61                 60
    Poor               24                 24                 24
    Don't know         22                 28                 14
    Total             202                202                201

Are the differences in the distributions of responses for the three groups of parents statistically significant? What departures from the null hypothesis "no relationship between group and response" contribute most to the value of the chi-square statistic? Write a brief conclusion based on your analysis.

23.44 Complications of bariatric surgery. Bariatric surgery, or weight-loss surgery, includes a variety of procedures performed on people who are obese. Weight loss is achieved by reducing the size of the stomach with an implanted medical device (gastric banding), through removal of a portion of the stomach (sleeve gastrectomy), or by resecting and rerouting the small intestines to a small stomach pouch (gastric bypass surgery). Because there can be complications using any of these methods, the National Institutes of Health recommends bariatric surgery for obese people with a body mass index (BMI) of at least 40 and for people with a BMI of at least 35 and serious coexisting medical conditions such as diabetes. Serious complications include potentially life-threatening, permanently disabling, and fatal outcomes. Here is a two-way table for data collected in Michigan over several years giving counts of non-life-threatening complications, serious complications, and no complications for these three types of surgeries (data set BARIATRIC):²⁴

                                     Type of Complication
    Type of surgery         Non-life-threatening    Serious    None    Total
    Gastric banding                  81                46      5253     5380
    Sleeve gastrectomy               31                19       804      854
    Gastric bypass                  606               325      8110     9041

(a) Is this study an experiment? Explain your answer.
(b) Is there a significant difference in the distributions of type of complication for the three types of surgery? Which surgeries have the greatest chance of complications? Can we conclude that it is the surgery that is more dangerous, or could there be other factors associated with the increased risk?

23.45 Market research. Before bringing a new product to market, firms carry out extensive studies to learn how consumers react to the product and how best to advertise its advantages. Here are data from a study of a new laundry detergent.²⁵ The subjects are people who don't currently use the established brand that the new product will compete with. Give subjects free samples of both detergents. After they have tried both for a while, ask which they prefer. The answers may depend on other facts about how people do laundry (data set LAUNDRY).

                                                 Laundry Practices
                                 Soft water,    Soft water,    Hard water,    Hard water,
    Preference                   warm wash      hot wash       warm wash      hot wash
    Prefer standard product          53             27             42             30
    Prefer new product               63             29             68             42

How do laundry practices (water hardness and wash temperature) influence the choice of detergent? In which settings does the new detergent do best? Are the differences between the detergents statistically significant?

Support for political parties. Political parties want to know what groups of people support them. The General Social Survey (GSS) asked its 2008 sample, "Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?" The GSS is essentially an SRS of American adults. Here is a large two-way table breaking down the responses by the highest degree the subject held:

                                                  Highest Degree Held
                                              High      Jr.
    Party support                    None    school    college    Bachelor's    Graduate
    Strong Democrat                   63      185        32           56           54
    Not strong Democrat               45      183        30           44           29
    Independent, near Democrat        34      132        19           57           20
    Independent                       87      156        22           31           26
    Independent, near Republican      19       78        20           35           10
    Not strong Republican             20      147        33           76           27
    Strong Republican                 25       98        13           43           22
    Other party                        2       16         4           11            5

Exercises 23.46 to 23.48 are based on this table.

23.46 Other parties. Give a 95% confidence interval for the proportion of adults who are "Independent."

23.47 Party support in brief. Make a 2 × 5 table by combining the counts in the three rows that mention "Democrat" and in the three rows that mention "Republican" and ignoring strict independents and supporters of other parties. We might think of this table as comparing all adults who lean Democrat and all adults who lean Republican. How does support for the two major parties differ among adults with different levels of education? (Data set POLPARTYCOMBINE.)

23.48 Party support in full. Use the full table to analyze the differences in political party support among levels of education. The sample is so large that the differences are bound to be highly significant, but give the chi-square statistic and its P-value nonetheless. The main challenge is in seeing what the data say. Does the full table yield any insights not found in the compressed table you analyzed in the previous exercise? (Data set POLPARTYFULL.)

EXPLORING THE WEB

23.49 Make your own table.
The Behavioral Risk Factor Surveillance System (BRFSS) is an ongoing data collection program designed to measure behavioral risk factors for the adult population (18 years of age or older) living in households. Data are collected from a random sample of adults (one per household) through a telephone survey. Go to the Web site apps.nccd.cdc.gov/BRFSS/ and under BRFSS Contents click on Web Enabled Analysis Tool (WEAT) and then click on Cross Tabulation Analysis. After selecting a year, a window will open that will allow you to produce two-way tables. (a) Choose a state of interest to you and two variables for the two-way table. For example, you could choose Connecticut and look at the relationship between a demographic variable such as education level and a variable such as health care coverage. Once you have chosen your state and two variables, click on run report at the bottom of the page. A two-way table will appear in a new window. (b) Is there a relationship between the two variables you selected? If the relationship is statistically significant, describe the relationship in a brief report using percents from the table and an appropriate graph. 23.50 What do the voters think? The American National Election Studies (ANES) is the leading academically run national survey of voters in the United States and is conducted before and after every presidential election. SDA (Survey Documentation and Analysis) is a set of programs that allows you to analyze survey data and includes the ANES survey as part of its archive. Go to the Web site sda.berkeley.edu/ and click on Archive. Go to the 2008 ANES survey. (a) Open the pre-election survey data. Under Liberal/Conservative, choose the variable \"liberal/conservative self-placement on a 7 point scale.\" Use this as your row variable. Under Issues, choose the variable \"Iraq war increased or decreased the threat of terrorism.\" Use this as your column variable. 
In the details for the table, set Weight to none, and for N of Cases to Display, make sure the unweighted box is checked. For Percentaging, choose row percents. Now click on "run the table."
(b) To analyze the data, make a 3 × 3 table by combining the rows for extremely liberal and liberal; slightly liberal, middle of the road, and slightly conservative; and conservative and extremely conservative. Carry out a formal test to determine if there is a relationship between these two variables, and then describe the relationship in a brief report using percents from the table or an appropriate graph.
(c) Select two other variables of interest to you and analyze the relationship between them. If there is a more recent survey than 2008, you should use it.

CHAPTER 24 SUMMARY

CHAPTER SPECIFICS

Least-squares regression fits a straight line to data in order to predict a response variable y from an explanatory variable x. Inference about regression requires more conditions.

The conditions for regression inference say that there is a population regression line μ_y = α + βx that describes how the mean response varies as x changes. The observed response y for any x has a Normal distribution with mean given by the population regression line and with the same standard deviation σ for any value of x. Observations on y are independent.

The parameters to be estimated are the intercept α and the slope β of the population regression line and also the standard deviation σ. The intercept a and slope b of the least-squares line estimate α and β. Use the regression standard error s to estimate σ.

The regression standard error s has n − 2 degrees of freedom. All t procedures in regression inference have n − 2 degrees of freedom.
To test the hypothesis that the slope is zero in the population, use the t statistic t = b/SE_b. This null hypothesis says that straight-line dependence on x has no value for predicting y. In practice, use software to find the slope b of the least-squares line, its standard error SE_b, and the t statistic.

The t test for regression slope is also a test for the hypothesis that the population correlation between x and y is zero. To do this test without software, use the sample correlation r and Table E.

Confidence intervals for the slope of the population regression line have the form b ± t*SE_b.

Confidence intervals for the mean response when x has value x* have the form ŷ ± t*SE_μ̂. Prediction intervals for an individual future response y have a similar form with a larger standard error, ŷ ± t*SE_ŷ. Software often gives these intervals.

STATISTICS IN SUMMARY

Here are the most important skills you should have acquired from reading this chapter.

A. Preliminaries
1. Make a scatterplot to show the relationship between an explanatory and a response variable.
2. Use a calculator or software to find the correlation and the equation of the least-squares regression line.
3. Recognize which type of inference you need in a particular regression setting.

B. Inference Using Software Output
1. Explain in any specific regression setting the meaning of the slope β of the population regression line.
2. Understand software output for regression. Find in the output the slope and intercept of the least-squares line, their standard errors, and the regression standard error.
3. Use that information to carry out tests of H0: β = 0 and calculate confidence intervals for β.
4. Explain the distinction between a confidence interval for the mean response and a prediction interval for an individual response.
5. If software gives output for prediction, use that output to give either confidence or prediction intervals.
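The slope procedures summarized above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the book's method of computation (the text relies on Minitab): the function name `slope_inference` and its return keys are hypothetical, the data are the thickness and gate-velocity columns printed in Figure 24.13 of this extract, and the critical value t* = 1.812 (90% confidence, n − 2 = 10 degrees of freedom) is assumed to be read from a t table rather than computed.

```python
import math

# Piston wall thickness (inches) and gate velocity (feet per second),
# copied from the data columns of Figure 24.13 in this extract.
thick = [0.248, 0.359, 0.366, 0.400, 0.524, 0.552,
         0.628, 0.697, 0.697, 0.752, 0.806, 0.821]
veloc = [123.8, 223.9, 180.9, 104.8, 228.6, 223.8,
         326.2, 302.4, 145.2, 263.1, 302.4, 302.4]


def slope_inference(x, y, t_star):
    """Least-squares fit plus the t statistic and confidence interval for the slope.

    t_star is the t critical value for n - 2 degrees of freedom; it is
    supplied by the caller (for example, from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx                 # least-squares slope
    a = y_bar - b * x_bar         # least-squares intercept

    # Regression standard error s, with n - 2 degrees of freedom.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    se_b = s / math.sqrt(sxx)     # standard error of the slope
    t = b / se_b                  # t statistic for H0: beta = 0
    margin = t_star * se_b        # half-width of the interval b +/- t* SE_b
    return {"a": a, "b": b, "s": s, "se_b": se_b, "t": t,
            "ci": (b - margin, b + margin)}


# 90% confidence, df = 12 - 2 = 10, so t* = 1.812 (assumed, from a t table).
result = slope_inference(thick, veloc, t_star=1.812)
print(result)
```

If the sketch is right, the slope, its standard error, and s should agree closely with the Coef, SE Coef, and S entries that Minitab reports in Figure 24.13; the P-value for the t statistic would still come from a t table or software.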
Here is part of the Minitab output for regressing selling price on appraised value, along with prediction for a unit with appraised value $800,000:

Predictor     Coef      SE Coef     T       P
Constant      86.0      156.8       0.55    0.588
Appraisal     1.2699    0.1938      6.55    0.000

S = 235.410   R-Sq = 62.3%   R-Sq(adj) = 60.8%

Predicted Values for New Observations
New Obs   Fit       SE Fit   95% CI              95% PI
1         1101.9    44.7     (1010.0, 1193.9)    (609.4, 1594.5)

Exercises 24.16 to 24.24 are based on this information.

24.16 The equation of the least-squares regression line for predicting selling price from appraised value is
(a) price = 86.0 + 1.2699 × appraised value.
(b) price = 1.2699 + 86.0 × appraised value.
(c) price = 156.8 + 0.1938 × appraised value.

24.17 What is the correlation between selling price and appraised value?
(a) 0.789
(b) 0.623
(c) 0.388

24.18 The slope β of the population regression line describes
(a) the average selling price in a population of units when a unit's appraised value is 0.
(b) the average increase in selling price in a population of units when appraised value increases by $1000.
(c) the exact increase in the selling price of an individual unit when its appraised value increases by $1000.

24.19 Is there significant evidence that selling price increases as appraised value increases? To answer this question, test the hypotheses
(a) H0: β = 0 versus Ha: β > 0.
(b) H0: β = 0 versus Ha: β ≠ 0.
(c) H0: α = 0 versus Ha: α > 0.

24.20 Minitab shows that the P-value for this test is
(a) 0.588.
(b) 0.1938.
(c) less than 0.001.

24.21 The regression standard error for these data is
(a) 0.1938.
(b) 156.8.
(c) 235.41.

24.22 Confidence intervals and tests for these data use the t distribution with degrees of freedom
(a) 28.
(b) 27.
(c) 26.

24.23 A 95% confidence interval for the population slope β is
(a) 1.2699 ± 0.3306.
(b) 1.2699 ± 322.3808.
(c) 1.2699 ± 0.3985.

24.24 Louisa owns a unit in this building appraised at $800,000. The Minitab output includes prediction for this appraised value. She can be 95% confident that her unit would sell for between
(a) $609,400 and $1,594,500.
(b) $1,010,000 and $1,193,900.
(c) $1,057,200 and $1,146,600.

CHAPTER 24 EXERCISES

24.25 Genetically engineered cotton. A strain of genetically engineered cotton, known as Bt cotton, is resistant to certain insects, which results in larger yields of cotton. Farmers in northern China have increased the number of acres planted in Bt cotton. Because Bt cotton is resistant to certain pests, farmers have also reduced their use of insecticide. Scientists in China were interested in the long-term effects of Bt cotton cultivation and decreased insecticide use on insect populations that are not affected by Bt cotton. One such insect is the mirid bug. Scientists measured the number of mirid bugs per 100 plants and the proportion of Bt cotton planted at 38 locations in northern China for the 12-year period from 1997 to 2008. The scientists reported a regression analysis as follows:⁸

number of mirid bugs per 100 plants = 0.54 + 6.81 × Bt cotton planting proportion
r² = 0.90, P < 0.0001

(a) What does the slope b = 6.81 say about the relation between Bt cotton planting proportion and number of mirid bugs per 100 plants?
(b) What does r² = 0.90 add to the information given by the equation of the least-squares line?
(c) What null and alternative hypotheses do you think the P-value refers to? What does this P-value tell you?
(d) Does the large value of r² and the small P-value indicate that increasing the proportion of acres planted in Bt cotton causes an increase in mirid bugs?

Exercise 7.51 (page 195) gives data from a study of the "gate velocity" of molten metal that experienced foundry workers choose based on the thickness of the aluminum piston being cast. Gate velocity is measured in feet per second, and the piston wall thickness is in inches. A scatterplot (you need not make one) shows a moderately strong positive linear relationship. Figure 24.13 displays part of the Minitab regression output. Exercises 24.26 to 24.28 analyze these data.

24.26 Casting aluminum: is there a relationship? Figure 24.13 leaves out the t statistics and their P-values. Based on the information in the output, test the hypothesis that there is no straight-line relationship between thickness and gate velocity. State hypotheses, give a test statistic and its approximate P-value, and state your conclusion.

24.27 Casting aluminum: intervals. The output in Figure 24.13 includes prediction for piston wall thickness x* = 0.5 inch. Use the output to give 90% intervals for
(a) the slope of the population regression line of gate velocity on piston thickness.
(b) the average gate velocity for a type of piston with thickness 0.5 inch.

FIGURE 24.13 Minitab output for the regression of gate velocity on piston thickness in casting aluminum parts, for Exercises 24.26 to 24.28.

Regression Analysis: Veloc versus Thick

The regression equation is
Veloc = 70.4 + 275 Thick

Predictor    Coef      SE Coef    T    P
Constant     70.44     52.90
Thick        274.78    88.18

S = 56.3641   R-Sq = 49.3%   R-Sq(adj) = 44.2%

Obs   Thick   Veloc    Fit     SE Fit   Residual   St Resid
  1   0.248   123.8   138.6     32.8      -14.8     -0.32
  2   0.359   223.9   169.1     24.8       54.8      1.08
  3   0.366   180.9   171.0     24.3        9.9      0.19
  4   0.400   104.8   180.3     22.2      -75.5     -1.46
  5   0.524   228.6   214.4     16.8       14.2      0.26
  6   0.552   223.8   222.1     16.4        1.7      0.03
  7   0.628   326.2   243.0     17.0       83.2      1.55
  8   0.697   302.4   262.0     19.7       40.4      0.77
  9   0.697   145.2   262.0     19.7     -116.8     -2.21R
 10   0.752   263.1   277.1     22.8      -14.0     -0.27
 11   0.806   302.4   291.9     26.4       10.5      0.21
 12   0.821   302.4   296.0     27.4        6.4      0.13

R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs   Fit     SE Fit   90% CI            90% PI
1         207.8   17.4     (176.2, 239.4)    (100.9, 314.8)

24.28 Casting aluminum: residuals. The output in Figure 24.13 includes a table of the x and y variables, the fitted values ŷ for each x, the residuals, and some related quantities. [data file: ALUMINUMRES]
(a) Plot the residuals against thickness (the explanatory variable). Use vertical scale −200 to 200 so that the pattern is clearer. Add the "residual = 0" line. Does your plot show a systematically nonlinear relationship? Does it show systematic change in the spread about the regression line?
(b) Make a histogram of the residuals. Minitab identifies the residual for Observation 9 as a suspected outlier. Does your histogram agree?
(c) Redoing the regression without Observation 9 gives regression standard error s = 42.4725 and predicted mean velocity 216 feet per second (90% confidence interval 191.4 to 240.6) for piston walls 0.5 inch thick. Compare these values with those in Figure 24.13. Is Observation 9 influential for inference?

Table 4.1 (page 103) gives 33 years' data on boats registered in Florida and manatees killed by boats. Figure 4.2 (page 103) shows a strong linear relationship. The correlation is r = 0.951. Figure 24.14 shows part of the Minitab regression output. Exercises 24.29 to 24.31 analyze the manatee data.

24.29 Manatees: conditions for inference. We know that there is a strong linear relationship. Let's check the other conditions for inference. Figure 24.14 includes a table of the two variables, the predicted values ŷ for each x in the data, the residuals, and related quantities. [data file: MANATEESRES]
(a) Round the residuals to the nearest whole number and make a stemplot. The distribution is single-peaked and symmetric and appears close to Normal.
(b) Make a residual plot, residuals against boats registered. Use a vertical scale from −25 to 25 to show the pattern more clearly. Add the "residual = 0" line. There is no clearly nonlinear pattern. The spread about the line may be a bit greater for larger values of the explanatory variable, but the effect is not large.
(c) It is reasonable to regard the number of manatees killed by boats in successive years as independent. The number of boats grew over time. Someone says that pollution also grew over time and may explain more manatee deaths. How would you respond to this idea?

24.30 Manatees: do more boats bring more kills? The output in Figure 24.14 omits the t statistics and their P-values. Based on the information in the output, is there good evidence that the number of manatees killed increases as the number of boats registered increases?
State hypotheses and give a test statistic and its approximate P-value. What do you conclude?

24.31 Manatees: estimation. The output in Figure 24.14 includes prediction of the number of manatees killed when there are 1,050,000 boats registered in Florida. Give 95% intervals for
(a) the increase in the number of manatees killed for each additional 1000 boats registered.
(b) the number of manatees that will be killed next year if there are 1,050,000 boats registered next year.

24.32 Fidgeting keeps you slim: inference. Our first example of regression (Example 5.1, page 125) presented data showing that people who increased their nonexercise activity (NEA) when they were deliberately overfed gained less fat than other people. Use software to add formal inference to the data analysis for these data. [data file: FATGAIN]
(a) Based on 16 subjects, the correlation between NEA increase and fat gain was r = −0.7786. Is this significant evidence that people with higher NEA increase gain less fat? (Report a t statistic from regression output and give the one-sided P-value.)
(b) The slope of the least-squares regression line was b = −0.00344, so that fat gain decreased by 0.00344 kilogram for each added calorie of NEA. Give a 90% confidence interval for the slope of the population regression line. This rate of change is the most important parameter to be estimated.
(c) Sam's NEA increases by 400 calories. His predicted fat gain is 2.13 kilograms. Give a 95% interval for predicting Sam's fat gain.

24.33 Predicting tropical storms. Exercise 5.55 (page 155) gives data on William Gray's predictions of the number of named tropical storms in Atlantic hurricane seasons from 1984 to 2010. Use these data for regression inference as follows. [data file: STORMS2]
(a) Does Professor Gray do better than random guessing? That is, is there a significantly positive correlation between his forecasts and the actual number of storms?
(Report a t statistic from regression output and give the one-sided P-value.)
(b) Give a 95% confidence interval for the mean number of storms in years when Professor Gray forecasts 16 storms.

FIGURE 24.14 Minitab output for the regression of number of manatees killed by boats on the number of boats (in thousands) registered in Florida, for Exercises 24.29 to 24.31.

Regression Analysis: Kills versus Boats

The regression equation is
Kills = -43.2 + 0.129 Boats

Predictor    Coef        SE Coef     T    P
Constant     -43.172     5.716
Boats        0.129232    0.007520

S = 8.05174   R-Sq = 90.5%   R-Sq(adj) = 90.2%

Obs   Boats   Kills    Fit     SE Fit   Residual   St Resid
  1    447    13.00   14.59     2.59      -1.59     -0.21
  2    460    21.00   16.27     2.51       4.73      0.62
  3    481    24.00   18.99     2.38       5.01      0.65
  4    498    16.00   21.19     2.28      -5.19     -0.67
  5    513    24.00   23.12     2.19       0.88      0.11
  6    512    20.00   23.00     2.20      -3.00     -0.39
  7    526    15.00   24.80     2.12      -9.80     -1.26
  8    559    34.00   29.07     1.94       4.93      0.63
  9    585    33.00   32.43     1.81       0.57      0.07
 10    614    33.00   36.18     1.68      -3.18     -0.40
 11    645    39.00   40.18     1.56      -1.18     -0.15
 12    675    43.00   44.06     1.48      -1.06     -0.13
 13    711    50.00   48.71     1.42       1.29      0.16
 14    719    47.00   49.75     1.41      -2.75     -0.35
 15    681    53.00   44.84     1.46       8.16      1.03
 16    679    38.00   44.58     1.47      -6.58     -0.83
 17    678    35.00   44.45     1.47      -9.45     -1.19
 18    696    49.00   46.77     1.43       2.23      0.28
 19    713    42.00   48.97     1.41      -6.97     -0.88
 20    732    60.00   51.43     1.40       8.57      1.08
 21    755    54.00   54.40     1.41      -0.40     -0.05
 22    809    66.00   61.38     1.50       4.62      0.58
 23    830    82.00   64.09     1.57      17.91      2.27R
 24    880    78.00   70.55     1.77       7.45      0.95
 25    944    81.00   78.82     2.10       2.18      0.28
 26    962    95.00   81.15     2.20      13.85      1.79
 27    978    73.00   83.22     2.29     -10.22     -1.32
 28    983    69.00   83.86     2.32     -14.86     -1.93
 29   1010    79.00   87.35     2.49      -8.35     -1.09
 30   1024    92.00   89.16     2.57       2.84      0.37
 31   1027    73.00   89.55     2.59     -16.55     -2.17R
 32   1010    90.00   87.35     2.49       2.65      0.35
 33    982    97.00   83.73     2.32      13.27      1.72

R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI            95% PI
1         92.52   2.74     (86.93, 98.11)    (75.18, 109.87)

24.34 Coral growth. Sea surface temperatures across much of the tropics have been increasing since the mid-1970s. At the same time, the growth of coral has been decreasing. Scientists examined data on mean sea surface temperatures (SST) in degrees Celsius and mean coral growth in millimeters (mm) per year over a several-year period at locations in the Red Sea. Here are the data:⁹ [data file: CORAL]

SST      29.68   29.87   30.16   30.22   30.48   30.65   30.90
Growth    2.63    2.48    2.26    2.58    2.60    2.38    2.26

(a) Do the data indicate that coral growth decreases linearly as SST increases? Is this change statistically significant?
(b) Use the data to predict with 95% confidence the mean coral growth (mm per year) when SST is 30.0 degrees Celsius.

24.35 Predicting tropical storms: residuals. Make a stemplot of the residuals (round to the nearest tenth) from your regression in Exercise 24.33. Explain why your plot suggests that we should not use these data to get a prediction interval for the number of storms in a single year. [data file: STORMS2]

24.36 Coral growth: residuals. Do the data in Exercise 24.34 on mean sea surface temperatures and coral growth in the Red Sea satisfy the conditions for regression inference? To examine this, here are the residuals: [data file: CORALRES]

SST        29.68    29.87    30.16    30.22    30.48    30.65    30.90
Residual   −0.067   −0.060    0.128    0.066    0.025   −0.024   −0.068

(a) Linear relationship. A plot of the residuals against the explanatory variable x magnifies the deviations from the least-squares line. Does the plot show any systematic deviation from a roughly linear pattern?
(b) Normal variation about the line. Make a histogram of the residuals. With only 7 observations, no clear shape emerges. Do strong skewness or outliers suggest lack of Normality?
(c) Independent observations. Why are the 7 observations independent?
(d) Spread about the line stays the same. Does your plot in (a) show any systematic change in spread as x changes?

24.37 Our brains don't like losses. Exercise 4.29 (page 117) describes an experiment that showed a linear relationship between how sensitive people are to monetary losses ("behavioral loss aversion") and activity in one part of their brains ("neural loss aversion"). [data file: LOSSES]
(a) Make a scatterplot with neural loss aversion as x and behavioral loss aversion as y. One point is a high outlier in both the x and y directions. In Exercise 5.38 (page 152) you found that this outlier is not influential for the least-squares line.
(b) The research report says that r = 0.85 and that the test for regression slope has P < 0.001. Verify these results, using all the observations.
(c) The report recognizes the outlier and says, "However, this regression also remained highly significant (P = 0.004) when the extreme data point (top right corner) was removed from the analysis." Repeat your analysis omitting the outlier. Show that the outlier influences regression inference by comparing the t statistic for testing slope with and without the outlier. Then verify the report's claim about the P-value of this test.

24.38 Time at the table. Does how long young children remain at the lunch table help predict how much they eat? Here are data on 20 toddlers observed over several months at a nursery school.¹⁰ "Time" is the average number of minutes a child spent at the table when lunch was served. "Calories" is the average number of calories the child consumed during lunch, calculated from careful observation of what the child ate each day. [data file: TIMEATTABLE]

Time       21.4   30.8   37.7   33.5   32.8   39.5   22.8   34.1   33.9   43.8
Calories    472    498    465    456    423    437    508    431    479    454

Time       42.4   43.1   29.2   31.3   28.6   32.9   30.6   35.1   33.0   43.7
Calories    450    410    504    437    489    436    480    439    444    408

(a) Make a scatterplot.
Find the correlation and the least-squares regression line. (Be sure to save the regression residuals.) Based on your work, describe the direction, form, and strength of the relationship.
(b) Check the conditions for regression inference. Parts (a) to (d) of Exercise 24.36 provide a handy outline. Use vertical limits −100 to 100 in your plot of the residuals against time to help you see the pattern. What do you conclude?
(c) Is there significant evidence that more time at the table is associated with more calories consumed? Give a 95% confidence interval to estimate how rapidly calories consumed changes as time at the table increases.

24.39 DNA on the ocean floor. We think of DNA as the stuff that stores the genetic code. It turns out that DNA occurs, mainly outside living cells, on the ocean floor. It is important in nourishing seafloor life. Scientists think that this DNA comes from organic matter that settles to the bottom from the top layers of the ocean. "Phytopigments," which come mainly from algae, are a measure of the amount of organic matter that has settled to the bottom. The data set contains concentrations of DNA and phytopigments (both in grams per square meter) in 116 ocean locations around the world.¹¹ Look first at DNA alone. Describe the distribution of DNA concentration and give a confidence interval for the mean concentration. Be sure to explain why your confidence interval is trustworthy in the light of the shape of the distribution. The data show surprisingly high DNA concentrations, and this by itself was an important finding. [data file: DNA]

24.40 Time at the table: prediction. Rachel attends the nursery school of Exercise 24.38. Over several months, Rachel averages 40 minutes at the lunch table. Give a 95% interval to predict Rachel's average calorie consumption at lunch. [data file: TIMEATTABLE]
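To make the difference between a confidence interval for the mean response and a prediction interval for one individual concrete, here is a minimal sketch in plain Python. It assumes the Time and Calories values are exactly those listed under Exercise 24.38 in this extract, that the conditions for inference have already been checked, and that t* = 2.101 (95% confidence, n − 2 = 18 degrees of freedom) is read from a t table. The function name `intervals_at` is hypothetical, and the standard errors use the usual textbook formulas, with s·sqrt(1/n + (x* − x̄)²/Sxx) for the mean response and s·sqrt(1 + 1/n + (x* − x̄)²/Sxx) for an individual response.

```python
import math

# Average minutes at the lunch table and average calories consumed for
# 20 toddlers, as listed under Exercise 24.38 in this extract.
time_min = [21.4, 30.8, 37.7, 33.5, 32.8, 39.5, 22.8, 34.1, 33.9, 43.8,
            42.4, 43.1, 29.2, 31.3, 28.6, 32.9, 30.6, 35.1, 33.0, 43.7]
calories = [472, 498, 465, 456, 423, 437, 508, 431, 479, 454,
            450, 410, 504, 437, 489, 436, 480, 439, 444, 408]


def intervals_at(x, y, x_star, t_star):
    """Fit y = a + b*x and return the fit, CI, and PI at x = x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx                      # least-squares slope
    a = y_bar - b * x_bar              # least-squares intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))       # regression standard error

    fit = a + b * x_star
    leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)        # for the mean response
    se_pred = s * math.sqrt(1 + leverage)    # for one future response

    ci = (fit - t_star * se_mean, fit + t_star * se_mean)
    pi = (fit - t_star * se_pred, fit + t_star * se_pred)
    return {"a": a, "b": b, "s": s, "fit": fit, "ci": ci, "pi": pi}


# Rachel averages 40 minutes at the table; t* = 2.101 for df = 18 (assumed).
rachel = intervals_at(time_min, calories, x_star=40, t_star=2.101)
print(rachel)
```

The prediction interval is the one that answers a question about a single child such as Rachel. It is always wider than the confidence interval at the same x*, because it accounts for individual variation about the line as well as uncertainty in the fitted line.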
Exercises 24.41 to 24.45 ask practical questions involving regression inference without step-by-step instructions. Do complete regression analyses, using the Plan, Solve, and Conclude steps of the four-step process to organize your answers. Follow the model of Example 24.9 (page 608) and the following discussion, and check the conditions as part of the Solve step.

24.41 Squirrels and their food supply. The introduction to Exercises 7.24 to 7.26 (pages 185 and 186) gives data on the abundance of the pine cones that red squirrels feed on and the mean number of offspring per female squirrel over 16 years. The strength of the relationship is remarkable because females produce young before the food is available. How significant is the evidence that more cones leads to more offspring? (Use a vertical scale from −2 to 2 in your residual plot to show the pattern more clearly.) [data file: SQUIRRELS]

24.42 A big-toe problem. Table 7.4 (page 194) and Exercises 7.47 and 7.49 describe the relationship between two deformities of the feet in young patients. Metatarsus adductus (MA) may help predict the severity of hallux abducto valgus (HAV). The paper that reports this study says, "Linear regression analysis, using the hallux abducto angle as the response variable, demonstrated a significant correlation between the metatarsus adductus and hallux abducto angles."¹² Do a suitable analysis to verify this finding. The study authors note that the scatterplot suggests that the variation in y may change as x changes, so they offer a more elaborate analysis as well. [data file: DEFORMITY]

24.43 Beavers and beetles. Exercise 5.53 (page 155) describes a study that found that the number of stumps from trees felled by beavers predicts the abundance of beetle larvae. Is there good evidence that more beetle larvae clusters are present when beavers have left more tree stumps? Estimate how many more clusters accompany each additional stump, with 95% confidence. [data file: BEAVERS]

24.44 Sulfur, the ocean, and the sun.
Sulfur in the atmosphere affects climate by influencing formation of clouds. The main natural source of sulfur is dimethylsulfide (DMS) produced by small organisms in the upper layers of the oceans. DMS production is in turn influenced by the amount of energy the upper ocean receives from sunlight. Exercise 4.30 (page 117) gives monthly data on solar radiation dose (SRD, in watts per square meter) and surface DMS concentration (in nanomolars) for a region in the Mediterranean. Do the data provide convincing evidence that DMS increases as SRD increases? We also want to estimate the rate of increase, with 90% confidence. [data file: SULFUR]

24.45 DNA on the ocean floor. Another conclusion of the study introduced in Exercise 24.39 was that organic matter settling down from the top layers of the ocean is the main source of DNA on the seafloor. An important piece of evidence is the relationship between DNA and phytopigments. Do the data give good reason to think that phytopigment concentration helps explain DNA concentration? (Try vertical limits −1 to 1 to make the pattern of your residual plot clearer.) [data file: DNA]

24.46 A lurking variable (optional). Return to the data on selling price versus appraised value for beachfront condominiums that are the basis for the Check Your Skills Exercises 24.16 to 24.24. The data are in order by date of the sale, and the data table includes the number of months from the start of the data period. Here are the residuals from the regression of selling price on appraised value (rounded): [data file: CONDORES]

 −55.90    190.90    332.72    267.81    −18.90   −120.78    −71.54
−125.09    589.37    −77.84   −192.01    −92.05   −143.97   −234.53
−103.08   −199.21    −63.88    119.76    523.78   −252.09   −132.21
−151.95    256.58   −269.22   −182.24    193.30    168.15   −155.85

(a) Plot the residuals against the explanatory variable (appraised value). To make the pattern clearer, use vertical limits −600 to 600.
Does the pattern you see agree with the conditions of linear relationship and constant standard deviation needed for regression inference?
(b) Make a stemplot of the residuals. Are there strong deviations from Normality that would prevent regression inference?
(c) Next, plot the residuals against month. Are the positive and negative residuals randomly scattered, as would be the case if the conditions for regression inference are satisfied? (Comment: Prices for beachfront property were rising rapidly during the first 36 months of this period. Because property is reassessed just once a year, selling prices might pull away from appraised values over time in this period, creating a pattern of many negative residuals followed by several positive residuals. As this example illustrates, it is often wise to plot residuals against important lurking variables as well as against the explanatory variable.)

24.47 Standardized residuals (optional). Software often calculates standardized residuals as well as the actual residuals from regression. Because the standardized residuals have the standard z-score scale, it is easier to judge whether any are extreme. Figure 24.13 (page 616) and the associated data include the standardized residuals for the regression of gate velocity on piston wall thickness.
(a) Find the mean and standard deviation of the standardized residuals. Why do you expect values close to those you obtain?
(b) Make a stemplot of the standardized residuals. Are there any striking deviations from Normality?
(c) The most extreme standardized residual is z = −2.21. Minitab flags this as "large." What is the probability that a standard Normal variable takes a value this extreme (that is, less than −2.21 or greater than 2.21)? Your result suggests that a residual this extreme would be a bit unusual when there are only 12 observations.
That's why we examined Observation 9 in Exercise 24.28.

24.48 Tests for the intercept (optional). Figure 24.7 (page xxx) gives Minitab output for the regression of blood alcohol content (BAC) on number of beers consumed. The t test for the hypothesis that the population regression line has slope β = 0 has P < 0.001. The data show a positive linear relationship between BAC and beers. We might expect the intercept α of the population regression line to be 0, because no beers (x = 0) should produce no alcohol in the blood (y = 0). To test

H0: α = 0
Ha: α ≠ 0

we use a t statistic formed by dividing the least-squares intercept a by its standard error SE_a. Locate this statistic in the output of Figure 24.7 and verify that it is in fact a divided by its standard error. What is the P-value? Do the data suggest that the intercept is not 0?

24.49 Confidence intervals for the intercept (optional). The output in Figure 24.7 (page 603) allows you to calculate confidence intervals for both the slope β and the intercept α of the population regression line of BAC on beers in the population of all students. Confidence intervals for the intercept α have the familiar form a ± t*SE_a with degrees of freedom n − 2. What is the 95% confidence interval for the intercept? Does it contain 0, the value we might guess for α?

EXPLORING THE WEB

24.50 Predicting batting averages. As you did in Exercise 5.59, go to www.mlb.com/ and find the batting averages for a diverse set of 30 players for both the 2009 and 2010 seasons. You can click on the "Stats" tab to find the results for the current season as well as historical data. You should select only players who played in at least 50 games both seasons. Find the least-squares regression line for predicting batting average in 2010 from that in 2009 based on your sample of 30 players.
In 2009, the major league leader in batting was Joe Mauer, who had a batting average of .365. Find a 95% prediction interval for the 2010 batting average of someone who hit .365 in 2009. How does this prediction compare with Joe Mauer's 2010 batting average?

24.51 Olympic medal counts. In Exercise 4.48 you made a scatterplot of the Winter Olympics medal counts for 2002 and 2006. We investigate these medal counts further. Go to the Chance News Web site at www.causeweb.org/wiki/chance/index.php/Chance_News_61#Predicting_medal_counts and read the article "Predicting Medal Counts." Next, search the Web (as you did in Chapter 4) and locate the Winter Olympics medal counts for 2002 and 2006 (I found Winter Olympics medal counts on Wikipedia). Find the equa
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Step: 2
Step: 3