Question

1 Approved Answer

Posted on Oct 13, 2024


Carleton University
School of Mathematics and Statistics
Sampling Methodology: STAT 3507 - Winter 2016
Assignment #3 (Due Mar 10, 2016)

1. A sample survey is being planned in which it is desired to estimate the ratio of medical expenses to family income in a large city containing 234,785 families, as well as the total money spent on medical expenses, based on the data in the table below.

Family Number   Family Size   Weekly Net Income ($)   Weekly Medical Expenses ($)
 1              2             372                      28.60
 2              3             372                      41.60
 3              3             522                      45.40
 4              5             390                      61.00
 5              4             348                      82.40
 6              7             552                      56.40
 7              2             528                      48.40
 8              4             574                      60.00
 9              2             498                      48.40
10              5             372                      88.80
11              3             378                      26.80
12              6             372                      39.60
13              4             360                      58.80
14              4             450                      54.20
15              2             540                      44.20
16              5             450                      75.40
17              3             414                      45.20
18              4             498                      72.00
19              2             510                      21.20
20              4             438                      55.40
21              2             396                      51.90
22              5             348                      46.60
23              3             462                      79.60
24              4             414                      33.60
25              7             390                      75.60
26              3             462                      69.60
27              3             414                      57.40
28              6             570                     126.00
29              2             462                      39.10
30              2             414                      43.20
31              6             414                      36.40
32              4             402                      40.20
33              2             378                      41.40

a. How many families would have to be sampled if it is desired to estimate the ratio of medical expenses to family income in the large city with 95% certainty to within 5% of its true value?
b. How many families would have to be sampled if it is desired to estimate the total money spent on medical expenses (using the customary estimator) to within 10% of its true value? Use a 95% confidence level.
c. How many families would have to be sampled if it is desired to estimate the total money spent on medical expenses (using the ratio estimator) to within 10% of its true value? Use a 95% confidence level.

2. It is desired to estimate the total Y of nurse practitioner hours spent in direct patient care in a large HMO during a given year. This is to be done by taking a simple random sample of patients and determining, for each visit during the year, the number of nurse practitioner hours spent during the visit. It is known that there are 3524 members of the HMO and that 8950 visits took place during that year.
A small pilot sample of 10 patients yielded the following data:

Patient   Visits   Nurse Practitioner Hours
 1        0        0
 2        5        3
 3        1        0
 4        2        6
 5        3        3
 6        7        0
 7        1        0
 8        1        2
 9        0        0
10        4        3

a. Based on these pilot data, would you recommend a customary estimator or a ratio estimator as the estimation method for the sample survey? Document the reasons for your recommendation.
b. Based on these data, how many patients would have to be sampled to estimate the total of nurse practitioner hours with a coefficient of variation not exceeding 10%?
c. Based on these data, how many patients would have to be sampled to estimate the total of nurse practitioner hours with a coefficient of variation not exceeding 10%?

3. A corporation is interested in estimating the total earnings from sales of color television sets at the end of a 3-month period. The total earnings figures are available for all districts within the corporation for the corresponding 3-month period of the previous year. A simple random sample of 13 district offices is selected from the 123 offices within the corporation. The resulting data are shown in the table below:

Office   Three-month data from previous year, $x_i$   Three-month data from current year, $y_i$
 1        550                                          610
 2        720                                          780
 3       1500                                         1600
 4       1020                                         1030
 5        620                                          600
 6        980                                         1050
 7        928                                          977
 8       1200                                         1440
 9       1350                                         1570
10       1750                                         2210
11        670                                          980
12        729                                          865
13       1530                                         1710

a. Obtain a scatter plot of $y_i$ versus $x_i$ and fit a simple linear regression. What estimator does the model suggest? Explain.
b. Estimate the mean ($\bar{Y}$) earnings for offices within the corporation and place a bound on the error of estimation using the estimator suggested by (a).
c. Estimate the total earnings ($Y$) and place a bound on the error of estimation using the estimator suggested in (a).

(Hint: for (b) and (c), the answer is either the ratio estimator, the regression estimator, or the difference estimator. Take $X = 128{,}200$.)

4.
An investigator has a colony of N = 763 rats that have been subjected to a standard drug. The average length of time to thread a maze correctly under the influence of the standard drug was found to be $\bar{X} = 17.2$ seconds. The investigator now would like to subject a random sample of 11 rats to a new drug. The data obtained are shown in the table below:

Rat   Standard drug, $x_i$   New drug, $y_i$
 1    14.3                   15.2
 2    15.7                   16.1
 3    17.8                   18.1
 4    17.5                   17.6
 5    13.2                   14.5
 6    18.8                   19.4
 7    17.6                   17.5
 8    14.3                   14.1
 9    14.9                   15.2
10    17.9                   18.1
11    19.2                   19.5

a. Use the ratio estimator to estimate the average ($\bar{Y}$) time required to thread the maze while under the influence of the new drug, and construct a 95% confidence interval.
b. Use the regression estimator to estimate $\bar{Y}$ and construct a 95% confidence interval.
c. Use a difference estimator to estimate $\bar{Y}$ and construct a 95% confidence interval.
d. Which method would you recommend?

5. A variance estimator of $\hat{R}$ (the ratio estimator) of a population ratio $R$ is

$$\hat{V}(\hat{R}) = \left(1 - \frac{n}{N}\right) \frac{1}{n\bar{X}^2} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{R} x_i)^2}{n - 1}$$

a. Show that the above is algebraically equivalent to

$$\hat{V}(\hat{R}) = \left(1 - \frac{n}{N}\right) \frac{1}{n\bar{X}^2} \left( s_y^2 - 2\hat{R}\hat{\rho}\, s_x s_y + \hat{R}^2 s_x^2 \right)$$

where $\hat{\rho}$ is the sample correlation coefficient of x and y for the values in the sample, and $s_x^2$ and $s_y^2$ are the sample variances for x and y respectively.

b. Show that the ratio estimator of the mean may be more efficient than the simple mean if and only if

$$\hat{\rho} > \frac{1}{2} \cdot \frac{cv(x)}{cv(y)}$$

where cv is the estimate of the coefficient of variation. Note that $cv(x) = s_x / \bar{X}$; $cv(y) = s_y / \bar{Y}$; $\hat{\rho} = s_{xy} / (s_x s_y)$.

Ratio and Regression Estimators (Examples) under Simple Random Sampling Without Replacement

Set 1: Estimation of Ratio

Let us consider a community having eight community areas. Suppose that we wish to estimate the ratio R of total pharmaceutical expenses Y to total medical expenses X among all persons in the community.
To do this, a simple random sample of two community areas is to be taken and every household in each sample community area is to be interviewed. The data for the community areas are given in Table 1 below:

Table 1: Pharmaceutical Expenses and Total Medical Expenses Among All Residents of Eight Community Areas

Community Area   Total Pharmaceutical Expenses, Y ($)   Total Medical Expenses, X ($)
1                  100,000                                300,000
2                   50,000                                200,000
3                   75,000                                300,000
4                  200,000                                600,000
5                  150,000                                450,000
6                  175,000                                520,000
7                  170,000                                680,000
8                  150,000                                450,000
Total            1,070,000                              3,500,000

Suppose that community areas 2 and 5 were selected in the sample.

a. State the target population for the study.
b. What are the elements in this sampling design?
c. What are the sampling units?
d. Estimate the ratio of total pharmaceutical expenses to total medical expenses.
e. Estimate the variance of the estimate in (d).
f. Based on the sample data, how many community areas would have to be sampled if it is desired to estimate the population ratio with 95% certainty to within 5% of its true value?

Solution

a. The target population includes all 8 community areas in that particular community.
b. The element for this sampling design is a community area.
c. The sampling unit is a community area.
d. The estimate of the population ratio of total pharmaceutical expenses to total medical expenses is given by

$$\hat{R} = \frac{\hat{Y}}{\hat{X}} = \frac{N \sum_{k=1}^{n} y_k / n}{N \sum_{k=1}^{n} x_k / n} = \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} x_k} = \frac{50{,}000 + 150{,}000}{200{,}000 + 450{,}000} = 0.3077$$

e. The variance estimate of $\hat{R}$ is approximately given by

$$\hat{V}(\hat{R}) \approx \left(1 - \frac{n}{N}\right) \frac{1}{\bar{X}^2} \cdot \frac{s_r^2}{n}, \quad \text{where } s_r^2 = \frac{1}{n - 1} \sum_{k=1}^{n} (y_k - \hat{R} x_k)^2 = s_y^2 - 2\hat{R}\hat{\rho}\, s_x s_y + \hat{R}^2 s_x^2$$

From the data, $\bar{X} = 3{,}500{,}000 / 8 = 437{,}500$; $s_y^2 = 5 \times 10^{9}$; $s_x^2 = 3.125 \times 10^{10}$; $s_{xy} = 1.25 \times 10^{10}$; $\hat{\rho} = 1$. Therefore

$$\hat{V}(\hat{R}) \approx \left(1 - \frac{n}{N}\right) \frac{1}{\bar{X}^2} \cdot \frac{s_y^2 - 2\hat{R}\hat{\rho}\, s_x s_y + \hat{R}^2 s_x^2}{n} = \dots = 0.000521589$$

f.
Let $\delta$ be the relative margin within which the estimate must fall of the true value, here $\delta = 5\%$. From the information provided,

$$P\left(|\hat{R} - R| \le \delta R\right) = 100(1 - \alpha)\% = 95\% \qquad (1)$$

and from the confidence bound,

$$P\left(|\hat{R} - R| \le z_{\alpha/2} \sqrt{V(\hat{R})}\right) = 100(1 - \alpha)\% \qquad (2)$$

From (1) and (2),

$$z_{\alpha/2} \sqrt{V(\hat{R})} = \delta R, \quad \text{that is,} \quad z_{\alpha/2} \sqrt{\frac{N - n}{N - 1} \cdot \frac{1}{\bar{X}^2} \cdot \frac{\sigma_R^2}{n}} = \delta R$$

Solving for n leads to

$$n = \frac{N z_{\alpha/2}^2 \sigma_R^2}{z_{\alpha/2}^2 \sigma_R^2 + (N - 1)(\delta R)^2 \bar{X}^2} \qquad (3)$$

For large sample size, the normality assumption may apply and so one may use $z_{\alpha/2}$. Replacing $\sigma_R^2$ by the sample estimate $s_r^2 = 266227812.5$ and substituting $R = Y / X = 1070000 / 3500000 = 0.3057$, $\bar{X} = 3500000 / 8 = 437500$ and $z_{\alpha/2} = z_{0.025} = 1.96$ into (3) leads to $n = 6.125 \approx 6$.

Set 2: Estimation of Total

Suppose that a road having a length of 24 miles traverses areas that can be classified as urban and rural and that the road is divided into eight segments, each having a length equal to 3 miles. A sample of three segments is taken, and on each segment sampled, special equipment is installed for purposes of counting the number of total motor vehicle miles traveled by cars and trucks on the segment during a particular year. In addition, a record of all accidents occurring on each sample segment is kept. The number of truck miles and the number of accidents in which a truck was involved during a certain period are given in Table 2 for each of the eight segments in the population. Suppose that we take a simple random sample of three segments for purposes of estimating the total number of truck miles traveled on the road.

Table 2: Truck Miles and Number of Accidents Involving Trucks by Type of Road Segment

Segment   Type    Truck Miles (x1000)   Number of Accidents
1         Urban   6327                  8
2         Rural   2555                  5
3         Urban   8691                  9
4         Urban   7834                  9
5         Rural   1586                  5
6         Rural   2034                  1
7         Rural   2015                  9
8         Rural   3012                  4

Suppose that segments 1, 3 and 4 were selected in the sample.

a. Estimate the total number of truck miles traveled on the road using the customary and ratio estimators.
b.
Estimate the 95% confidence interval for the total number of truck miles using the customary and the ratio estimators.
c. How do these estimators compare?
d. Based on the sample data, how many road segments would have to be sampled if it is desired to estimate the total number of truck miles with 95% certainty to within 10% of its true value? Use the customary estimator and the ratio estimator.

Solution

Let N = 8, n = 3, y = number of truck miles traveled on a road segment, and x = number of accidents involving trucks on that segment.

a. The customary estimator of the total number of truck miles is

$$\hat{Y} = \frac{N}{n} \sum_{k=1}^{n} y_k = \frac{8}{3} (6327 + 8691 + 7834) = 60{,}938.67 \ (\times 1000)$$

The ratio estimator of the total number of truck miles is

$$\hat{Y}_r = \frac{\bar{y}}{\bar{x}} X = \frac{22852 / 3}{26 / 3} \times 50 = 43{,}946.15 \ (\times 1000)$$

b. The variance estimate under the customary estimator is given by

$$\hat{V}(\hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_y^2}{n}, \quad s_y^2 = 1{,}432{,}332.3333$$

$$\hat{V}(\hat{Y}) = 8^2 \left(1 - \frac{3}{8}\right) \frac{1{,}432{,}332.3333}{3} = 19{,}097{,}764.44$$

The 95% confidence interval for the total is

$$\hat{Y} \pm z_{0.025} \sqrt{\hat{V}(\hat{Y})} = 60{,}938.67 \pm 1.96 \sqrt{19{,}097{,}764.44} = 60{,}938.67 \pm 1.96 \times 4{,}370.0989$$

The variance estimate under the ratio estimator is given by

$$\hat{V}(\hat{Y}_r) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n} = 8^2 \left(1 - \frac{3}{8}\right) \frac{555730.5148}{3} = 7{,}409{,}740.197$$

where

$$s_r^2 = s_y^2 - 2\hat{R}\hat{\rho}\, s_x s_y + \hat{R}^2 s_x^2 = \dots = 555730.5148, \quad s_x^2 = 0.3333; \ \hat{R} = 878.9231; \ \hat{\rho} = 0.9337$$

The 95% confidence interval for the total is

$$\hat{Y}_r \pm z_{0.025} \sqrt{\hat{V}(\hat{Y}_r)} = 43{,}946.15 \pm 1.96 \sqrt{7{,}409{,}740.197} = 43{,}946.15 \pm 1.96 \times 2{,}722.084$$

c. The ratio estimate appears to be more efficient than the customary estimate, since the variance estimate under the ratio estimator is smaller than that under the customary estimator.

d. Let $\delta$ be the relative margin within which the estimate must fall of the true value, here $\delta = 10\%$. From the information provided,

$$P\left(|\hat{Y}_r - Y| \le \delta Y\right) = 100(1 - \alpha)\% = 95\% \qquad (1)$$

and from the confidence bound,

$$P\left(|\hat{Y}_r - Y| \le z_{\alpha/2} \sqrt{V(\hat{Y}_r)}\right) = 100(1 - \alpha)\% \qquad (2)$$

From (1) and (2),

$$z_{\alpha/2} \sqrt{V(\hat{Y}_r)} = \delta Y, \quad \text{that is,} \quad z_{\alpha/2} \sqrt{N^2 \cdot \frac{N - n}{N - 1} \cdot \frac{\sigma_R^2}{n}} = \delta Y$$

Solving for n leads to

$$n = \frac{N^3 z_{\alpha/2}^2 \sigma_R^2}{N^2 z_{\alpha/2}^2 \sigma_R^2 + (N - 1)(\delta Y)^2} \qquad (3)$$

For large sample size, the normality assumption may apply and so one may use $z_{\alpha/2}$.
Replacing $\sigma_R^2$ by the sample estimate $s_r^2 = 555730.5148$ and substituting $Y = 34{,}054$ and $z_{\alpha/2} = z_{0.025} = 1.96$ into (3) leads to $n = 5.018 \approx 5$. Following a similar argument for the customary estimator, replacing the population variance $\sigma^2$ by $s_y^2 = 1{,}432{,}332.3333$ in (3) leads to $n = 6.5 \approx 7$.
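
The Set 2 figures can be checked numerically. The following is a minimal Python sketch, not part of the original handout: the function and variable names (`residual_variance`, `variance_of_total`, `sample_size_for_total`, and so on) are illustrative choices, and the residual variance is computed directly from the residuals $y_k - \hat{R} x_k$, which is algebraically the same quantity as $s_y^2 - 2\hat{R}\hat{\rho}\, s_x s_y + \hat{R}^2 s_x^2$.

```python
from statistics import mean, variance


def residual_variance(x, y):
    """s_r^2 = sum((y_k - R_hat * x_k)^2) / (n - 1), where R_hat = sum(y) / sum(x)."""
    r_hat = sum(y) / sum(x)
    return sum((yk - r_hat * xk) ** 2 for xk, yk in zip(x, y)) / (len(x) - 1)


def variance_of_total(N, n, s2):
    """Estimated variance of an estimated total: N^2 * (1 - n/N) * s^2 / n."""
    return N ** 2 * (1 - n / N) * s2 / n


def sample_size_for_total(N, s2, total, delta, z=1.96):
    """Formula (3): n = N^3 z^2 s^2 / (N^2 z^2 s^2 + (N - 1) (delta * total)^2)."""
    zs = z ** 2 * s2
    return N ** 3 * zs / (N ** 2 * zs + (N - 1) * (delta * total) ** 2)


# Set 2 data for the sampled segments 1, 3 and 4 (Table 2).
N = 8
y = [6327, 8691, 7834]  # truck miles (x1000)
x = [8, 9, 9]           # accidents involving trucks
X_total = 50            # accidents over all eight segments
Y_total = 34054         # truck miles (x1000) over all eight segments
n = len(y)

# Part (a): customary (expansion) and ratio estimates of the total.
customary_total = N * mean(y)
ratio_total = sum(y) / sum(x) * X_total

# Part (b): variance estimates under each estimator.
s2_y = variance(y)              # sample variance of y
s2_r = residual_variance(x, y)  # residual variance about the ratio line
var_customary = variance_of_total(N, n, s2_y)
var_ratio = variance_of_total(N, n, s2_r)

# Part (d): unrounded sample sizes for a 10% margin at 95% confidence.
n_ratio = sample_size_for_total(N, s2_r, Y_total, delta=0.10)
n_customary = sample_size_for_total(N, s2_y, Y_total, delta=0.10)

print(f"customary total: {customary_total:.2f}, variance: {var_customary:.2f}")
print(f"ratio total:     {ratio_total:.2f}, variance: {var_ratio:.2f}")
print(f"sample size (ratio): {n_ratio:.3f}, (customary): {n_customary:.3f}")
```

The sample-size function returns the unrounded value of n, so the rounding convention is left to the reader.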