Question

1 Approved Answer

Posted on Jun 11, 2024

Z-test

Exercises due May 25, 2022 0?:59 EDT

In this section, we will discuss a popular and versatile approach to hypothesis testing on continuous data, the z-test, which makes use of the Central Limit Theorem (CLT). We will apply this test to the sleeping drug study. Afterwards, we will see how the z-test is also helpful as an approximation when the data is discrete, such as in the mammography study.

Modeling choice for the sleeping drug study

When our data is binary, we are typically limited to the Bernoulli model and the corresponding binomial model for the number of targeted observations. When our data can take on continuous values, we have more choices. Depending on the application, we can use one of several well-known distributions, including the uniform, exponential, and normal distributions.

Recall the data collected for the sleeping drug study:

patient   1    2    3    4    5    6    7    8    9    10
drug      6.1  7.0  8.2  7.6  6.5  7.8  6.9  6.7  7.4  5.8
placebo   5.2  7.9  3.9  4.7  5.3  4.8  4.2  6.1  3.8  6.3

Suppose our candidate models for the difference in number of hours slept are the uniform and the Gaussian models. Both the support and the distribution are important considerations:

- The support of a model is the set of values that the observations can take in the model. In the sleeping drug study, the number of hours slept in a day is bounded above, so the difference is also bounded. This points in favor of the uniform model, as it has a bounded support, while a Gaussian model always has unbounded support.
- The distribution of a continuous model is based on the shape of the pdf. In model selection, this can be decided based on solving a theoretical model, looking at the empirical distribution of observations, or common knowledge. The number of hours slept by an adult is known to be centered around 8 hours, and outliers tend to be rare, so this points towards the Gaussian model for the sleeping drug study.
Weighing these two considerations, in the sleeping drug study, we select the normal distribution and then ensure that the variance parameter is sufficiently small, so that the probability of falling outside the realistic boundary is negligible. Furthermore, we can argue towards a normal distribution by reasoning that the number of hours slept is a cumulative effect of a large number of biological and lifestyle variables. As a lot of these variables are unrelated to one another, the cumulative effect can be approximated by a normal distribution. This is justified by the Central Limit Theorem (CLT), which is covered in more detail below, and is the important result that establishes the z-test.

Central limit theorem (CLT) and the z-test statistic

Suppose that we have observations $X_1, \dots, X_n$, which are independent and identically distributed based on a probability model. Under a few regularity assumptions (such as the model having a finite second moment), the distribution of the sample mean $\bar{X}$ will approximate a normal distribution when the sample size becomes sufficiently large (typically $n > 30$).

The central limit theorem (CLT) states that: when sampling random variables $X_1, \dots, X_n$ from a population with mean $\mu$ and variance $\sigma^2$, $\bar{X}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$ when $n$ is large:

$$\bar{X} = \frac{X_1 + \dots + X_n}{n} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for } n \text{ large}.$$

Hence, we can define a test statistic

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}},$$

which approximately follows a standard normal distribution when $n$ is large:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

The test statistic $z$ is called an (approximate) pivotal quantity, since its (approximate) distribution does not depend on the parameters $\mu$ or $\sigma$. We can use the cdf of a pivotal quantity to compute the p-value (which is the probability for the test statistic to take on a value at least as extreme as the one observed), and compare the p-value with $\alpha$, the significance level, to decide whether to reject the null hypothesis $H_0$.

Z-test in the sleeping drug study?
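As a quick numerical check of the CLT statement, the sketch below draws many samples from a clearly non-normal distribution and looks at the resulting z values. The choice of an Exponential(1) population, the sample size of 40, and the helper names `z_statistic` and `simulate_z` are illustrative assumptions, not part of the course material.

```python
import math
import random
import statistics


def z_statistic(sample, mu, sigma):
    """z = (sample mean - mu) / (sigma / sqrt(n)) for known mu and sigma."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (sigma / math.sqrt(n))


def simulate_z(n=40, reps=5000, seed=0):
    """Draw `reps` samples of size n from Exponential(1); return their z values."""
    rng = random.Random(seed)
    # Exponential with rate 1 has mean 1 and standard deviation 1.
    return [
        z_statistic([rng.expovariate(1.0) for _ in range(n)], mu=1.0, sigma=1.0)
        for _ in range(reps)
    ]


z_values = simulate_z()
z_mean = statistics.fmean(z_values)
z_std = statistics.pstdev(z_values)
# Share of z values at or below 1.645, the 95% point of N(0, 1).
share_below = sum(z <= 1.645 for z in z_values) / len(z_values)
print(f"mean={z_mean:.3f}, std={z_std:.3f}, P(z <= 1.645)={share_below:.3f}")
```

If the approximation holds, the mean should be close to 0, the standard deviation close to 1, and the share close to 0.95. A skewed population such as the exponential tends to need a larger n for a tight match than a symmetric one.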
We are interested in testing the efficacy of a sleeping drug. The data collection process recorded the hours of sleep of 10 patients under the drug and under the placebo:

patient   1    2    3    4    5    6    7    8    9    10
drug      6.1  7.0  8.2  7.6  6.5  7.8  6.9  6.7  7.4  5.8
placebo   5.2  7.9  3.9  4.7  5.3  4.8  4.2  6.1  3.8  6.3

Now, we want to answer the question: "Does the drug increase hours of sleep enough to matter?"

We model the difference of hours of sleep between the drug and the placebo for each patient as a normal random variable:

Model: $X_1, \dots, X_{10} \sim N(\mu, \sigma^2)$ ($X_1$, for example, would be $6.1 - 5.2 = 0.9$).

From this, we state the hypotheses for a one-sided test:

- Null hypothesis ($H_0$): $\mu = 0$
- Alternative hypothesis ($H_A$): $\mu > 0$

Since the data $X_i$ are modeled as independent Gaussians, the z-test statistic described above has a standard normal distribution under the null hypothesis $H_0$, even without using the central limit theorem.

We consider using $z$ as the test statistic. However, to calculate $z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ we need to know the true value of the variance $\sigma^2$. Since we do not know the population variance in this experiment, we cannot use the z-test. In general, if samples cannot be modeled as Gaussian variables, then the sample size also needs to be large in order to use the standard normal to approximate the distribution of $z$ using the CLT. The t-test resolves both issues of the unknown true variance and the required large sample size.

Application to the mammography study

We conduct the z-test for the mammography study with the following model and hypotheses:

- Model: $X_1, \dots, X_{31000} \sim \text{Bernoulli}(\pi)$, each indicating whether a patient in the treatment group dies of breast cancer
- Null hypothesis $H_0$: $\pi = 0.002$; Alternative hypothesis $H_A$: $\pi < 0.002$.
As done in lecture 1, we have assumed in the null hypothesis that $\pi \approx 0.002$ is the true reference value for the death rate without treatment. Hence, we will assume the true variance of $X$ to be the corresponding value $\sigma^2 = \pi(1 - \pi)$, so that $\sigma \approx 0.045$. The z-test statistic is:

$$z = \frac{\bar{X} - \pi}{\sigma/\sqrt{n}} \approx -3.0268.$$

The p-value can be calculated from the area under the pdf of the standard normal distribution to the left of the z value above: $p = \Phi(z)$, where $\Phi$ is the standard normal cdf.

Z Test Exercise

2 points possible (graded)

1. Calculate the p-value for the mammography study using the z-test described above. (Please enter the value with a precision of 4 digits after the decimal point. Hint: you could use the norm.cdf function in the scipy.stats package in Python.)

2. Let $X_i$ be the difference of hours of sleep between drug and placebo. What is a reasonable data generation model?

- $X_i \sim N(\mu, \sigma^2)$, $X_i$ independent of each other
- $X_i \sim \text{Poisson}(\lambda)$, $X_i$ independent of each other
- $X_i \sim \text{Binomial}(n, p)$, $X_i$ independent of each other
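For the first exercise, a minimal sketch of the left-tail p-value computation might look as follows. It uses `statistics.NormalDist` from the Python standard library; `scipy.stats.norm.cdf`, as suggested in the hint, should give the same value. The inputs (31,000 patients in the treatment group, 39 deaths among them, and a reference rate of 63/31,000) are assumptions based on figures commonly quoted for this study, and the helper name `one_sided_z_test` is hypothetical; substitute the values from your course materials if they differ.

```python
import math
from statistics import NormalDist


def one_sided_z_test(x_bar, pi0, n):
    """Return (z, left-tail p-value) for H0: pi = pi0 against HA: pi < pi0."""
    sigma = math.sqrt(pi0 * (1.0 - pi0))   # Bernoulli standard deviation under H0
    z = (x_bar - pi0) / (sigma / math.sqrt(n))
    p_value = NormalDist().cdf(z)          # area to the left of z under N(0, 1)
    return z, p_value


# ASSUMED inputs (not taken from the text above):
n = 31_000               # patients in the treatment group
deaths_treatment = 39    # breast cancer deaths in the treatment group
pi0 = 63 / 31_000        # reference death rate without treatment, about 0.002

z, p_value = one_sided_z_test(deaths_treatment / n, pi0, n)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

Because the alternative is one-sided ($\pi$ smaller than the reference), only the left tail is used; a two-sided test would double this area.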