16-13. It is not unusual for a regression study to be performed using a data set containing...

Question:

16-13. It is not unusual for a regression study to be performed using a data set containing outlier observations; that is, data points that are remote or distinctly separated from the main scatter of observations. Clearly outlier points can have a significant effect on the regression results—they can exhibit inordinately large residuals that, in turn, can impact the location of the estimated regression line.

How can outliers be identified? Moreover, if a sample point is deemed an outlier, should it be retained or eliminated? A data point can be an outlier with respect to its X coordinate, its Y coordinate, or both (see Figure 16.E.1).

[Scatter plot: Y (vertical axis) against X (horizontal axis), with labeled points (X1,Y1), (X2,Y2), (X3,Y3), (X4,Y4), and (X5,Y5).]

Figure 16.E.1 Points (X1,Y1) and (X2,Y2) are outliers with respect to both the X and Y coordinates, since both coordinates lie well outside the main scatter of points; (X3,Y3) is an outlier with respect to its Y coordinate (its X coordinate is near the middle of the range of X values); and (X4,Y4) is outlying with respect to its X coordinate (its Y coordinate is near the middle of the range of Y values).
Outlying points may not be equally influential on the regression results.
That is, although points (X1,Y1) and (X2,Y2) are consistent with the regression equation passing through the nonoutlying points, this will not be the case for (X5,Y5).

(a) Identifying outlying X values.
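It can be shown that the variance of the residual $e_i$ can be expressed as

$$\sigma^2_{e_i} = \sigma^2_{\varepsilon}\,(1 - h_{ii}), \tag{16.E.6}$$

where

$$h_{ii} = \frac{\sum_j X_j^2 - 2n\bar{X}X_i + nX_i^2}{n\left(\sum_j X_j^2 - n\bar{X}^2\right)} = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2} = \frac{1}{n} + \frac{x_i^2}{\sum_j x_j^2}, \quad i = 1, \ldots, n, \tag{16.E.7}$$

with $0 \le h_{ii} \le 1$ and $\sum_{i=1}^{n} h_{ii}$ = number of parameters estimated = 2.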
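The equality of the first two forms in (16.E.7), and the fact that the leverages sum to 2, can be checked with a short calculation. The shorthand $S_{xx}$ is introduced here only for convenience; it is not part of the original notation.

```latex
\begin{align*}
S_{xx} &= \sum_{j=1}^{n} (X_j - \bar{X})^2 = \sum_{j=1}^{n} X_j^2 - n\bar{X}^2, \\
\frac{1}{n} + \frac{(X_i - \bar{X})^2}{S_{xx}}
  &= \frac{S_{xx} + n(X_i - \bar{X})^2}{n S_{xx}}
   = \frac{\sum_{j} X_j^2 - 2n\bar{X}X_i + nX_i^2}{n\left(\sum_{j} X_j^2 - n\bar{X}^2\right)}, \\
\sum_{i=1}^{n} h_{ii} &= n \cdot \frac{1}{n} + \frac{\sum_{i} (X_i - \bar{X})^2}{S_{xx}} = 1 + 1 = 2.
\end{align*}
```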
The quantity $h_{ii}$, called the leverage of $X_i$, is useful in indicating whether or not $X_i$ is outlying since it essentially measures the distance of $X_i$ from the center of all X values. If $X_i$ exhibits a large leverage value $h_{ii}$, then it makes a substantial contribution to the determination of the estimated Y value $\hat{Y}_i$; that is, the larger $h_{ii}$ is, the more important $Y_i$ is in determining $\hat{Y}_i$. In addition, as (16.E.6) reveals, the larger $h_{ii}$ is, the smaller the variance of $e_i$ is, and so the smaller the difference $e_i = Y_i - \hat{Y}_i$ will tend to be. (Since $h_{ii}$ and the magnitude of $e_i$ tend to vary inversely, the detection of outliers cannot be left to an examination of the least squares residuals $e_i$ alone.)
A leverage value $h_{ii}$ is deemed large if it is more than twice as large as the mean leverage value $\bar{h} = \sum_{i=1}^{n} h_{ii}/n = 2/n$. That is, $X_i$ is considered to be an outlier if $h_{ii} > 4/n$. Alternatively, it is sometimes suggested that $X_i$ exhibits high leverage if $h_{ii} > 0.5$ and moderate leverage if $0.2 \le h_{ii} \le 0.5$.
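The leverage rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the function name `leverages` and the variable names are assumptions made here, and the X values are the n = 15 observations from the data set given at the end of this exercise.

```python
# Leverage values for a simple linear regression, following (16.E.7).
# X holds the 15 sample X values from the exercise's data set.
X = [4, 6, 7, 7, 8, 9, 10, 11, 12, 14, 14, 16, 16, 18, 18]


def leverages(x):
    """Return h_ii = 1/n + (X_i - Xbar)^2 / sum_j (X_j - Xbar)^2 for each X_i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


h = leverages(X)
n = len(X)
cutoff = 4.0 / n  # twice the mean leverage, which is 2/n

# 0-based indices of points above the 4/n rule, and of those in the
# "moderate leverage" band 0.2 <= h_ii <= 0.5.
high = [i for i, hi in enumerate(h) if hi > cutoff]
moderate = [i for i, hi in enumerate(h) if 0.2 <= hi <= 0.5]

print(f"sum of leverages = {sum(h):.4f}")
print(f"4/n cutoff       = {cutoff:.4f}")
for i, hi in enumerate(h):
    print(f"X = {X[i]:2d}  h = {hi:.4f}")
print("above 4/n:", high)
print("moderate (0.2 to 0.5):", moderate)
```

Printing the sum of the leverages is a quick sanity check, since it should equal the number of estimated parameters.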

(b) Identifying outlying Y values.
To detect outlying Y observations we may employ what is called the studentized deleted residual

$$d_i^* = e_i \left[ \frac{n - 3}{\mathrm{SSE}\,(1 - h_{ii}) - e_i^2} \right]^{1/2}. \tag{16.E.8}$$

(A deleted residual is $d_i = Y_i - \hat{Y}_{i(-i)}$, where $\hat{Y}_{i(-i)}$ is the estimated Y value obtained by deleting $(X_i, Y_i)$ and fitting the regression equation to the remaining $n - 1$ observations. Once $d_i$ has been divided by its estimated standard deviation, we obtain (16.E.8).) Here $|d_i^*|$ is compared to, say, the upper 5% quantile of the t distribution with $n - 3$ degrees of freedom; that is, a $Y_i$ value is considered to be an outlier if $|d_i^*| > t_{0.05,\,n-3}$.
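A minimal sketch of (16.E.8) in plain Python follows, using the sample data given at the end of this exercise. The helper names are assumptions made for illustration. The critical value 1.782 is the tabulated upper 5% point of the t distribution with n − 3 = 12 degrees of freedom and is hard-coded here as an assumption; a different n would need a different table value. The function `deleted_residual_by_refit` computes the deleted residual directly from its definition, by refitting without the ith point.

```python
import math

X = [4, 6, 7, 7, 8, 9, 10, 11, 12, 14, 14, 16, 16, 18, 18]
Y = [18, 13, 1, 4, 2, 8, 10, 9, 12, 13, 16, 15, 20, 4, 18]


def fit_line(x, y):
    """Ordinary least squares fit of Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals_and_leverages(x, y):
    """Return the residuals e_i and the leverages h_ii of (16.E.7)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return e, h


def studentized_deleted(e, h):
    """Apply (16.E.8): d*_i = e_i * sqrt((n - 3) / (SSE*(1 - h_ii) - e_i^2))."""
    n = len(e)
    sse = sum(ei ** 2 for ei in e)
    return [ei * math.sqrt((n - 3) / (sse * (1 - hi) - ei ** 2))
            for ei, hi in zip(e, h)]


def deleted_residual_by_refit(x, y, i):
    """d_i = Y_i - Yhat_i(-i), obtained by refitting without observation i."""
    b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return y[i] - (b0 + b1 * x[i])


e, h = residuals_and_leverages(X, Y)
d_star = studentized_deleted(e, h)

T_CRIT = 1.782  # assumed table value: upper 5% point of t with 12 df
flagged = [i for i, d in enumerate(d_star) if abs(d) > T_CRIT]

for i, d in enumerate(d_star):
    print(f"({X[i]:2d}, {Y[i]:2d})  e = {e[i]:+7.3f}  d* = {d:+.3f}")
print("0-based indices with |d*| > t:", flagged)
```

The formula avoids n separate refits; the refit helper is included only so that the shortcut can be compared against the definition for any single point.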

(c) Identifying influential data values.
Once it has been discovered that an X and/or Y value is an outlier, it remains to determine whether or not this outlier can be considered as influential, in the sense that its exclusion precipitates major changes in the estimated regression equation. A measure of the influence that the ith data point has on the estimated Y value, $\hat{Y}_i$, is provided by

$$(\mathrm{DFFITS})_i = d_i^* \left[ \frac{h_{ii}}{1 - h_{ii}} \right]^{1/2}, \tag{16.E.9}$$
with this quantity essentially representing the ith studentized deleted residual increased or decreased by a scale factor dependent upon the ith leverage value; it amounts to the number of estimated standard deviations that $\hat{Y}_i$ changes when the ith observation $(X_i, Y_i)$ is removed from the data set. (Note that DF stands for the difference between $\hat{Y}_i$ and $\hat{Y}_{i(-i)}$, where, as defined earlier, $\hat{Y}_{i(-i)}$ is the estimated Y value obtained when the ith data point $(X_i, Y_i)$ is eliminated.) In general, $X_i$ is an influential outlier if:
1. $|(\mathrm{DFFITS})_i| > 1$ for small to medium-sized samples
2. $|(\mathrm{DFFITS})_i| > 2\sqrt{2/n}$ for large samples

A global or overall measure of the combined influence of the ith data point $(X_i, Y_i)$ on all regression coefficients simultaneously is Cook's distance measure

$$D_i = \frac{e_i^2}{2\,\mathrm{MSE}} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right], \tag{16.E.10}$$
which essentially gauges the combined differences in the estimated regression coefficients before and after the ith data point is deleted. The magnitude of $D_i$ is compared to the quantiles of the F distribution with 2 and $n - 2$ degrees of freedom. If $D_i$ falls near or above the 50th percentile of $F_{2,\,n-2}$, we may conclude that the ith sample data point has considerable influence on the estimated regression coefficients.
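The two influence measures (16.E.9) and (16.E.10) can be sketched together in plain Python, again using the sample data given at the end of this exercise. The function names are assumptions made for illustration. To place each $D_i$ on the F scale without a statistics library, the sketch uses the closed-form distribution function of F with 2 numerator degrees of freedom, $1 - (1 + 2x/\nu)^{-\nu/2}$; that shortcut holds only when the numerator degrees of freedom equal 2, as they do here.

```python
import math

X = [4, 6, 7, 7, 8, 9, 10, 11, 12, 14, 14, 16, 16, 18, 18]
Y = [18, 13, 1, 4, 2, 8, 10, 9, 12, 13, 16, 15, 20, 4, 18]


def influence_measures(x, y):
    """Return (dffits, cooks): lists of DFFITS_i (16.E.9) and Cook's D_i (16.E.10)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]      # residuals
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]    # leverages (16.E.7)
    sse = sum(ei ** 2 for ei in e)
    mse = sse / (n - 2)

    dffits, cooks = [], []
    for ei, hi in zip(e, h):
        d_star = ei * math.sqrt((n - 3) / (sse * (1 - hi) - ei ** 2))  # (16.E.8)
        dffits.append(d_star * math.sqrt(hi / (1 - hi)))
        cooks.append(ei ** 2 / (2 * mse) * hi / (1 - hi) ** 2)
    return dffits, cooks


def f_cdf_num_df_2(x, df2):
    """CDF of F with 2 and df2 degrees of freedom (valid only for numerator df = 2)."""
    return 1.0 - (1.0 + 2.0 * x / df2) ** (-df2 / 2.0)


n = len(X)
dffits, cooks = influence_measures(X, Y)
percentiles = [f_cdf_num_df_2(d, n - 2) for d in cooks]

# n = 15 is treated here as a small sample, so the cutoff |DFFITS| > 1 is used.
influential = [i for i, d in enumerate(dffits) if abs(d) > 1.0]

for i in range(n):
    print(f"({X[i]:2d}, {Y[i]:2d})  DFFITS = {dffits[i]:+.3f}  "
          f"D = {cooks[i]:.3f}  F percentile = {100 * percentiles[i]:.1f}%")
print("0-based indices with |DFFITS| > 1:", influential)
```

Reporting the F percentile of each $D_i$, instead of a yes/no flag, leaves the judgment of what counts as "near 50%" to the reader.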

Given the following sample data set, determine if any of the observations is an outlier. (Hint: Graph the n = 15 data points and then apply (16.E.7)–(16.E.10).)
X   4   6   7   7   8   9  10  11  12  14  14  16  16  18  18
Y  18  13   1   4   2   8  10   9  12  13  16  15  20   4  18