STAT 425 Spring 2015 Homework 4 (Posted on March 10; Due Friday March 20)

Please submit your assignment on paper, following the Guidelines for Homework posted at the course website. (Even if correct, answers might not receive credit if they are too difficult to read.) Remember to include relevant computer output.

1. We investigate a simple variable selection problem via the mean square error (MSE). In general, if we estimate a parameter \theta using a statistic T, then the mean square error of T is given by

   MSE(T) = E[(T - \theta)^2] = \{E(T) - \theta\}^2 + \mathrm{var}(T).

Consider the following two regression models:

   y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i,  i = 1, \dots, n;   (1)

   y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i,  i = 1, \dots, n.   (2)

We assume that model (1) is the correct one: it holds with uncorrelated errors having mean zero and constant variance \sigma^2. Model (2) is considered to be oversimplified because it leaves out the variable X_2. Denote the LS estimators for the two models as follows:

   \hat\beta_0, \hat\beta_1, \hat\beta_2: least squares estimates for Model (1);
   \tilde\beta_0, \tilde\beta_1: least squares estimates for Model (2).

In addition, the following notation will be useful:

   S_{X_1X_1} = \sum_{i=1}^n (x_{1i} - \bar x_1)^2,   S_{X_2X_2} = \sum_{i=1}^n (x_{2i} - \bar x_2)^2,
   S_{X_1X_2} = \sum_{i=1}^n (x_{1i} - \bar x_1)(x_{2i} - \bar x_2),   and   r_{12} = S_{X_1X_2} / \sqrt{S_{X_1X_1} S_{X_2X_2}}.

We will compare the mean square errors of \hat\beta_1 and \tilde\beta_1 as estimators of \beta_1 in Model (1).

(a) Using the results of Homework 3 problem 1 or otherwise, show that MSE(\hat\beta_1) = \frac{1}{1 - r_{12}^2} \cdot \frac{\sigma^2}{S_{X_1X_1}}.

(b) Assuming Model (1) is correct, show that E(\tilde\beta_1) = \beta_1 + \frac{S_{X_1X_2}}{S_{X_1X_1}} \beta_2 = \beta_1 + r_{12} \sqrt{S_{X_2X_2}/S_{X_1X_1}} \, \beta_2.

(c) Assuming Model (1) is correct, show that \mathrm{var}(\tilde\beta_1) = \sigma^2 / S_{X_1X_1}.

(d) Assuming Model (1) is correct, show that MSE(\tilde\beta_1) < MSE(\hat\beta_1) whenever \beta_2^2 < \mathrm{var}(\hat\beta_2) = \frac{\sigma^2}{(1 - r_{12}^2) S_{X_2X_2}}.

2. Use the seatpos data set with hipcenter as the response and only the variables Age, Weight, Ht, and Leg as possible predictors. Implement the following variable selection methods to determine a model. In each case, (i) show appropriate R output, and (ii) list the independent variables in the final model.

(a) Forward selection (use Fin = 3)
(b) Backward elimination (use Fout = 3)
(c) Selection with the R function leaps, according to minimum Cp

3. The trees data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees.

(a) Fit a model with log(Volume) as the response and a second-order polynomial (including the interaction term) in Girth and Height. Show a summary of the results.
(b) In the above second-order polynomial model, which term(s) appear to be significant at the 0.05 level? (Use the summary from the previous part.)
(c) Determine whether the model may be reasonably simplified.
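For problems 2 and 3, the lecture notes that follow demonstrate the relevant machinery on the highway data. As a quick orientation only (a sketch of the kind of R calls problem 2 involves, not a worked solution), something like the following could be used; it assumes the seatpos data set from the faraway package and the leaps package are available.

library(faraway)   # seatpos lives here
library(leaps)

# (a)/(b): add1 and drop1 report the F statistics used with Fin = Fout = 3
small <- lm(hipcenter ~ 1, data = seatpos)
full  <- lm(hipcenter ~ Age + Weight + Ht + Leg, data = seatpos)
add1(small, ~ Age + Weight + Ht + Leg, test = "F")   # a first forward-selection step
drop1(full, test = "F")                              # a first backward-elimination step

# (c): best subset of each size, compared by Mallows' Cp
x <- model.matrix(hipcenter ~ Age + Weight + Ht + Leg - 1, data = seatpos)
leaps(x, seatpos$hipcenter, nbest = 1)$Cp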
STAT 425 Lecture Notes: Model Selection (Spring 2015)

Consider two models (with the usual assumptions):

   Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon   (1)
   Y = \beta_0 + \beta_1 X_1 + \epsilon                 (2)

Model (2) is obtained by dropping X_2 from model (1). Let's assume that model (1) is fully correct, and examine how least squares estimation of its \beta_1 is affected by including or omitting X_2 (i.e. by using model (1) versus using model (2)).

For simplicity, assume that X_1 and X_2 have been scaled so that

   \sum_{i=1}^n (x_{i1} - \bar x_1)^2 = \sum_{i=1}^n (x_{i2} - \bar x_2)^2 = 1.

The parameter of interest is the \beta_1 in model (1). Let the least squares estimates be

   \hat\beta_1^{(1)} = LS estimate of \beta_1 using model (1)
   \hat\beta_1^{(2)} = LS estimate of \beta_1 using model (2)

If model (1) is used (so X_2 is included), then, of course, \hat\beta_1^{(1)} will be unbiased:

   E(\hat\beta_1^{(1)}) = \beta_1.

Also, it can be shown that

   \mathrm{var}(\hat\beta_1^{(1)}) = \frac{\sigma^2}{1 - r_{12}^2},

where r_{12} is the (sample) correlation between X_1 and X_2. We will assume |r_{12}| < 1.

Now assume model (2) is used (X_2 is omitted). Keep in mind that we still want to estimate the \beta_1 of model (1). If the relationship between X_1 and Y ignoring X_2 is not the same as the relationship between X_1 and Y accounting for X_2, then \hat\beta_1^{(2)} will be biased, when regarded as an estimate of the \beta_1 of model (1). In fact, if model (1) is correct but model (2) is used, it can be shown that

   E(\hat\beta_1^{(2)}) = \beta_1 + r_{12}\beta_2   and   \mathrm{var}(\hat\beta_1^{(2)}) = \sigma^2

(where \sigma^2 is the error variance in model (1)).

Which of \hat\beta_1^{(1)} and \hat\beta_1^{(2)} is the better estimate of \beta_1? Is it better to use the correct model (1)? Or is it better to use the possibly incorrect but simplified model (2)? One way to decide is to examine which estimator has the smaller mean square error:

   MSE(\hat\beta_1) = E[(\hat\beta_1 - \beta_1)^2] = variance + bias^2.

Model (2), even if it is incorrect, could be regarded as better for estimating \beta_1 if \hat\beta_1^{(2)} has a smaller MSE than \hat\beta_1^{(1)}.

If model (1) is used,

   MSE(\hat\beta_1^{(1)}) = \frac{\sigma^2}{1 - r_{12}^2}.

If model (2) is used,

   MSE(\hat\beta_1^{(2)}) = \sigma^2 + (r_{12}\beta_2)^2.

After some algebra, we see that MSE(\hat\beta_1^{(2)}) < MSE(\hat\beta_1^{(1)}) when r_{12} \ne 0 and |\beta_2|/\sigma < 1/\sqrt{1 - r_{12}^2}. Surprisingly, model (2) will always be better when |\beta_2| < \sigma, and it will also be better when r_{12} is sufficiently close to 1.

Conclusion: When using least squares, sometimes using a simpler model for estimation is better (yields smaller MSE), even if that model is incorrect! This is especially true when the simpler model we use omits terms with small coefficients. It thus makes sense to select and use only a few of the available X variables: those few that we think will substantially contribute to the model. But we have to make the selection carefully ...
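This trade-off is easy to check by simulation. The following is a minimal sketch (not part of the original notes); the sample size, coefficients, and correlation are illustrative choices, picked so that \beta_2 is small relative to \sigma, which is the situation in which the reduced model should tend to win.

set.seed(425)
n <- 50; beta1 <- 2; beta2 <- 0.5; sigma <- 1
x1 <- rnorm(n)
x2 <- 0.6 * x1 + sqrt(1 - 0.6^2) * rnorm(n)   # corr(x1, x2) roughly 0.6
est <- replicate(5000, {
  y <- 1 + beta1 * x1 + beta2 * x2 + rnorm(n, sd = sigma)
  c(full = coef(lm(y ~ x1 + x2))["x1"],     # estimate of beta1 under model (1)
    reduced = coef(lm(y ~ x1))["x1"])       # estimate of beta1 under model (2)
})
rowMeans((est - beta1)^2)   # empirical MSE of each estimator of beta1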
Model Selection Techniques

We will consider three ways of selecting variables for inclusion in multiple regression models.

Selection Based on Theory

Knowledge of the scientific background, when available, should inform the selection of variables. Sometimes variables must be selected prior to data collection, as in designed experiments. The researcher must know the most critical variables to include and will often have some idea of which multiple regression models will be considered, before examining any data.

Stepwise Methods

Some methods search for subsets of predictors by sequentially adding or deleting variables. We consider three such methods:

Forward Selection (FS): A model is chosen by sequentially adding one variable at a time according to a set of rules until a stopping criterion is met.
Backward Elimination (BE): A model is chosen by sequentially deleting one variable at a time according to a set of rules until a stopping criterion is met.
Stepwise (SW): At each stage, a variable is either added, deleted, or interchanged with another variable, according to a set of rules, until a stopping criterion is met.

Criterion Based Selection

A statistic is chosen that describes a desirable and quantifiable property of a regression model. The subset of predictors that results in the best value of this statistic is then chosen as the final model, subject to possible substantive modifications.

Highway Accident Data

> library(alr3)   # has the highway data set
> help(highway)
...
Description: The data come from an unpublished master's paper by Carl Hoffstedt. They relate the automobile accident rate, in accidents per million vehicle miles, to several potential terms. The data include 39 sections of large highways in the state of Minnesota in 1973. The goal of this analysis was to understand the impact of design variables, Acpts, Slim, Sig, and Shld, that are under the control of the highway department, on accidents.
...

ADT   average daily traffic count in thousands
Trks  truck volume as a percent of the total volume
Lane  total number of lanes of traffic
Acpt  number of access points per mile
Sigs  number of signalized interchanges per mile
Itg   number of freeway-type interchanges per mile
Slim  speed limit in 1973
Len   length of the highway segment in miles
Lwid  lane width, in feet
Shld  width in feet of outer shoulder on the roadway
Hwy   an indicator of the type of roadway or the source of funding for the road: 0 if MC, 1 if FAI, 2 if PA, 3 if MA
Rate  1973 accident rate per million vehicle miles

Rate will be the dependent variable. Since Hwy is not really a numerical variable (because it defines categories), we exclude it from the analysis.

> cormat <- round(cor(highway[,-11]),2)   # don't need Hwy (11th variable)
> cormat
       ADT  Trks  Lane  Acpt  Sigs   Itg  Slim   Len  Lwid  Shld  Rate
ADT   1.00 -0.10  0.82 -0.22  0.15  0.90  0.24 -0.27  0.13  0.46 -0.03
Trks -0.10  1.00 -0.15 -0.36 -0.45 -0.07  0.30  0.50 -0.16  0.01 -0.51
Lane  0.82 -0.15  1.00 -0.21  0.25  0.70  0.26 -0.20  0.10  0.48 -0.03
Acpt -0.22 -0.36 -0.21  1.00  0.50 -0.20 -0.68 -0.24 -0.04 -0.42  0.75
Sigs  0.15 -0.45  0.25  0.50  1.00  0.07 -0.41 -0.32  0.04 -0.13  0.56
Itg   0.90 -0.07  0.70 -0.20  0.07  1.00  0.24 -0.25  0.10  0.38 -0.02
Slim  0.24  0.30  0.26 -0.68 -0.41  0.24  1.00  0.19  0.10  0.69 -0.68
Len  -0.27  0.50 -0.20 -0.24 -0.32 -0.25  0.19  1.00 -0.31 -0.10 -0.47
Lwid  0.13 -0.16  0.10 -0.04  0.04  0.10  0.10 -0.31  1.00 -0.04 -0.01
Shld  0.46  0.01  0.48 -0.42 -0.13  0.38  0.69 -0.10 -0.04  1.00 -0.39
Rate -0.03 -0.51 -0.03  0.75  0.56 -0.02 -0.68 -0.47 -0.01 -0.39  1.00

First, we'll consider stepwise approaches to find a good subset of predictors. We'll begin by describing forward selection (FS).

Forward Selection

Step 1: Begin with no variables (just an intercept).
Step 2: Add the variable to the model that will have the greatest statistical significance (i.e. smallest p-value), given the variable(s) already included in the model (if any).
Repeat until the stopping rule is satisfied.

Three common stopping rules for FS:

FS.1 Stop with a subset of predetermined size p.
FS.2 Stop if the F-test statistic for each of the variables not yet entered would be less than some predetermined value, say Fin (alternatively, if its p-value is greater than some predetermined value).
FS.3 Stop when the next predictor would make the set of predictors too collinear, according to some measure of collinearity.

Backward Elimination (BE) is similar, but we start with the full model and remove variables step by step.

Backward Elimination

Step 1: Begin with the full model in all available variables.
Step 2: Remove the variable that has the lowest statistical significance (smallest F-statistic or largest p-value).
Continue until the stopping rule is satisfied.

Common stopping rules for BE include:

BE.1 Stop with a subset of predetermined size p.
BE.2 Stop if the F-test statistic for all variables in the model is bigger than some number Fout.

Let's try forward selection and backward elimination with the highway data, using an F value of 3 to define the stopping rule. We'll use functions called add1 and drop1 that list the consequences of adding and dropping variables. First try forward selection ...
> highwaymod <- lm(Rate ~ 1, data=highway)   # start with intercept only
> indep.vars <- ~ ADT + Trks + Lane + Acpt + Sigs + Itg + Slim + Len + Lwid + Shld
> add1(highwaymod, indep.vars, test="F")
Single term additions

Model:
Rate ~ 1
       Df Sum of Sq     RSS    AIC F value    Pr(>F)
<none>              149.886 54.506
ADT     1     0.122 149.764 56.474  0.0302 0.8629277
Trks    1    39.372 110.514 44.622 13.1817 0.0008505 ***
Lane    1     0.163 149.723 56.464  0.0403 0.8420245
Acpt    1    84.767  65.119 23.994 48.1636 3.408e-08 ***
Sigs    1    47.759 102.127 41.543 17.3029 0.0001817 ***
Itg     1     0.092 149.794 56.482  0.0228 0.8806802
Slim    1    69.508  80.378 32.204 31.9962 1.833e-06 ***
Len     1    32.449 117.437 46.991 10.2237 0.0028385 **
Lwid    1     0.005 149.881 56.505  0.0012 0.9729162
Shld    1    22.438 127.449 50.182  6.5139 0.0149650 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . + Acpt)   # Acpt has biggest F

> add1(highwaymod, indep.vars, test="F")
Single term additions

Model:
Rate ~ Acpt
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              65.119 23.994
ADT     1    3.0871 62.032 24.099  1.7916 0.189124
Trks    1   10.0532 55.066 19.454  6.5724 0.014678 *
Lane    1    2.4108 62.708 24.522  1.3840 0.247143
Sigs    1    7.1603 57.959 21.451  4.4475 0.041975 *
Itg     1    2.4664 62.653 24.488  1.4172 0.241655
Slim    1    7.9430 57.176 20.920  5.0012 0.031618 *
Len     1   12.9806 52.139 17.323  8.9627 0.004957 **
Lwid    1    0.1013 65.018 25.933  0.0561 0.814118
Shld    1    0.8293 64.290 25.494  0.4644 0.499945
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . + Len)

> add1(highwaymod, indep.vars, test="F")
Single term additions

Model:
Rate ~ Acpt + Len
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              52.139 17.323
ADT     1    0.3062 51.832 19.094  0.2067 0.6521
Trks    1    2.9834 49.155 17.025  2.1242 0.1539
Lane    1    0.3814 51.757 19.037  0.2579 0.6148
Sigs    1    3.4703 48.668 16.637  2.4957 0.1232
Itg     1    0.2262 51.912 19.154  0.1525 0.6985
Slim    1    7.2920 44.847 13.448  5.6910 0.0226 *
Lwid    1    0.8546 51.284 18.679  0.5833 0.4502
Shld    1    3.2651 48.873 16.801  2.3383 0.1352
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . + Slim)

> add1(highwaymod, indep.vars, test="F")
Single term additions

Model:
Rate ~ Acpt + Len + Slim
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              44.847 13.448
ADT     1   0.93309 43.913 14.628  0.7224 0.4013
Trks    1   2.40652 42.440 13.297  1.9279 0.1740
Lane    1   1.30168 43.545 14.299  1.0164 0.3205
Sigs    1   2.51322 42.333 13.198  2.0185 0.1645
Itg     1   0.87112 43.975 14.683  0.6735 0.4175
Lwid    1   0.38786 44.459 15.109  0.2966 0.5896
Shld    1   0.01982 44.827 15.431  0.0150 0.9031

All F values are less than 3, so the stopping rule is met. The final model includes Acpt, Len, and Slim.
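The add1/update cycle above can also be wrapped in a small loop. The following is a rough sketch (not from the notes) of forward selection with an F-to-enter threshold of 3; it assumes highway and indep.vars are defined as earlier, and the object name fsmod is just an illustrative choice.

fsmod <- lm(Rate ~ 1, data = highway)
repeat {
  tab <- add1(fsmod, indep.vars, test = "F")
  tab <- tab[rownames(tab) != "<none>", ]          # candidate terms only
  if (nrow(tab) == 0 || max(tab[["F value"]]) < 3) break
  best <- rownames(tab)[which.max(tab[["F value"]])]
  fsmod <- update(fsmod, as.formula(paste(". ~ . +", best)))
}
formula(fsmod)   # should end up as Rate ~ Acpt + Len + Slim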
Now try using the backward elimination (BE) method ...

> highwaymod <- lm(Rate ~ . - Hwy, data=highway)   # start with all vars
> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ (ADT + Trks + Lane + Acpt + Sigs + Itg + Slim + Len + Lwid + Shld + Hwy) - Hwy
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              38.994 21.994
ADT     1    0.0131 39.007 20.007  0.0094 0.923436
Trks    1    1.7899 40.784 21.744  1.2853 0.266539
Lane    1    0.0296 39.023 20.023  0.0213 0.885132
Acpt    1   11.8141 50.808 30.315  8.4833 0.006965 **
Sigs    1    0.9652 39.959 20.948  0.6931 0.412154
Itg     1    0.0431 39.037 20.037  0.0310 0.861584
Slim    1    1.3365 40.330 21.308  0.9597 0.335657
Len     1    5.2726 44.266 24.940  3.7861 0.061782 .
Lwid    1    0.9119 39.906 20.895  0.6548 0.425231
Shld    1    0.9680 39.962 20.950  0.6951 0.411503
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - ADT)   # ADT has least F

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Trks + Lane + Acpt + Sigs + Itg + Slim + Len + Lwid + Shld
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              39.007 20.007
Trks    1    1.7771 40.784 19.744  1.3212 0.259763
Lane    1    0.0780 39.085 18.085  0.0580 0.811437
Acpt    1   11.9079 50.915 28.397  8.8530 0.005847 **
Sigs    1    0.9786 39.986 18.973  0.7276 0.400661
Itg     1    0.2582 39.265 18.264  0.1920 0.664512
Slim    1    1.4501 40.457 19.430  1.0781 0.307708
Len     1    5.3262 44.333 22.999  3.9598 0.056102 .
Lwid    1    0.8988 39.906 18.895  0.6682 0.420343
Shld    1    0.9652 39.972 18.960  0.7176 0.403879
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Lane)

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Trks + Acpt + Sigs + Itg + Slim + Len + Lwid + Shld
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              39.085 18.085
Trks    1    1.8331 40.918 17.872  1.4070 0.244868
Acpt    1   11.9551 51.040 26.493  9.1763 0.005007 **
Sigs    1    1.3821 40.467 17.440  1.0608 0.311255
Itg     1    0.7717 39.857 16.847  0.5924 0.447528
Slim    1    1.5009 40.586 17.554  1.1521 0.291676
Len     1    5.2498 44.335 21.000  4.0295 0.053793 .
Lwid    1    0.8615 39.946 16.935  0.6613 0.422525
Shld    1    0.8874 39.972 16.960  0.6811 0.415708
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Itg)

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Trks + Acpt + Sigs + Slim + Len + Lwid + Shld
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              39.857 16.847
Trks    1    1.7312 41.588 16.506  1.3465 0.254752
Acpt    1   11.3358 51.192 24.609  8.8169 0.005715 **
Sigs    1    1.7001 41.557 16.476  1.3223 0.258977
Slim    1    1.5121 41.369 16.300  1.1761 0.286510
Len     1    6.3017 46.158 20.572  4.9014 0.034323 *
Lwid    1    0.7972 40.654 15.620  0.6200 0.437015
Shld    1    0.5989 40.456 15.429  0.4659 0.499970
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Shld)

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Trks + Acpt + Sigs + Slim + Len + Lwid
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              40.456 15.429
Trks    1    1.4809 41.936 14.831  1.1714 0.287206
Acpt    1   11.5064 51.962 23.191  9.1014 0.004976 **
Sigs    1    1.4984 41.954 14.847  1.1852 0.284426
Slim    1    5.6245 46.080 18.506  4.4489 0.042838 *
Len     1    5.7031 46.159 18.572  4.5111 0.041500 *
Lwid    1    0.4727 40.928 13.882  0.3739 0.545210
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Lwid)

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Trks + Acpt + Sigs + Slim + Len
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              40.928 13.882
Trks    1    1.4051 42.333 13.198  1.1329 0.294883
Acpt    1   11.6253 52.554 21.633  9.3734 0.004356 **
Sigs    1    1.5118 42.440 13.297  1.2189 0.277556
Slim    1    6.0861 47.014 17.289  4.9072 0.033758 *
Len     1    5.2305 46.159 16.573  4.2173 0.048007 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Trks)

> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Acpt + Sigs + Slim + Len
       Df Sum of Sq    RSS    AIC F value   Pr(>F)
<none>              42.333 13.198
Acpt    1   12.5355 54.869 21.314 10.0679 0.003195 **
Sigs    1    2.5132 44.847 13.448  2.0185 0.164500
Slim    1    6.3349 48.668 16.637  5.0879 0.030641 *
Len     1    9.1881 51.521 18.859  7.3794 0.010299 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> highwaymod <- update(highwaymod, . ~ . - Sigs)
> drop1(highwaymod, test="F")
Single term deletions

Model:
Rate ~ Acpt + Slim + Len
       Df Sum of Sq    RSS    AIC F value    Pr(>F)
<none>              44.847 13.448
Acpt    1    17.744 62.591 24.449 13.8483 0.0006933 ***
Slim    1     7.292 52.139 17.323  5.6910 0.0225972 *
Len     1    12.330 57.176 20.920  9.6225 0.0037874 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now all F values are at least 3, so the stopping rule is met. The final model includes the variables Acpt, Slim, and Len. (This happens to be the same model found by forward selection, but in general the two methods could produce different final models.)

Stepwise Selection

The stepwise (SW) algorithm begins as in forward selection with no X variables (only an intercept). Then at each step it considers four alternatives: add a variable, delete a variable, exchange two variables, or stop.

The rules for SW:

SW.1 If there are at least two variables in the model, and one or more has F value less than Fout, remove the variable with the smallest F value.
SW.2 [Optional] If there are two or more variables in the model, remove the one with the smallest F value if its removal results in a value of R^2 that is larger than an R^2 previously obtained for the same number of variables.
SW.3 [Optional] If two or more variables are in the model, exchange one of them with a variable not in the model if the exchange increases R^2.
SW.4 Add a variable to the model if it has the highest F value, as in FS, provided the F value is greater than Fin.

Generally, we terminate the stepwise algorithm when, after applying all of the rules, the resulting model is no different from one that has already been considered. Stepwise selection (without the optional rules) can be performed in R by iteratively using both the add1 and drop1 functions. (Application to the highway data is not demonstrated here, but its results would be equivalent to forward selection in this case.)

Criterion Based Selection

Criterion based selection involves identifying a statistic that describes the quality of a subset of predictors, and then finding the subset that optimizes this statistic. One reasonable approach is to define a statistic that measures the mean square error of estimating E(Y) at a set of points of interest (such as the X variable combinations in the data). (A subset model might produce biased estimates, so we use mean square error rather than merely variance.)

Define J_p for a given model with p predictors as follows:

   J_p = \frac{1}{\sigma^2} \sum_{i=1}^n MSE_p(\hat y_i)

Several estimates of J_p have been proposed, and the one that seems to be most popular is Mallows' Cp:

   C_p = \frac{RSS_p}{\hat\sigma^2} + 2p - n

where \hat\sigma^2 is obtained from the full model, and RSS_p is the RSS from the p-predictor model. One method: choose the subset of predictors that minimizes Cp.

Properties of Cp

1. Cp depends only on the usual statistics RSS_p, \hat\sigma^2, p, and n.
2. Cp has a random term RSS_p/\hat\sigma^2 that penalizes lack of fit, and a fixed term 2p - n that penalizes for including too many predictors.
3. For a subset model, if all of the predictors that are left out have coefficients near 0, the expected value of Cp is approximately equal to p.

The R function leaps returns the best model (according to Cp) of each possible size.

> library(leaps)
> x <- model.matrix(Rate ~ . - 1 - Hwy, data=highway)   # no intercept here
> y <- highway$Rate
> bestmods <- leaps(x, y, nbest=1)   # intercept automatically added here
> bestmods
$which
       1     2     3    4     5     6     7     8     9     A
1  FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
2  FALSE FALSE FALSE TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
3  FALSE FALSE FALSE TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
4  FALSE FALSE FALSE TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
5  FALSE  TRUE FALSE TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
6  FALSE  TRUE FALSE TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
7  FALSE  TRUE FALSE TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
8   TRUE  TRUE FALSE TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
9  FALSE  TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
10  TRUE  TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
...
$Cp
 [1] 11.759635  4.438763  1.202634  1.397983  2.389045  4.023979  5.619545
 [8]  7.046418  9.009404 11.000000

The function Cpplot, from the faraway package, plots Cp versus p, using all of the models produced by leaps:

> library(faraway)
> Cpplot(bestmods)

It also plots the Cp = p line: models near or below the line are adequate (but some may have too many variables).

[Cp plot omitted: Cp plotted against p for the models returned by leaps, with the Cp = p reference line; each point is labeled by the indices of the predictors in that model.]

> colnames(x)[c(4,7,8)]
[1] "Acpt" "Slim" "Len"

So the best (smallest Cp) model was the same as the one found by forward selection and backward elimination.
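To make the Cp formula concrete, here is a small sketch (not from the notes) computing Cp by hand for the Acpt + Slim + Len subset, under the convention that p counts all estimated coefficients including the intercept, which is how leaps appears to count model size here. It should reproduce, up to rounding, the Cp of about 1.20 reported above.

fullmod <- lm(Rate ~ . - Hwy, data = highway)
submod  <- lm(Rate ~ Acpt + Slim + Len, data = highway)
sigma2  <- summary(fullmod)$sigma^2   # error variance estimated from the full model
RSSp    <- sum(resid(submod)^2)       # RSS of the subset model
p       <- length(coef(submod))       # 4 = intercept + 3 predictors
n       <- nrow(highway)
RSSp / sigma2 + 2 * p - n             # Mallows' Cp for this subset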
You might have noticed a statistic called Adjusted R^2 on your R output. Like R^2, this statistic measures the tightness of the fit, except that it adjusts for the number of predictors in the model. (Unlike R^2, adding a predictor does not always increase it.) Let \bar R^2 denote adjusted R^2:

   \bar R^2 = 1 - \frac{n-1}{n-p} (1 - R^2)

An alternative to using add1 and drop1 is to use the step function. It can do forward selection, backward elimination, or stepwise selection, but it uses a criterion called AIC (which gives results similar to Cp in linear regression with normal errors). Let's look at a backward elimination example using step:

> backstep <- step(lm(Rate ~ . - Hwy, data=highway), direction="backward")
...
> summary(backstep)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.81443    2.60435   3.385  0.00181 **
Acpt         0.08940    0.02818   3.173  0.00319 **
Sigs         0.48538    0.34164   1.421  0.16450
Slim        -0.09599    0.04255  -2.256  0.03064 *
Len         -0.06856    0.02524  -2.717  0.01030 *
...

Remarks: When using variable selection on polynomial models, you should respect the hierarchy of the terms: if a higher-degree term is included in the model, then every lower-degree term it contains must also be included in the model. For example, if an interaction term like X_1^2 X_2 is included, then the model must also include terms for X_1, X_2, X_1^2, and X_1 X_2, even if they don't appear to be significant.
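The hierarchy remark connects directly to problem 3 of the homework. As an illustration only (one of several equivalent ways to write such a model, not a worked solution), a full second-order fit in Girth and Height for the built-in trees data could be set up as follows; whichever terms survive a later simplification, the hierarchy principle says their lower-order terms stay in the model.

# Second-order polynomial in Girth and Height, with interaction, for log(Volume)
quad <- lm(log(Volume) ~ Girth + Height + I(Girth^2) + I(Height^2) +
             I(Girth * Height), data = trees)
summary(quad)
# A candidate simplification would then be compared with the full fit, e.g.
# anova(smaller_model, quad), where smaller_model is whatever reduced fit is proposed.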