COVER PAGE
STAT 608 Homework 05, Summer 2017
Please TYPE your name and email address below, then convert to PDF and attach as the first page of your homework upload.
NAME:
EMAIL:
HOMEWORK NUMBER:

STATISTICS 608 Homework 608 S17 05
Due: 11:59 PM, July 8, 2017

Question 1 [4]
In Exercise 3, page 105 in the textbook you were instructed to use a logarithmic transformation for Ad Revenue and a square root transformation for Circulation. What would be the appropriate transformations according to the Box-Cox technique? (You are not required to implement these transformations.)

Question 2 [6]
This question is related to Exercise 5 on page 224 in the textbook. Assume that log(Y) is an appropriate transformation of the response variable. Use marginal model plots to evaluate the fit of the full seven-covariate model. Describe briefly any weaknesses in the model that these plots reveal.

Question 3 [1+4+4+2]
In the simple linear regression model $y_j = \beta_0 + \beta_1 x_j + e_j$, $j = 1, \dots, n$, the predicted values are defined by $\hat{y}_j = \hat{\beta}_0 + \hat{\beta}_1 x_j$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the least squares estimators of $\beta_0$ and $\beta_1$.

3.1 Show that the mean of the y-values, $\bar{y}$, equals the mean of the predicted values, $\bar{\hat{y}}$.

3.2 Show that
$$SS_{reg} = \sum_i (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y}), \qquad \text{where } \bar{\hat{y}} = n^{-1} \sum_i \hat{y}_i.$$

3.3 Hence, show that the statistic $R^2 = SS_{reg}/SST$ equals the square of the Pearson correlation coefficient between the pairs $(y_j, \hat{y}_j)$, $j = 1, \dots, n$. [Hint: $ab = \frac{ab}{a}\, a$ for $a, b \neq 0$.]

3.4 It can be shown (you don't have to) that the result in Question 3.2 is also true in a general linear model setup with m (< n) covariates. Suppose m = 20 and that you attempt to select the best subset of covariates by maximizing $R^2$. How many covariates will be in your best subset? Justify your answer.

+ Stat 608 Chapter 7: Variable Selection

+ Introduction
Overspecified model (contains irrelevant predictors):
MSE: fewer degrees of freedom for error.
Standard errors for regression coefficients are inflated.
Thus: larger p-values and wider confidence intervals.
Underspecified model (too few predictors):
Regression coefficients, and thus predictions, are biased.
Arguably worse than an overspecified model.

+ Introduction
Problems with multicollinearity:
Even when the model is significant, it's possible that no individual predictors are significant.
Slopes may have the wrong sign.
Predictors that explain substantial variation in y may be insignificant.

+ Example: Bridge data
Bridge model, predicting log(Time):
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.28590    0.61926   3.691 0.000681 ***
log(DArea)   -0.04564    0.12675  -0.360 0.720705
log(CCost)    0.19609    0.14445   1.358 0.182426
log(Dwgs)     0.85879    0.22362   3.840 0.000440 ***
log(Length)  -0.03844    0.15487  -0.248 0.805296
log(Spans)    0.23119    0.14068   1.643 0.108349

+ Confidence interval: deck area
log(DArea)   -0.04564    0.12675  -0.360 0.720705
qt(0.975, 39)
[1] 2.022691
(A short R sketch reproducing this interval appears at the end of this section.)

+ Hypothesis test: deck area
log(DArea)   -0.04564    0.12675  -0.360 0.720705

+ Introduction
Goal: choose the best model using variable selection methods.
Start by considering the full model containing all m potential predictor variables. Variable selection methods choose the subset of predictors that is "best".
Overfitting: including too many predictors (the model performs as well as or worse than simpler models at predicting new data).
Underfitting: including too few predictors (the model doesn't perform as well as models with more predictors).

+ Introduction
If the goal is interpretation, simpler models are usually preferred; use a method that chooses smaller models.
If the goal is prediction, more variables may be acceptable.
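Before turning to the selection algorithms, here is a minimal R check of the deck-area confidence interval shown above, built only from the numbers printed in the summary output (the objects est, se, tcrit are our own names):

# 95% CI for the log(DArea) coefficient, from the printed summary output
est   <- -0.04564                 # estimate
se    <-  0.12675                 # standard error
tcrit <- qt(0.975, 39)            # 2.022691, with 39 residual degrees of freedom
est + c(-1, 1) * tcrit * se       # the interval straddles 0, matching the large p-value (0.72)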
+ Forward, Backward, and Stepwise Subsets
If there are m variables, there are 2^m possible regression equations. If m is small enough, run all of them (all possible subsets).
Backward, Forward, and Stepwise selection procedures examine only some of the 2^m possible regression equations.
Backward elimination:
1. All variables are included in the model. The predictor with the largest p-value is deleted (as long as it isn't significant).
2. The remaining m − 1 variables are now in the model. Again, the predictor with the largest p-value is deleted (as long as it isn't significant).
3. Variables are deleted until all remaining variables are significant.

+ Backward selection (α = 0.05)
[diagram] Model 1: the full model with X1, X2, X3, X4, X5.

+ Backward selection (α = 0.05)
[diagram] Model 2: the full model minus the variable with the largest p-value (X5 removed), leaving X1, X2, X3, X4.

+ Backward selection (α = 0.05)
[diagram] Model 3, the final model: X2, X3, X4.

+ Forward, Backward, and Stepwise Subsets
Forward selection:
1. No variables are in the model. All m models with only one predictor are run. The predictor with the smallest p-value is entered into the model (as long as it is significant). Call this variable x1.
2. All models with predictors x1 and only one other predictor are run; of the remaining predictors x2, ..., xm, the one with the smallest p-value is entered (as long as it is significant).
3. Variables are entered until no more predictors are significant, given the others already in the model.

+ Forward selection (α = 0.05)
[diagram] Step 1: enter the first variable. Models 1-5 each contain one of X1, ..., X5; Model 3 (X3) has the smallest p-value.

+ Forward selection (α = 0.05)
[diagram] Step 2: enter the second variable. Models with (X3, X1), (X3, X2), (X3, X4), (X3, X5); Model 1 (X3, X1) has the smallest p-value.

+ Forward selection (α = 0.05)
[diagram] Step 3: models with (X3, X1, X2), (X3, X1, X4), (X3, X1, X5); no more variables are significant.

+ Stepwise Subsets
Stepwise selection procedure:
1. Choose αE and αR, significance levels to Enter and Remove predictors.
2. Forward step: No variables are in the model. All models with one predictor are run. The predictor with the smallest p-value is entered into the model, as long as the p-value is less than αE. Call this variable x1.
3. Forward step: All models with predictors x1 and only one other predictor are run; of the remaining predictors x2, ..., xp, the one with the smallest p-value is entered, as long as the p-value is less than αE.
4. Backward step: Check that the p-value for variable x1 is smaller than αR. If not, remove it. If so, leave it in.
5. Take another forward step, attempting to add a third variable.
6. Continue taking backward and forward steps until adding an additional predictor does not yield a p-value below αE.
Could αE be larger than αR? Vice versa?
Stepwise is a forward selection procedure, except that a variable can be removed once it is in.

+ Forward, Backward, and Stepwise Subsets
These procedures only consider some of the possible subsets of predictors, so they do not necessarily find the model that fits the data best among all possible subsets.
Forward, backward, and stepwise selection may not produce the same final model, though they often do. If the covariances among the predictors are 0, all three produce the same final model.
These methods are prone to overfitting, but stiff criteria for adding or deleting variables can mitigate this problem.
Shouldn't we just remove the insignificant terms all at once?
Chapter 5: F-test for model reduction. Chapter 7: algorithms (not hypothesis tests).
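These algorithms are easy to run in R. A hedged sketch on the bridge example, assuming a data frame named bridge with the columns used earlier (Time, DArea, CCost, Dwgs, Length, Spans); note that base R's step() adds and drops terms by AIC (introduced next), not by raw p-values, but the backward/forward/both mechanics mirror the procedures above:

# Full and null models on the (assumed) bridge data frame
full <- lm(log(Time) ~ log(DArea) + log(CCost) + log(Dwgs) + log(Length) + log(Spans),
           data = bridge)
null <- lm(log(Time) ~ 1, data = bridge)

step(full, direction = "backward")                          # backward elimination
step(null, scope = formula(full), direction = "forward")    # forward selection
step(null, scope = formula(full), direction = "both")       # stepwise selection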
+ Selection Criteria: (1) Adjusted R²
Adding irrelevant predictor variables to the regression model often increases R². To compensate, we adjust for the number of predictors.
Choose the subset of the predictors that has the highest value of R²adj. This is equivalent to choosing the subset of the predictors with the lowest value of MSE (mean square error).

+ Selection Criteria: (2) AIC (Akaike's Information Criterion)
Based on maximum likelihood estimation; R computes AIC as −2 log L + 2 × (number of estimated parameters).
Choose the model that makes AIC as small as possible (by small, we mean close to −∞).
AIC is only meant to compare sub-models to one another or to the full model, not models with different transformations.

+ Selection Criteria: (3) AICc (corrected AIC)
Corrects for bias when n is small or p is large compared to n. (AIC tends to overfit; its penalty for model complexity is not strong enough.)
Converges to AIC as n increases.
Choose the model that makes AICc as small as possible.
IMPORTANT NOTE: the AICc formula given in lecture is correct; the textbook's version on page 231 is not. See www.stat.tamu.edu/~sheather/book/docs/Errata.pdf.

+ Selection Criteria: (4) BIC (Bayesian Information Criterion, aka SBC)
Based on the posterior probability of the model, but often used in a frequentist sense.
Choose the model that makes BIC as small as possible.
BIC is similar to AIC except that 2p is replaced by p log(n). When n ≥ 8, log(n) ≥ 2, so the penalty term for BIC is larger than the penalty term for AIC. BIC favors simpler models than AIC.

+ Selection Criteria: (5) Mallows' Cp
Uses unbiasedness as a criterion for choosing a model; assumes the full model is unbiased.
Choose a model whose Cp value is close to the number of parameters in the model, counting the intercept. (Err on the side of a smaller value of Cp.)
Don't use Cp to choose the full model; Cp always equals p in that case.
If the full model contains a large number of insignificant variables, MSE_full will be inflated (MSE involves the degrees of freedom). Then Cp is not an appropriate criterion for choosing the best model.

+ Comparison of Selection Criteria
Using p-values tends toward extreme over-fitting. (After doing 3 hypothesis tests, the overall alpha increases from 0.05 to about 0.14.)
R²adj tends toward over-fitting.
Pro of AIC and AICc: they are "efficient." Asymptotically, the error in prediction from the model chosen using AIC or AICc is no different from the error from the best model. This is not true of BIC.
Pro of BIC: the probability that it selects the correct model is asymptotically 1. This is not true of AIC.
AIC chooses models that are too complex when n is large; BIC chooses models that are too simple when n is small.

+ Comparison of Selection Procedures
All possible subsets: if the number of predictors in the model is of fixed size p, all four criteria (R²adj, AIC, AICc, BIC) choose the same model. When comparing models with different numbers of predictors, we can get different answers.
Forward, backward, and stepwise: using the information criteria (AIC, BIC) to select a model is equivalent to using p-values to add and remove variables; the difference is where the algorithm stops.

+ Reminders
The regression coefficients obtained after variable selection are biased.
P-values from these models are generally much smaller than their true values.
Software treats each column of the design matrix as completely separate, ignoring relationships in polynomial models and models with interaction terms.
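A brief R sketch computing these criteria for two of the bridge sub-models that appear in the table below. The data frame name bridge and its columns are assumptions carried over from the earlier example, and the hand-computed AICc uses the standard small-sample correction 2k(k+1)/(n − k − 1), not any particular textbook variant:

m2 <- lm(log(Time) ~ log(Dwgs) + log(Spans), data = bridge)
m3 <- lm(log(Time) ~ log(Dwgs) + log(Spans) + log(CCost), data = bridge)

summary(m2)$adj.r.squared; summary(m3)$adj.r.squared   # higher is better
AIC(m2); AIC(m3)                                       # smaller is better
BIC(m2); BIC(m3)                                       # penalty p*log(n) instead of 2p

# AICc by hand (k counts all estimated parameters, including the error variance)
k <- length(coef(m3)) + 1
AIC(m3) + 2 * k * (k + 1) / (nobs(m3) - k - 1)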
+ Bridge Data
Subset size   Predictors                                                    R²adj     AIC      AICc     BIC
1             log(Dwgs)                                                     0.702   -94.90   -94.31   -91.28
2             log(Dwgs), log(Spans)                                         0.753  -102.37  -101.37   -96.95
3             log(Dwgs), log(Spans), log(CCost)                             0.758  -102.41  -100.87   -95.19
4             log(Dwgs), log(Spans), log(CCost), log(DArea)                 0.753  -100.64   -98.43   -91.61
5             log(Dwgs), log(Spans), log(CCost), log(DArea), log(Length)    0.748   -98.71   -95.68   -87.87

+ LASSO
LASSO (Least Absolute Shrinkage and Selection Operator) performs variable selection and parameter estimation simultaneously.
Constrained least squares: minimize the residual sum of squares subject to Σ|βj| ≤ s for some non-negative number s.
When s is very large, this is equivalent to the usual least squares estimates for the model.
When s is small, some of the coefficients are 0, effectively removing them from the model.

+ Assessing the Predictive Ability of Regression Models
Since regression coefficients are biased and p-values are generally much smaller than their true values after variable selection, we need another approach: split the data, and see how well models built on one part predict the other part that was not used to build the model.

+ Assessing the Predictive Ability of Regression Models
[diagram] The full data are split into training data (usually the larger share), used to create the models, and test data, used to compare the models. (A minimal R sketch of such a split appears at the end of this section.)

+ Assessing the Predictive Ability of Regression Models
Ideally, the training and test data sets will be similar with respect to:
univariate distributions of each of the predictors and the response;
multivariate distributions of all variables;
means, variances, and other moments;
outliers.
Usually, splitting the data is done randomly. However, especially in small data sets, the above criteria are not always met.

+ Chapter 8: Logistic Regression

+ Introduction and Setup

+ Linear Models?
Recall: a linear model is one that can be written in matrix form as Y = Xβ + e. That is, we can express y as a linear combination of the parameters and the error term.
Models with transformed response variables are not linear models. For example, log(Y) = β0 + β1x + e can be rewritten as Y = e^(β0 + β1x + e) = e^(β0) e^(β1x) e^(e). (The e in the exponent is the error term; all the others are the exponential function.) The relationship between y and the parameters is nonlinear; note the error term is also multiplicative.
Logistic regression models (this chapter) are also not linear models.

+ Logistic Regression
So far the response variable has been quantitative; in Chapter 8 the response variable is categorical.
X = HSGPA, Y = accepted to TAMU (yes/no).
X = amount of a credit card transaction, Y = fraudulent (Y/N).
Ideally such responses follow a binomial distribution, in which case the appropriate model is a logistic regression model.

+ Logistic Regression
                                          Response quantitative    Response categorical
Explanatory quantitative                  Regression               Logistic regression
Explanatory categorical                   ANOVA                    χ² tests / logistic regression
Explanatory categorical & quantitative    ANCOVA                   Logistic regression

+ Logistic Regression
Why not use our usual linear regression methods, creating a dummy variable for y and fitting a least squares regression line between x and y?

+ Logistic Regression
Why not use the usual line?
1. Possible predictions out of bounds.
2. Nonconstant variance: if the response Y has a Bernoulli (binomial with n = 1) distribution, then E[Y | X = x] = p(x) (the probability of success p is a function of the explanatory variable x: p changes with x) and Var(Y | X = x) = p(x)(1 − p(x)). That is, the variance of Y changes with x when the mean changes.
3. When the response Y has a Bernoulli distribution, the logistic regression model correctly models the mean.
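Returning briefly to the data-splitting idea above, here is a minimal base-R sketch of a random training/test split. The data frame dat, the response y, and the predictors x1, x2 are placeholders, not objects from the lecture:

set.seed(608)                                    # arbitrary seed for reproducibility
n <- nrow(dat)                                   # `dat` is a hypothetical data frame
train_id <- sample(n, size = floor(0.7 * n))     # usually give more rows to the training data
train <- dat[train_id, ]
test  <- dat[-train_id, ]

fit <- lm(y ~ x1 + x2, data = train)             # create the model on the training data
mean((test$y - predict(fit, newdata = test))^2)  # compare models by test-set prediction error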
+ Logistic Regression
How do we turn a binary variable into something continuous? What about using the proportion of successes? What about using the odds of success?

+ From Probability to Log Odds
[figure: log odds plotted against probability]

+ From Probability to Log Odds
Probability   Odds    Log(Odds)
0.1           0.111   -2.2
0.25          0.333   -1.1
0.5           1        0
0.75          3        1.1
0.9           9        2.2
* logit(p) = log odds(p) = log(p / (1 − p))

+ Example: Bird Extinctions on Islands
[data and plots]

+ Example: Bird Extinctions on Islands
Model: X = Area, Y = 1 if extinct, 0 if not extinct. First category in alphabetical order = failure, or 0 = failure.
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.19620    0.11845 -10.099  < 2e-16 ***
xnew        -0.29710    0.05485  -5.416 6.08e-08 ***

+ Example: Bird Extinctions on Islands
If an island has 50 km², what is the estimated probability that a species will go extinct there? The estimated probability that a species will go extinct on an island with 50 km² is 0.086.

+ Example: Bird Extinctions on Islands
[plots of the fitted model]

+ How β0 Affects the Model
[figure: logistic curves with β0 as labeled, β1 = 1]
β0 determines the location of the curve: π = 0.5 when x = −β0/β1.

+ How β1 Affects the Model
[figure: logistic curves with β0 = 0.5, β1 as labeled]
The larger |β1|, the steeper the slope. If β1 > 0, the model has a positive slope. At π = 0.5, the slope of the model predicting probabilities is β1/4.

+ Interpret Slope
A study on cereal attempted to predict the probability that a cereal would be classified as a children's cereal rather than an adults' cereal, based on its grams of sugar per serving.

+ Interpret Slope
Usually, we interpret the slope as the predicted amount of increase or decrease in y for a one-unit increase in x. What happens to the model when x increases by one unit? The log odds increase by β1, so the odds are multiplied by exp(β1).
So the odds of being a children's cereal are predicted to be multiplied by exp(0.158) = 1.17 when one gram of sugar is added to the cereal; the odds are 17% higher. That is, the odds increase by a multiplicative factor of 1.17.

+ Ratios
Ratios larger than 1 have a larger numerator; ratios smaller than 1 have a larger denominator.
The ratio of male to female students at A&M (Spring semester 2016 data) is 28,717 / 26,375 = 1.09. That is, there are about 9% more men than women at A&M.
The ratio of female to male students is 0.92; there are about 92% as many women as men.
As with odds, ratios aren't symmetric.

+ What is an Odds Ratio? Physicians Health Study
           Heart Attack   No HA     Total
Aspirin    139            54,421    54,560
Placebo    239            54,117    54,356
Total      378            108,538   108,916
Define success as having a heart attack.
π̂(Aspirin) = 0.0025, π̂(Placebo) = 0.0044
Odds(Aspirin) = 0.0026, Odds(Placebo) = 0.0044

+ What is an Odds Ratio? Physicians Health Study
Interpret: our model predicts the odds of having a heart attack to be about half as high if male physicians take an aspirin a day than if they don't.
Notice the interpretation as a multiplicative effect rather than an additive effect: the odds of having a heart attack are multiplied by about one-half when taking aspirin.
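A quick numeric check of the odds ratio from the two-by-two table above, using plain base R (the object names are ours):

# Odds of a heart attack in each arm of the Physicians' Health Study
odds_aspirin <- 139 / 54421     # about 0.0026
odds_placebo <- 239 / 54117     # about 0.0044
odds_aspirin / odds_placebo     # odds ratio about 0.58: odds roughly halved on aspirin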
+ Odds Ratios
In a study of a disease in humans, researchers used the independent variable ethnicity to predict the probability of the disease. The model was logit(π) = β0 + β1x1 + β2x2 + β3x3, with the variable encoding scheme:
                     x1   x2   x3
White Non-Hispanic    0    0    0
Black                 1    0    0
Hispanic              0    1    0
Other                 0    0    1

+ Odds Ratios
Below is partial output from the model:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.3863     0.5000  -2.773  0.00556
x1            2.0794     0.6325   3.288  0.00101
x2            1.7918     0.6455   2.776  0.00551
x3            1.3863     0.6708   2.067  0.03878
Report a 95% confidence interval for the odds ratio comparing Hispanic to White. What is the odds ratio for comparing Hispanic to Black?

+ Theory: Part A - Multiple values of y at every value of x

+ 8.1: A sample of size mi at every observed value of xi
Binomial distribution:
1. There are m identical trials.
2. Each trial has only one of two outcomes (Success or Failure).
3. The probability of success is the same for all trials.
4. Trials are independent.
Let Y = number of successes in m trials of a binomial process. Then Y is binomial with parameters m and π; we write Y ~ Bin(m, π). The probability that there are j successes in m trials (j = 0, 1, ..., m) is P(Y = j) = (m choose j) π^j (1 − π)^(m − j).

+ 8.1: A sample of size mi at every observed value of xi
Mean and variance of Y: E(Y) = mπ and Var(Y) = mπ(1 − π).
In Section 8.1, we assume we have one predictor variable x with mi measurements at each level of x. In this case Yi ~ Bin(mi, π(xi)). Notice that we write the probability of success as a function of the value of the predictor variable x.

+ 8.1: A sample of size mi at every observed value of xi
Why not use proportions instead of odds? The proportion of successes, Yi/mi, could be the response since it is an unbiased estimate of π(xi) and it varies between 0 and 1. BUT: calculate the mean and variance of the sample proportion: E(Yi/mi) = π(xi) and Var(Yi/mi) = π(xi)(1 − π(xi))/mi. The variance is not constant! We need our log odds model.

+ Relationship between π and π(1 − π)
[figure: π(1 − π) plotted against π]

+ Finding Parameter Estimates
The likelihood function, denoted L, is the probability of the data, regarded as a function of the unknown parameters with the data values fixed. (Remember that when we used least squares, we modeled the sums of squares as a function of the parameters with the data held fixed.)¹ We rearrange what is fixed and what is considered random, to think of the likelihood as the likelihood of a value being the parameter, given the data we currently have.
¹ Stat 2: Building Models for a World of Data. Cannon, Ann R., et al. Freeman (2013).

+ Finding Parameter Estimates
The parameters for the logistic regression model are found by maximizing the log-likelihood. This is equivalent to minimizing the deviance, −2 log L. For linear regression models, minimizing the RSS had closed-form solutions; for logistic regression models, we need an iterative method such as Newton-Raphson or iteratively reweighted least squares to find the estimates.

+ Likelihood
[likelihood expression shown on the slide]

+ Hypothesis Tests & Confidence Intervals
The standard error of the parameter estimates is based on the information statistic, the second derivative of the log likelihood.
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.19620    0.11845 -10.099  < 2e-16 ***
xnew        -0.29710    0.05485  -5.416 6.08e-08 ***
Test whether the area of an island is associated with whether a species goes extinct. (Wald test: note z, not t!)

+ Hypothesis Tests & Confidence Intervals
A 95% confidence interval for the parameter β1 is −0.297 ± 1.96 × 0.055 = (−0.405, −0.190); exp(−0.405) = 0.67 and exp(−0.190) = 0.83.
Interpretation: I am 95% confident that when log land area increases by 1, the odds that a species goes extinct on that island are between about 2/3 and 4/5 as large. (Or: the odds are multiplied by between 0.67 and 0.83; the odds decrease.)
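The interval above can be reproduced directly from the printed coefficient table; a small base-R sketch (object names are ours):

# 95% Wald interval for beta_1 in the extinction model, then translated to the odds scale
b1  <- -0.29710
se1 <-  0.05485
ci  <- b1 + c(-1, 1) * qnorm(0.975) * se1   # note z, not t
ci                                          # about (-0.405, -0.190)
exp(ci)                                     # about (0.67, 0.83): odds multiplier per 1-unit increase in log area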
+ Residuals?

+ Deviance
In linear models, we measure how well our model works in part by how close the data are to the model. The concept of residual sums of squares is replaced by deviance for logistic regression.
The saturated model is one with a separate proportion of successes for every value of xi; that is, the saturated estimate is yi/mi.
Example: to estimate the number of species that would go extinct on an island with 185.8 square miles, I could use what happened on Ulkokrunni (5 went extinct), which is the saturated estimate. Or I could use the model to predict 4.5: the linear predictor is −2.748, so π̂ ≈ 0.06, and # extinct = (75 species)(0.06) = 4.5 species.

+ Deviance
[deviance formula]

+ Deviance
Deviance measures the difference between the log likelihood from the saturated model (S) and the log likelihood from our model (M).
yi: the raw number that actually went extinct.
ŷi = mi π̂(xi): the predicted number that went extinct from the logistic model.

+ Deviance
Island           Area    Species at Risk   Ext   Saturated: yi/mi   Logistic model: π̂
Ulkokrunni       185.8   75                 5    0.067              0.060
Maakrunni        105.8   67                 3    0.045              0.070
Ristikari         30.7   66                10    0.152              0.099
Isonkivenletto     8.5   51                 6    0.118              0.138

+ Deviance
When each mi is large enough, the deviance statistic can be used as a chi-squared goodness-of-fit test for the logistic regression model.
We wanted the residual sum of squares to be small because we wanted the model to fit the data well. We also want the deviance to be small because we want the model to fit the data well. If the deviance is large, it means the saturated model has a very different fit to the data from our model of interest.
H0: the logistic regression model is appropriate. Ha: the logistic regression model is inappropriate.
G² has the χ² distribution with n − p − 1 degrees of freedom, where n is the number of binomial samples, not the total sample size! Beware sample sizes that are too large.

+ Deviance
The model summary output gives:
Null deviance:     45.338 on 17 degrees of freedom
Residual deviance: 12.062 on 16 degrees of freedom
The p-value is found by P(G² > 12.062) from a chi-squared distribution with 18 − 1 − 1 = 16 degrees of freedom: 0.74.

+ Deviance
We can also use deviance to test whether two nested models are significantly different. For example, we could test whether our model is equivalent to the null model: H0: β1 = 0 vs. Ha: β1 ≠ 0. The difference between the deviances for the two models is compared to a χ² distribution with degrees of freedom equal to the difference between the degrees of freedom for the two models. In general this doesn't give the same result as the Wald z-test of whether β1 = 0.

+ Deviance
Null deviance:     45.338 on 17 degrees of freedom
Residual deviance: 12.062 on 16 degrees of freedom
45.338 − 12.062 = 33.276, df = 17 − 16 = 1, p-value < 0.0001.
We have strong evidence that the log area of an island is associated with whether a species goes extinct. Notice that the p-value for the Wald test is not the same as that of the deviance test.

+ Deviance: R² for logistic regression
Recall that for linear regression, R² = 1 − RSS/SST. Since the deviance is a generalization of the residual sum of squares in linear regression, one version of R² for logistic regression is 1 − (residual deviance)/(null deviance). So for the island data, R² = 1 − 12.062/45.338 = 0.73.
This still has the issue of increasing when we add useless predictors to the model.

+ Pearson goodness-of-fit statistic
An alternative to the deviance: the Pearson χ² statistic.
Same degrees of freedom as the deviance: df = n − p − 1.
Same requirement: the statistic has the χ² distribution as long as each of the mi is large enough. If this is true, G² and χ² are similar; we prefer G² if they yield different conclusions.¹
¹ McCullagh & Nelder (1989), p. 398: distributional properties of deviance residuals are closer to those of residuals from a Gaussian linear regression model.
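The deviance-based tests above can be verified with one-line chi-squared calculations in R, using only the printed deviances and degrees of freedom:

# Deviance (likelihood-ratio) test of H0: beta_1 = 0, from the printed deviances
G2 <- 45.338 - 12.062                            # 33.276
pchisq(G2, df = 17 - 16, lower.tail = FALSE)     # p-value < 0.0001

# Goodness-of-fit p-value for the fitted model
pchisq(12.062, df = 16, lower.tail = FALSE)      # about 0.74

# One R^2 analogue for logistic regression
1 - 12.062 / 45.338                              # about 0.73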
+ Residuals for Logistic Regression
Three types of residuals:
1. Response residuals
2. Pearson residuals
3. Deviance residuals

+ Response Residuals
Response residuals are the difference between the observed and the fitted proportions, yi/mi − π̂(xi), where π̂(xi) is the ith fitted value from the logistic regression model. The variance of yi/mi is not constant, so response residuals are difficult to interpret in practice.

+ Pearson Residuals
The problem of nonconstant variance is overcome by Pearson residuals, the square roots of the individual contributions to the Pearson χ² statistic.

+ Standardized Pearson Residuals
Pearson residuals still don't account for the variance of the model estimate π̂(xi), so we correct for that by standardizing (dividing by √(1 − hii), where hii is the leverage).

+ Deviance Residuals
Deviance residuals are to the deviance statistic G² as Pearson residuals are to the Pearson χ² statistic. Standardized deviance residuals are defined analogously, dividing each deviance residual by √(1 − hii).

+ Which residuals are best?
Pearson residuals are most popular, but deviance residuals are actually preferred; their distribution is closer to that of least squares residuals.

+ Theory: Part B - One value of y at every value of x

+ 8.2 Binary Logistic Regression
It is more common that we have only one observation at many values of the predictor variable (e.g., many predictors!). Such data are called binary.
Goodness-of-fit measures are problematic and plots of residuals are difficult to interpret. (Two U-shaped curves!)

+ Compare Fits
If we fit the data with the assumption that all the mi equal 1, the parameter estimates, standard errors, Wald z-scores, and p-values are all equivalent. The difference is in the deviance (a short R sketch of this comparison appears at the end of this section):
Binomial fit: Null deviance: 45.338 on 17 degrees of freedom; Residual deviance: 12.062 on 16 degrees of freedom.
Binary fit: Null deviance: 578.01 on 631 degrees of freedom; Residual deviance: 544.74 on 630 degrees of freedom.
AIC values are also different.

+ Binary Deviance
In the case that mi = 1, the log-likelihood function simplifies, and for the saturated model each fitted value equals yi. Whether yi = 0 or yi = 1, each observation contributes 0, so log(LS) = 0.

+ Binary Deviance
So in the case where mi = 1, the deviance between the saturated model and the current model depends only on log(LM). The deviance doesn't provide an assessment of the goodness of fit of the model! It also doesn't have a χ² distribution. However, we can still use deviance to compare two models; the difference between two deviances still has an approximate χ² distribution.

+ Binary Residuals
[figure: standardized Pearson and standardized deviance residuals plotted against Food Rating]

+ Binary Residuals
Residual plots are problematic when the data are binary. Instead of examining residual plots, compare the fitted model to a nonparametric fit.

+ Binary Residuals
[figure]

+ Transformations, Marginal Model Plots, Outliers

+ Transformations
Do we need to transform y? Why or why not?
Reasons to transform x: linear or quadratic relationships between the x's and the log odds; constant variance.

+ Transforming Predictors for Binary Data
Why transform predictor variables? Quick review: suppose 30% of Dalmatians are deaf. If I randomly select 1 Dalmatian, how many are expected to be deaf?
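Before turning to how the predictors themselves can be transformed, here is a hedged sketch of the "Compare Fits" point above. The data frame layouts are assumptions (grouped has one row per island with counts; binary has one row per species with a 0/1 outcome), not the lecture's objects:

# Grouped (binomial) fit: ext = number extinct, atrisk = species at risk, larea = log area
fit_binom  <- glm(cbind(ext, atrisk - ext) ~ larea, family = binomial, data = grouped)
# Binary fit: extinct = 0/1, one row per species
fit_binary <- glm(extinct ~ larea, family = binomial, data = binary)

coef(fit_binom); coef(fit_binary)           # estimates (and SEs, z, p-values) agree
deviance(fit_binom); deviance(fit_binary)   # deviances differ between the two fits
AIC(fit_binom); AIC(fit_binary)             # AIC values differ as well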
+ Transforming Predictors for Binary Data: Binary Predictor
First suppose the predictor is a dummy variable. By Bayes' rule the odds factor as
P(Y = 1 | x) / P(Y = 0 | x) = [P(Y = 1) / P(Y = 0)] × [P(x | Y = 1) / P(x | Y = 0)];
take logs of both sides:
log odds(x) = log[P(Y = 1) / P(Y = 0)] + log[P(x | Y = 1) / P(x | Y = 0)].

+ Transforming Predictors for Binary Data: Continuous Predictor
When X is binary, the second term involves the conditional probabilities P(x | Y = j); similarly, when X is continuous, it involves the conditional densities f(x | Y = j). We ignore the first term either way when discussing transformations of X.

+ Transforming Predictors for Binary Data: Normal Predictor
When f(x | Y = j), j = 0, 1, is a normal density (possibly with two different means and variances), the piece of the log odds we were worried about, log[f(x | Y = 1) / f(x | Y = 0)], can be written out explicitly.

+ Transforming Predictors for Binary Data: Normal Predictor
Conclusions:
1. When x is normal, the log odds are a quadratic function of x.
2. When the variances are equal, the log odds are a linear function of x.

+ Transforming Predictors for Binary Data: Multivariate Normal
After some more math: when we have p predictors that are multivariate normal, with different covariance matrices for Y = 0 and Y = 1, the log odds are a function of xi, xi², and xixj (i, j = 1, ..., p; i ≠ j).
1. If the variance of xi is different for Y = 0 and Y = 1, add a quadratic term in xi.
2. If the regression of xi on xj has a different slope for Y = 0 and Y = 1, add the interaction xixj.

+ Interactions
Recall: an interaction between xi and xj means the relationship between xi and y is different depending on the value of xj. That means the relationship between xi and xj will be different depending on the value of y. Plot xi against xj, fitting separate slopes for the two values of y (0 and 1). The farther apart the slopes, the more important it is to fit an interaction.

+ Transforming Predictors for Binary Data: Poisson Predictor
Suppose X has a Poisson distribution (possibly with different means for Y = 0 and Y = 1 again). Starting over with the piece of the log odds that varies with X, we again end up with the log odds being a linear function of x.

+ Marginal Model Plots
Residual plots are difficult to interpret; instead we use marginal model plots. Same concept as for multiple linear regression: for every variable xi, compare a nonparametric estimate of E(Y | xi) based on the data with one based on the model. If they agree, we conclude that xi is modeled correctly by our model; if not, xi is not modeled correctly by our model.

+ Leverage
Leverages are obtained from the weighted least squares approximation to the MLEs. Average leverage = (p + 1)/n; cutoff = 2(p + 1)/n.

Hint (Question 3.2):
$$SS_{reg} = \sum (\hat{y}_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})\{(\hat{y}_i - y_i) + (y_i - \bar{y})\} = \sum (\hat{y}_i - \bar{y})(\hat{y}_i - y_i) + \sum (\hat{y}_i - \bar{y})(y_i - \bar{y}).$$
What remains to show is that the first of these two summations is zero. For this, begin by writing
$$\sum (\hat{y}_i - \bar{y})(\hat{y}_i - y_i) = -\sum (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})\,\hat{e}_i = -\sum (\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})\,\hat{e}_i = -\hat{\beta}_1 \sum (x_i - \bar{x})\,\hat{e}_i$$
and continue ... . On the other hand, if you think in terms of the geometric picture, it is "obvious" that the first of the two summations is zero, because ... .
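As a quick numeric sanity check of the Question 3 identities, here is a small R sketch on simulated data (our own toy example, not part of the assignment):

set.seed(1)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)
fit  <- lm(y ~ x)
yhat <- fitted(fit)

mean(yhat) - mean(y)                        # 3.1: the means agree (zero up to rounding)
sum((yhat - mean(y))^2)                     # SSreg
sum((yhat - mean(yhat)) * (y - mean(y)))    # 3.2: the same value
cor(y, yhat)^2                              # 3.3: equals R^2
summary(fit)$r.squared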
