Question
This question involves a logistic regression analysis, in R, of the Pima data set on risk factors for diabetes among Pima women. Your training and holdout data sets are subsets of the Pima.tr and Pima.te data sets in the MASS library. The binary response variable is type (type = Yes for diabetes, type = No for no diabetes). Construct your training and holdout sets with the following R code:

```r
itrain = c(88, 56, 115, 4, 75, 31, 94, 124, 198, 112, 41, 87, 58, 110, 159, 90, 179, 156, 71, 68,
           59, 55, 2, 183, 150, 137, 135, 16, 131, 186, 70, 133, 12, 61, 85, 111, 153, 84, 199, 9,
           51, 196, 154, 200, 122, 119, 30, 106, 97, 74, 47, 151, 99, 26, 14, 168, 73, 128, 175, 40,
           44, 93, 42, 142, 34, 77, 50, 157, 60, 178, 116, 103, 82, 140, 126, 96, 23, 86, 191, 80,
           108, 127, 7, 185, 45, 194, 95, 5, 148, 114, 160, 144, 48, 19, 35, 147, 24, 53, 64, 121,
           184, 29, 158, 193, 161, 65, 36, 192, 72, 164, 13, 173, 17, 63, 102, 130, 33, 139, 52, 180,
           1, 28, 146, 152, 91, 37, 149, 138, 89, 167, 197, 21, 38, 172, 20, 18, 49, 27, 190, 54,
           11, 136, 109, 57, 46, 62, 170, 78, 107, 104, 98, 195, 171, 8, 177, 67, 79, 15, 120, 182,
           92, 129, 166, 181, 163, 69, 83, 22, 10, 32)
ihold = c(122, 183, 194, 6, 298, 44, 182, 90, 325, 88, 24, 131, 229, 149, 221, 134, 112, 233, 57, 145,
          155, 327, 283, 234, 97, 95, 254, 310, 230, 332, 64, 235, 247, 291, 166, 271, 317, 34, 16, 66,
          312, 202, 41, 299, 110, 248, 227, 38, 111, 3, 48, 210, 274, 308, 75, 199, 240, 174, 259, 215,
          326, 208, 157, 26, 156, 224, 96, 158, 201, 251, 23, 193, 150, 85, 292, 53, 218, 55, 11, 301,
          127, 216, 239, 191, 209, 104, 86, 173, 237, 222, 255, 39, 175, 65, 139, 135, 129, 137, 33, 159,
          285, 148, 192, 319, 107, 320, 19, 288, 138, 59, 205, 92, 232, 45, 200, 287, 91, 62, 40, 290,
          165, 67, 132, 214, 295, 73, 226, 187, 212, 228, 204, 123, 181, 176, 231, 275, 37, 29, 78, 21,
          100, 116, 109, 115, 68, 42, 15, 144, 1, 177, 141, 171, 249, 14, 170, 119, 180, 5, 270, 152,
          314, 322, 323, 315, 32, 20, 265, 258, 302, 304, 54, 185, 60, 241, 136, 154, 61, 130, 52, 243,
          18, 101, 256, 69, 151, 102, 313, 70, 108, 294, 278, 82, 35, 250, 105, 124, 74, 13, 198, 164)
library(MASS)
data(Pima.tr)
mytrain = Pima.tr[itrain, ]
data(Pima.te)
myhold = Pima.te[ihold, ]
```

Next do the following:

(1) Fit the logistic regression model with all 7 explanatory variables: npreg, glu, bp, skin, bmi, ped, age. Call this model 1.
(2) Fit the logistic regression model with the 4 explanatory variables glu, bmi, ped, age (this is the best model from backward elimination when all cases of Pima.tr are used). Call this model 2.
(3) Apply both models 1 and 2 to the holdout data set and get the predicted probabilities. Classify a case as diabetes if the predicted probability is at least (>=) 0.5, and otherwise classify it as non-diabetes.
(4) For models 1 and 2, get the total number of misclassifications. Which model is better based on this criterion?
(5) For models 1 and 2, compare the misclassification tables if one instead classifies a case as diabetes when the predicted probability is at least (>=) 0.3, and otherwise as non-diabetes. Which is the better boundary to use?

You will be asked to supply some numbers below from doing the above.

Part a) For model 1, the regression coefficient for ped is:
Part b) For model 2, the regression coefficient for age is:
Part c) For the first subject in the holdout set, the predicted probability is: for model 1, for model 2.
Part d) Use a boundary of 0.5 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.5) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: for model 1, for model 2.
Part e) With a boundary of 0.5 in predicted probabilities, the better model with the lower misclassification rate is model: (enter 1 or 2; enter model 2 in case of a tie).
Part f) Use a boundary of 0.3 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.3) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: for model 1, for model 2. There is no question on the better boundary to use, because that depends on the relative seriousness of the two types of misclassification errors.
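Steps (1)-(5) can be sketched in R as follows. This is an outline of the workflow, not the graded answers: it assumes mytrain and myhold were built exactly as above, and the helper function miscount is introduced here for illustration.

```r
library(MASS)  # provides Pima.tr and Pima.te

# (1) Model 1: all 7 explanatory variables
model1 <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age,
              family = binomial, data = mytrain)

# (2) Model 2: the 4-variable model from backward elimination
model2 <- glm(type ~ glu + bmi + ped + age,
              family = binomial, data = mytrain)

# Parts a) and b): individual regression coefficients
coef(model1)["ped"]
coef(model2)["age"]

# (3) Predicted probabilities on the holdout set
p1 <- predict(model1, newdata = myhold, type = "response")
p2 <- predict(model2, newdata = myhold, type = "response")
p1[1]  # part c), model 1
p2[1]  # part c), model 2

# (4) Total misclassifications at a given cutoff
miscount <- function(p, actual, cutoff) {
  pred <- ifelse(p >= cutoff, "Yes", "No")
  sum(pred != actual)
}
miscount(p1, myhold$type, 0.5)  # part d), model 1
miscount(p2, myhold$type, 0.5)  # part d), model 2

# (5) and part f): repeat with the 0.3 boundary, plus full tables
miscount(p1, myhold$type, 0.3)
miscount(p2, myhold$type, 0.3)
table(predicted = ifelse(p1 >= 0.3, "Yes", "No"), actual = myhold$type)
table(predicted = ifelse(p2 >= 0.3, "Yes", "No"), actual = myhold$type)
```

The two table() calls show the breakdown into false positives and false negatives, which is what part f)'s closing remark about the "relative seriousness of the two types of misclassification errors" refers to.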