Question
This question involves a logistic regression analysis, in R, of the Pima data set on risk factors for diabetes among Pima women. Your training and holdout data sets are subsets of the Pima.tr and Pima.te data sets in the MASS library. The binary response variable is type (type = Yes for diabetes, type = No for no diabetes). Construct your training and holdout sets with the following R code:

```r
itrain = c(88, 56, 115, 4, 75, 31, 94, 124, 198, 112, 41, 87, 58, 110, 159, 90, 179, 156, 71, 68,
           59, 55, 2, 183, 150, 137, 135, 16, 131, 186, 70, 133, 12, 61, 85, 111, 153, 84, 199, 9,
           51, 196, 154, 200, 122, 119, 30, 106, 97, 74, 47, 151, 99, 26, 14, 168, 73, 128, 175, 40,
           44, 93, 42, 142, 34, 77, 50, 157, 60, 178, 116, 103, 82, 140, 126, 96, 23, 86, 191, 80,
           108, 127, 7, 185, 45, 194, 95, 5, 148, 114, 160, 144, 48, 19, 35, 147, 24, 53, 64, 121,
           184, 29, 158, 193, 161, 65, 36, 192, 72, 164, 13, 173, 17, 63, 102, 130, 33, 139, 52, 180,
           1, 28, 146, 152, 91, 37, 149, 138, 89, 167, 197, 21, 38, 172, 20, 18, 49, 27, 190, 54,
           11, 136, 109, 57, 46, 62, 170, 78, 107, 104, 98, 195, 171, 8, 177, 67, 79, 15, 120, 182,
           92, 129, 166, 181, 163, 69, 83, 22, 10, 32)
ihold = c(122, 183, 194, 6, 298, 44, 182, 90, 325, 88, 24, 131, 229, 149, 221, 134, 112, 233, 57, 145,
          155, 327, 283, 234, 97, 95, 254, 310, 230, 332, 64, 235, 247, 291, 166, 271, 317, 34, 16, 66,
          312, 202, 41, 299, 110, 248, 227, 38, 111, 3, 48, 210, 274, 308, 75, 199, 240, 174, 259, 215,
          326, 208, 157, 26, 156, 224, 96, 158, 201, 251, 23, 193, 150, 85, 292, 53, 218, 55, 11, 301,
          127, 216, 239, 191, 209, 104, 86, 173, 237, 222, 255, 39, 175, 65, 139, 135, 129, 137, 33, 159,
          285, 148, 192, 319, 107, 320, 19, 288, 138, 59, 205, 92, 232, 45, 200, 287, 91, 62, 40, 290,
          165, 67, 132, 214, 295, 73, 226, 187, 212, 228, 204, 123, 181, 176, 231, 275, 37, 29, 78, 21,
          100, 116, 109, 115, 68, 42, 15, 144, 1, 177, 141, 171, 249, 14, 170, 119, 180, 5, 270, 152,
          314, 322, 323, 315, 32, 20, 265, 258, 302, 304, 54, 185, 60, 241, 136, 154, 61, 130, 52, 243,
          18, 101, 256, 69, 151, 102, 313, 70, 108, 294, 278, 82, 35, 250, 105, 124, 74, 13, 198, 164)
library(MASS)
data(Pima.tr)
mytrain = Pima.tr[itrain, ]
data(Pima.te)
myhold = Pima.te[ihold, ]
```

Next do the following:

(1) Fit the logistic regression model with all 7 explanatory variables: npreg, glu, bp, skin, bmi, ped, age. Call this model 1.
(2) Fit the logistic regression model with the 4 explanatory variables glu, bmi, ped, age (this is the best model from backward elimination when all cases of Pima.tr are used). Call this model 2.
(3) Apply both models 1 and 2 to the holdout data set and get the predicted probabilities. Classify a case as diabetes if the predicted probability is at least (>=) 0.5, and otherwise classify it as non-diabetes.
(4) For models 1 and 2, get the total number of misclassifications. Which model is better based on this criterion?
(5) For models 1 and 2, compare the misclassification tables if one instead classifies a case as diabetes when the predicted probability is at least (>=) 0.3, and otherwise as non-diabetes. Which is the better boundary to use?

You will be asked to supply some numbers below from doing the above.

Part a) For model 1, the regression coefficient for ped is:
Part b) For model 2, the regression coefficient for age is:
Part c) For the first subject in the holdout set, the predicted probability is: for model 1, for model 2.
Part d) Use a boundary of 0.5 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.5) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: for model 1, for model 2.
Part e) With a boundary of 0.5 in predicted probabilities, the better model with the lower misclassification rate is model: (enter 1 or 2; enter model 2 in case of a tie).
Part f) Use a boundary of 0.3 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.3) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: for model 1, for model 2. There is no question on the better boundary to use, because that depends on the relative seriousness of the two types of misclassification errors.
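Steps (1)-(5) can be sketched in R as follows. This is an outline of the workflow, not the graded answers: it assumes mytrain and myhold were built exactly as above, and the helper function miscount is introduced here for illustration.

```r
library(MASS)  # provides Pima.tr and Pima.te

# (1) Model 1: all 7 explanatory variables
model1 <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age,
              family = binomial, data = mytrain)

# (2) Model 2: the 4-variable model from backward elimination
model2 <- glm(type ~ glu + bmi + ped + age,
              family = binomial, data = mytrain)

# Parts a) and b): individual regression coefficients
coef(model1)["ped"]
coef(model2)["age"]

# (3) Predicted probabilities on the holdout set
p1 <- predict(model1, newdata = myhold, type = "response")
p2 <- predict(model2, newdata = myhold, type = "response")
p1[1]  # part c), model 1
p2[1]  # part c), model 2

# (4) Total misclassifications at a given cutoff
miscount <- function(p, actual, cutoff) {
  pred <- ifelse(p >= cutoff, "Yes", "No")
  sum(pred != actual)
}
miscount(p1, myhold$type, 0.5)  # part d), model 1
miscount(p2, myhold$type, 0.5)  # part d), model 2

# (5) and part f): repeat with the 0.3 boundary, plus full tables
miscount(p1, myhold$type, 0.3)
miscount(p2, myhold$type, 0.3)
table(predicted = ifelse(p1 >= 0.3, "Yes", "No"), actual = myhold$type)
table(predicted = ifelse(p2 >= 0.3, "Yes", "No"), actual = myhold$type)
```

The two table() calls show the breakdown into false positives and false negatives, which is what part f)'s closing remark about the "relative seriousness of the two types of misclassification errors" refers to.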