
Question

This question involves logistic regression analysis of the Pima data set in R on risk factors for diabetes among Pima women. Your training and holdout data sets will be subsets of the Pima.tr and Pima.te data sets in the library MASS. The binary response variable is "type" (type=Yes for diabetes, type=No for no diabetes).

Get your training set and holdout set with the following in R.

itrain=c(88, 56, 115, 4, 75, 31, 94, 124, 198, 112, 41, 87, 58, 110, 159, 90, 179, 156, 71, 68, 59, 55, 2, 183, 150, 137, 135, 16, 131, 186, 70, 133, 12, 61, 85, 111, 153, 84, 199, 9, 51, 196, 154, 200, 122, 119, 30, 106, 97, 74, 47, 151, 99, 26, 14, 168, 73, 128, 175, 40, 44, 93, 42, 142, 34, 77, 50, 157, 60, 178, 116, 103, 82, 140, 126, 96, 23, 86, 191, 80, 108, 127, 7, 185, 45, 194, 95, 5, 148, 114, 160, 144, 48, 19, 35, 147, 24, 53, 64, 121, 184, 29, 158, 193, 161, 65, 36, 192, 72, 164, 13, 173, 17, 63, 102, 130, 33, 139, 52, 180, 1, 28, 146, 152, 91, 37, 149, 138, 89, 167, 197, 21, 38, 172, 20, 18, 49, 27, 190, 54, 11, 136, 109, 57, 46, 62, 170, 78, 107, 104, 98, 195, 171, 8, 177, 67, 79, 15, 120, 182, 92, 129, 166, 181, 163, 69, 83, 22, 10, 32)
ihold=c(122, 183, 194, 6, 298, 44, 182, 90, 325, 88, 24, 131, 229, 149, 221, 134, 112, 233, 57, 145, 155, 327, 283, 234, 97, 95, 254, 310, 230, 332, 64, 235, 247, 291, 166, 271, 317, 34, 16, 66, 312, 202, 41, 299, 110, 248, 227, 38, 111, 3, 48, 210, 274, 308, 75, 199, 240, 174, 259, 215, 326, 208, 157, 26, 156, 224, 96, 158, 201, 251, 23, 193, 150, 85, 292, 53, 218, 55, 11, 301, 127, 216, 239, 191, 209, 104, 86, 173, 237, 222, 255, 39, 175, 65, 139, 135, 129, 137, 33, 159, 285, 148, 192, 319, 107, 320, 19, 288, 138, 59, 205, 92, 232, 45, 200, 287, 91, 62, 40, 290, 165, 67, 132, 214, 295, 73, 226, 187, 212, 228, 204, 123, 181, 176, 231, 275, 37, 29, 78, 21, 100, 116, 109, 115, 68, 42, 15, 144, 1, 177, 141, 171, 249, 14, 170, 119, 180, 5, 270, 152, 314, 322, 323, 315, 32, 20, 265, 258, 302, 304, 54, 185, 60, 241, 136, 154, 61, 130, 52, 243, 18, 101, 256, 69, 151, 102, 313, 70, 108, 294, 278, 82, 35, 250, 105, 124, 74, 13, 198, 164)
library(MASS)
data(Pima.tr)
mytrain=Pima.tr[itrain,]
data(Pima.te)
myhold=Pima.te[ihold,]

Next do the following:

(1) Fit the logistic regression model with all 7 explanatory variables npreg, glu, bp, skin, bmi, ped, age. Call this model 1.
(2) Fit the logistic regression model with the 4 explanatory variables glu, bmi, ped, age (this is the best model from backward elimination if all cases of Pima.tr are used). Call this model 2.
(3) Apply both models 1 and 2 to the holdout data set and get the predicted probabilities. Classify a case as diabetes if the predicted probability is at least (>=) 0.5, and otherwise classify it as non-diabetes.
(4) For models 1 and 2, get the total number of misclassifications. Which model is better based on this criterion?
(5) For models 1 and 2, compare the misclassification tables if one classifies a case as diabetes when the predicted probability is at least (>=) 0.3, and otherwise as non-diabetes. Which is the better boundary to use?

You will be asked to supply some numbers below from doing the above.

Part a) For model 1, the regression coefficient for ped is: ____
Part b) For model 2, the regression coefficient for age is: ____
Part c) For the first subject in the holdout set, the predicted probability is: ____ for model 1, ____ for model 2.
Part d) Use a boundary of 0.5 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.5) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: ____ for model 1, ____ for model 2.
Part e) With a boundary of 0.5 in predicted probabilities, the better model with the lower misclassification rate is model: ____ (enter 1 or 2, and enter 2 in case of a tie).
Part f) Use a boundary of 0.3 in the predicted probabilities to decide on diabetes (predicted probability greater than or equal to 0.3) or non-diabetes. The total number of misclassifications of the 200 cases in the holdout set is: ____ for model 1, ____ for model 2.

There is no question on the better boundary to use, because that depends on the relative seriousness of the two types of misclassification errors.
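
One possible way to carry out steps (1) to (5) is sketched below. It is only a sketch under stated assumptions, not the official solution: it expects mytrain and myhold to have been created with the code from the question, and falls back to the full Pima.tr and Pima.te data sets if they are missing (in which case the numbers will not match the assignment). The helpers misclass_table and n_misclassified are hypothetical names introduced here for illustration.

```r
library(MASS)  # provides Pima.tr and Pima.te

# Assumption: mytrain/myhold come from the question's itrain/ihold code.
# If they are not defined, fall back to the full data sets so the sketch runs;
# results from the fallback do NOT answer the assignment.
if (!exists("mytrain")) mytrain <- Pima.tr
if (!exists("myhold"))  myhold  <- Pima.te

# Step (1): model 1 uses all 7 explanatory variables.
model1 <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age,
              family = binomial, data = mytrain)
# Step (2): model 2 uses the 4-variable subset.
model2 <- glm(type ~ glu + bmi + ped + age,
              family = binomial, data = mytrain)

# Parts a and b: individual regression coefficients.
print(coef(model1)["ped"])
print(coef(model2)["age"])

# Step (3): predicted probabilities of type == "Yes" on the holdout set.
prob1 <- predict(model1, newdata = myhold, type = "response")
prob2 <- predict(model2, newdata = myhold, type = "response")

# Part c: predicted probability for the first holdout subject.
print(c(model1 = unname(prob1[1]), model2 = unname(prob2[1])))

# Hypothetical helper: cross-tabulate actual vs predicted class,
# classifying as "Yes" when the probability is >= boundary.
misclass_table <- function(prob, actual, boundary) {
  predicted <- factor(ifelse(prob >= boundary, "Yes", "No"),
                      levels = c("No", "Yes"))
  table(actual = actual, predicted = predicted)
}

# Hypothetical helper: off-diagonal cells are the misclassified cases.
n_misclassified <- function(tab) sum(tab) - sum(diag(tab))

# Steps (4) and (5): tables and totals at both boundaries.
for (boundary in c(0.5, 0.3)) {
  tab1 <- misclass_table(prob1, myhold$type, boundary)
  tab2 <- misclass_table(prob2, myhold$type, boundary)
  cat("Boundary", boundary, "\n")
  print(tab1)
  print(tab2)
  cat("Misclassified: model 1 =", n_misclassified(tab1),
      ", model 2 =", n_misclassified(tab2), "\n")
}
```

Printing the full tables rather than only the totals shows the two kinds of error separately (diabetics classified as non-diabetic, and the reverse), which is what the comparison between the 0.5 and 0.3 boundaries turns on.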


