Question

1 Approved Answer

Posted on Sep 09, 2024

BreastCancer

In an RStudio Rmd file, load the mlbench library, installing it first if needed. Load the data frame into memory with data(BreastCancer). Run str() and head() on BreastCancer and summary() on just the Class column. Use R instructions to calculate the percentage of observations in each class, and print them with an appropriate heading using paste().
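A minimal sketch of this first step is shown here. The variable name class_pct and the CRAN mirror URL are choices made for illustration, not part of the assignment.

```r
# Install mlbench only if it is not already available, then load it.
# The CRAN mirror below is an assumption; any mirror works.
if (!requireNamespace("mlbench", quietly = TRUE)) {
  install.packages("mlbench", repos = "https://cloud.r-project.org")
}
library(mlbench)

data(BreastCancer)            # load the data frame into memory

str(BreastCancer)             # structure: column types and factor levels
head(BreastCancer)            # first six rows
summary(BreastCancer$Class)   # counts of benign and malignant

# Percent of observations in each class, rounded to one decimal place.
class_pct <- round(100 * prop.table(table(BreastCancer$Class)), 1)
print("Percent of observations in each class:")
print(paste(names(class_pct), ":", class_pct, "%"))
```

prop.table() turns the class counts into proportions, so the two percentages should add up to roughly 100 after rounding.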

Cell.size and Cell.shape each take one of 10 levels. Build a logistic regression model called glm0, where Class is predicted by Cell.size and Cell.shape. Do you get any error or warning messages? Run summary() on glm0 to confirm that it did build a model. Write a comment about why you think you got this warning message and what you could possibly do about it.
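One way to fit glm0 and keep a record of any warning text is sketched below. The fit_warnings vector and the handler are illustrative additions, not required by the assignment; a plain glm() call prints the same warnings to the console.

```r
library(mlbench)
data(BreastCancer)

# Collect warning messages raised during fitting so they can be inspected.
fit_warnings <- character(0)
glm0 <- withCallingHandlers(
  glm(Class ~ Cell.size + Cell.shape, data = BreastCancer, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")   # continue fitting after recording the warning
  }
)

print(fit_warnings)   # empty if no warning was raised
summary(glm0)         # confirms that a model was built
```

If the recorded message mentions fitted probabilities numerically equal to 0 or 1, that usually points to factor levels that separate the two classes almost perfectly, which is worth discussing in your comment.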

Notice in the summary() of glm0 that most of the levels of Cell.size and Cell.shape became predictors and that they had very low p-values. We won't be able to build a good logistic regression model this way. It might be better to have just 2 levels for each variable. In this step, add two new columns to BreastCancer as listed below. Run summary() on Cell.size and Cell.shape as well as on the new columns. Comment on the distribution of the new columns. Do you think what we did is a good idea? Why?

a. Cell.small which is a binary factor that is 1 if Cell.size==1 and 0 otherwise

b. Cell.regular which is a binary factor that is 1 if Cell.shape==1 and 0 otherwise
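The two new columns can be sketched as follows, assuming the value 1 is coded as the factor level "1" and everything else as "0".

```r
library(mlbench)
data(BreastCancer)

# Binary factors: 1 if the original value equals 1, 0 otherwise.
BreastCancer$Cell.small   <- factor(ifelse(BreastCancer$Cell.size == 1, 1, 0))
BreastCancer$Cell.regular <- factor(ifelse(BreastCancer$Cell.shape == 1, 1, 0))

# Distribution of the original 10-level columns and of the new 2-level columns.
summary(BreastCancer$Cell.size)
summary(BreastCancer$Cell.shape)
summary(BreastCancer$Cell.small)
summary(BreastCancer$Cell.regular)
```

Comparing the counts for level 1 in the original columns with the counts for "1" in the new columns is a quick check that the recoding did what was intended.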

Create conditional density plots using the original Cell.size and Cell.shape. First attach() the data to reduce typing. Then use par(mfrow=c(1,2)) to set up a 1x2 grid for two cdplot() graphs with Class~Cell.size and Class~Cell.shape. Observing the plots, write a sentence or two comparing size and malignant, and shape and malignant. Do you think our cutoff points for size==1 and shape==1 were justified now that you see this graph? Why or why not?
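A sketch of the two conditional density plots follows. It makes two assumptions that differ from the wording above: the predictors are converted to numeric, because cdplot() expects a numeric explanatory variable, and a small helper data frame (plot_df, a hypothetical name) is passed through data= instead of calling attach().

```r
library(mlbench)
data(BreastCancer)

# Assumption: convert the 10-level factors to their numeric values for cdplot().
plot_df <- data.frame(
  Class = BreastCancer$Class,
  size  = as.numeric(as.character(BreastCancer$Cell.size)),
  shape = as.numeric(as.character(BreastCancer$Cell.shape))
)

old_par <- par(mfrow = c(1, 2))   # 1x2 grid; remember the previous settings
cdplot(Class ~ size,  data = plot_df, xlab = "Cell.size")
cdplot(Class ~ shape, data = plot_df, xlab = "Cell.shape")
par(old_par)                      # restore the previous layout
```

Each plot shows how the share of malignant observations changes as the predictor increases, which is what the question about the cutoff at 1 asks you to judge.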

Create plots (not cdplots) with our new columns. Again, use par(mfrow=c(1,2)) to set up a 1x2 grid for two plot() graphs with Class~Cell.small and Class~Cell.regular. Now create two cdplot() graphs for the new columns. Then compute the following and provide a summary in the text portion of this answer. Also indicate, based on these results, whether you think small and regular will be good predictors. Calculate:

a. percent of small observations that are malignant

b. percent of not-small observations that are malignant

c. percent of regular observations that are malignant

d. percent of non-regular observations that are malignant
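The plots and the four percentages might be computed along these lines. The helper pct_malignant() and the data frame plot_df are hypothetical names, and the numeric 0/1 copies exist only because cdplot() expects a numeric explanatory variable.

```r
library(mlbench)
data(BreastCancer)
BreastCancer$Cell.small   <- factor(ifelse(BreastCancer$Cell.size == 1, 1, 0))
BreastCancer$Cell.regular <- factor(ifelse(BreastCancer$Cell.shape == 1, 1, 0))

# plot() with a factor response and a factor predictor draws a spine plot.
old_par <- par(mfrow = c(1, 2))
plot(Class ~ Cell.small,   data = BreastCancer)
plot(Class ~ Cell.regular, data = BreastCancer)

# Assumption: numeric 0/1 copies of the new columns for cdplot().
plot_df <- data.frame(
  Class   = BreastCancer$Class,
  small   = as.numeric(as.character(BreastCancer$Cell.small)),
  regular = as.numeric(as.character(BreastCancer$Cell.regular))
)
cdplot(Class ~ small,   data = plot_df)
cdplot(Class ~ regular, data = plot_df)
par(old_par)

# Percent malignant within a logical subset of rows (hypothetical helper).
pct_malignant <- function(rows) {
  round(100 * mean(BreastCancer$Class[rows] == "malignant"), 1)
}
small   <- BreastCancer$Cell.small == 1
regular <- BreastCancer$Cell.regular == 1

print(paste("a. small, % malignant:",       pct_malignant(small)))
print(paste("b. not small, % malignant:",   pct_malignant(!small)))
print(paste("c. regular, % malignant:",     pct_malignant(regular)))
print(paste("d. not regular, % malignant:", pct_malignant(!regular)))
```

A large gap between a and b, or between c and d, suggests the corresponding binary column carries information about Class.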

Randomly divide BreastCancer into two data sets: train (80% of the data) and test (20%). Make sure you first set the seed to 1234 so you get the same results as others.
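A possible way to do the split is sketched below; the names train_idx, train, and test are assumptions, and round() is used so the training size is a whole number.

```r
library(mlbench)
data(BreastCancer)

set.seed(1234)                      # fix the random number stream first
n <- nrow(BreastCancer)
train_idx <- sample(seq_len(n), size = round(0.8 * n), replace = FALSE)

train <- BreastCancer[train_idx, ]  # 80% of the rows
test  <- BreastCancer[-train_idx, ] # the remaining 20%

dim(train)
dim(test)
```

Note that R 3.6.0 changed the default algorithm used by sample(), so the same seed can select different rows on older R versions.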

Build a logistic regression classifier to estimate the probability of Class given Cell.small and Cell.regular. Run summary() on your model. Answer the following:

a. Which predictor(s) seem to be good predictors?

b. Comment on the Null deviance versus the Residual deviance. Comment on the AIC score.

c. Test the model on the test data and compute accuracy. What percent accuracy did you get?
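Fitting the classifier and scoring it on the test set could look like this. The 0.5 probability cutoff and the names probs, pred, and acc are assumptions; the assignment does not specify a threshold.

```r
library(mlbench)
data(BreastCancer)
BreastCancer$Cell.small   <- factor(ifelse(BreastCancer$Cell.size == 1, 1, 0))
BreastCancer$Cell.regular <- factor(ifelse(BreastCancer$Cell.shape == 1, 1, 0))

set.seed(1234)
n <- nrow(BreastCancer)
train_idx <- sample(seq_len(n), size = round(0.8 * n), replace = FALSE)
train <- BreastCancer[train_idx, ]
test  <- BreastCancer[-train_idx, ]

glm1 <- glm(Class ~ Cell.small + Cell.regular, data = train, family = binomial)
summary(glm1)   # coefficients, null and residual deviance, AIC

# type = "response" gives P(malignant), since malignant is the second level of Class.
probs <- predict(glm1, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, "malignant", "benign")   # assumed 0.5 cutoff
acc   <- mean(pred == test$Class)

table(predicted = pred, actual = test$Class)          # confusion matrix
print(paste("Test accuracy (%):", round(100 * acc, 1)))
```

The residual deviance should be compared against the null deviance from the same summary() output: the bigger the drop, the more the two predictors explain.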

Your coefficients from the model are in units of logits. Extract the coefficient of small with glm1$coefficients[]. Answer:

a. What is the coefficient?

b. How do you interpret this value?

c. Find the estimated probability of malignancy if Cell.small is true using exp().

d. Find the probability of malignancy if Cell.small is true over the whole BreastCancer data set and compare the results. Are they close?
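One reading of these four parts is sketched below. It assumes the coefficient is named "Cell.small1" (factor name plus level "1") and, for part c, that Cell.regular is held at 0 and the intercept is included; other readings of part c are possible.

```r
library(mlbench)
data(BreastCancer)
BreastCancer$Cell.small   <- factor(ifelse(BreastCancer$Cell.size == 1, 1, 0))
BreastCancer$Cell.regular <- factor(ifelse(BreastCancer$Cell.shape == 1, 1, 0))

set.seed(1234)
n <- nrow(BreastCancer)
train_idx <- sample(seq_len(n), size = round(0.8 * n), replace = FALSE)
train <- BreastCancer[train_idx, ]
glm1 <- glm(Class ~ Cell.small + Cell.regular, data = train, family = binomial)

# a. Coefficient of small, in log-odds (logit) units.
b_small <- glm1$coefficients["Cell.small1"]
print(b_small)

# b. exp() of the coefficient is the odds ratio for small versus not small.
print(exp(b_small))

# c. Assumption: Cell.regular = 0, so the logit is intercept + b_small.
logit_small <- glm1$coefficients["(Intercept)"] + b_small
p_model <- exp(logit_small) / (1 + exp(logit_small))
print(paste("Model probability of malignancy, small:", round(p_model, 4)))

# d. Observed share of malignant cases among small observations, whole data set.
p_data <- mean(BreastCancer$Class[BreastCancer$Cell.small == 1] == "malignant")
print(paste("Observed probability of malignancy, small:", round(p_data, 4)))
```

The formula exp(x) / (1 + exp(x)) is the inverse of the logit, which is how a log-odds value is turned back into a probability.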

Build two more models, one using only Cell.small and one using only Cell.regular, and use anova(glm_small, glm_regular, glm1) to compare all 3 models, substituting whatever names you used for your models. Analyze the results of the anova(). Also, compare the AIC scores of the 3 models.
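The comparison could be set up as in this sketch, which assumes all three models are fitted on the same training split so that the deviances and AIC values are comparable.

```r
library(mlbench)
data(BreastCancer)
BreastCancer$Cell.small   <- factor(ifelse(BreastCancer$Cell.size == 1, 1, 0))
BreastCancer$Cell.regular <- factor(ifelse(BreastCancer$Cell.shape == 1, 1, 0))

set.seed(1234)
n <- nrow(BreastCancer)
train_idx <- sample(seq_len(n), size = round(0.8 * n), replace = FALSE)
train <- BreastCancer[train_idx, ]

glm_small   <- glm(Class ~ Cell.small,                data = train, family = binomial)
glm_regular <- glm(Class ~ Cell.regular,              data = train, family = binomial)
glm1        <- glm(Class ~ Cell.small + Cell.regular, data = train, family = binomial)

# Analysis of deviance table for the three models, in the order given.
anova(glm_small, glm_regular, glm1)

# AIC for each model in one table; lower values indicate a better trade-off.
aic_table <- AIC(glm_small, glm_regular, glm1)
print(aic_table)
```

Since glm_small and glm_regular are both nested inside glm1, the residual deviance of glm1 cannot be higher than either of theirs; the question is whether the drop is large enough to justify the extra predictor.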