Question

Posted on May 19, 2024

Please use the link below to answer the question. For this assignment please use R or R Studio.

https://drive.google.com/file/d/1JUfmGe_3Tr_3hkM55dV-rAYCBuTizIsX/view?usp=sharing

PREREQUISITES - Getting Set Up

First, set the working directory to the folder where you downloaded the data file above. Second, add the following two lines to your code file AS THE FIRST TWO LINES (if you are using a code file), or else copy them into your R console BEFORE DOING ANYTHING ELSE (if you are working directly in the command line):

o library(caret)
o set.seed(32343)

The first command loads the caret package for machine learning, and set.seed() fixes the seed of the random number generator so that everyone gets the same answer when using randomized methods such as random forests. YOU MUST MAKE SURE TO DO THESE TWO STEPS BEFORE PROCEEDING OR YOUR ANSWERS WILL NOT BE CORRECT.

1. In this assignment we are going to build predictive models that use cell/tissue sample information to predict whether a sample is normal or breast cancer. To begin, read the above data set and use head() to inspect the first few rows. Which column(s) represent the dependent variable (i.e. the one we want to predict)? (1 point)

Patient.ID Clump.Thickness Size.Uniformity Shape.Uniformity Marginal.Adhesion Epithelial.Cell.Size Bare.Nuclei Bland.Chromatin Normal.Nuclei Mitosis Diagnosis

2. Looking at the data above, which column(s) represent the independent variable(s) (i.e. the one(s) which may conceivably have a predictive relationship with what we are trying to predict)? (1 point)

Patient.ID Clump.Thickness Size.Uniformity Shape.Uniformity Marginal.Adhesion Epithelial.Cell.Size Bare.Nuclei Bland.Chromatin Normal.Nuclei Mitosis Diagnosis
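As an illustration of questions 1 and 2, the sketch below reads a CSV, inspects it with head(), and separates the column names into an outcome and a set of predictors. The real file name is not given in the text, so the example writes two made-up rows to a temporary file; the label spelling "Benign" and the choice to treat Patient.ID as a mere identifier are assumptions.

```r
# Hypothetical stand-in for the assignment's data file: two made-up rows
# written to a temporary CSV so the example runs without the real download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  paste("Patient.ID", "Clump.Thickness", "Size.Uniformity", "Shape.Uniformity",
        "Marginal.Adhesion", "Epithelial.Cell.Size", "Bare.Nuclei",
        "Bland.Chromatin", "Normal.Nuclei", "Mitosis", "Diagnosis", sep = ","),
  "1001,5,1,1,1,2,1,3,1,1,Benign",
  "1002,8,10,10,8,7,10,9,7,1,Malignant"
), csv_path)

cells <- read.csv(csv_path, stringsAsFactors = TRUE)
print(head(cells))  # shows the first rows (at most six)

# Assumption: Diagnosis is the outcome and Patient.ID is only an identifier.
outcome    <- "Diagnosis"
predictors <- setdiff(names(cells), c("Patient.ID", outcome))
print(predictors)
```

With the real data, csv_path would be replaced by the name of the downloaded file in the working directory.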

3. How many benign cases are found in this data? (Enter only a number) *Hint: Use subsetting to find this. (1 point) Answer:

4. How many malignant cases are found in this data? (Enter only a number) (1 point)

Answer:

5. Is this data set balanced? (1 point)

Yes, it is approximately balanced
No, it is slightly imbalanced (i.e. one class is no more than twice the size of the other)
No, it is moderately imbalanced (one class is between 2 to 5 times the size of the other)
No, it is highly imbalanced (one class is more than 5 times the size of the other)
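The counting in questions 3 to 5 can be sketched with base R subsetting. The label vector here is made up (13 and 7 cases) and only stands in for the Diagnosis column; the spelling "Benign" is an assumption.

```r
# Made-up label vector standing in for the Diagnosis column.
diagnosis <- factor(c(rep("Benign", 13), rep("Malignant", 7)))

# Subsetting: keep only the matching labels and count them.
n_benign    <- length(diagnosis[diagnosis == "Benign"])
n_malignant <- length(diagnosis[diagnosis == "Malignant"])

# table() gives both counts in one call.
counts <- table(diagnosis)
print(counts)

# Ratio of the larger class to the smaller one, for the balance question.
imbalance_ratio <- max(counts) / min(counts)
print(imbalance_ratio)
```

The ratio can then be compared against the thresholds in question 5 (2 times and 5 times).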

INSTRUCTIONS: Please carry out the following steps to prepare your data for further analysis before proceeding to the next questions:

1. Use data slicing to get rid of any columns in the data set which are neither dependent nor independent variables as you specified above.
2. Split the data into training and test sets. The training set should contain exactly 70% of the data and the test set should contain the remaining 30%.
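The two preparation steps might look like the following sketch. It uses base R's sample() on a made-up data frame so that it runs without extra packages; the course may instead expect caret's createDataPartition(), which stratifies by class and would select different rows, so the numbers produced here are not the assignment's answers.

```r
set.seed(32343)  # seed given in the assignment

# Made-up data frame with a subset of the assignment's column names.
cells <- data.frame(
  Patient.ID      = 1:20,
  Clump.Thickness = sample(1:10, 20, replace = TRUE),
  Mitosis         = sample(1:10, 20, replace = TRUE),
  Diagnosis       = factor(rep(c("Benign", "Malignant"), times = c(13, 7)))
)

# Step 1: drop the identifier column by name (assumed not to be a predictor).
cells <- cells[, names(cells) != "Patient.ID"]

# Step 2: draw 70% of the row indices at random for training.
train_idx <- sample(nrow(cells), size = round(0.7 * nrow(cells)))
train_set <- cells[train_idx, ]
test_set  <- cells[-train_idx, ]

print(c(train = nrow(train_set), test = nrow(test_set)))
```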

MODEL 1 - Logistic Regression: Construct a logistic regression model to predict the dependent variable identified above from the independent variables you identified (in questions 1 and 2 above). DO NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your model on the training data and test/evaluate its performance using the test data. When evaluating your results, make sure to set the positive class to 'Malignant'. You will use the results of this model evaluation to answer the next 8 questions.
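A minimal sketch of the train-then-evaluate workflow for logistic regression, using base R's glm() on made-up data. The data, the two predictor columns, and the 0.5 probability threshold are assumptions for illustration; the course may expect caret's train() and confusionMatrix() instead, and results on the real file will differ.

```r
set.seed(32343)

# Made-up data: malignant samples tend to have higher scores.
n <- 200
is_malignant <- rbinom(n, 1, 0.35)
cells <- data.frame(
  Clump.Thickness = pmin(10, pmax(1, round(rnorm(n, 3 + 3 * is_malignant, 2)))),
  Bare.Nuclei     = pmin(10, pmax(1, round(rnorm(n, 2 + 4 * is_malignant, 2)))),
  Diagnosis       = factor(ifelse(is_malignant == 1, "Malignant", "Benign"),
                           levels = c("Benign", "Malignant"))
)

train_idx <- sample(n, size = round(0.7 * n))
train_set <- cells[train_idx, ]
test_set  <- cells[-train_idx, ]

# glm() models the probability of the second factor level ("Malignant").
fit <- glm(Diagnosis ~ ., data = train_set, family = binomial)

# Predicted probabilities, thresholded at 0.5 to get class labels.
prob <- predict(fit, newdata = test_set, type = "response")
pred <- factor(ifelse(prob > 0.5, "Malignant", "Benign"),
               levels = levels(cells$Diagnosis))

# Rows are predictions, columns are the true labels.
conf_mat <- table(Predicted = pred, Actual = test_set$Diagnosis)
print(conf_mat)
```

The four cells of conf_mat correspond to the counts asked for in questions 10 to 13.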

6. What is the accuracy of this model? (Enter a number only, no rounding) (1 point) Answer:

7. What is the precision of this model? (Enter a number only, no rounding) (1 point) Answer:

8. What is the recall of this model? (Enter a number only, no rounding) (1 point)

Answer:

9. What is the balanced accuracy of this model? (Enter a number only, no rounding) (1 point) Answer:

10. How many cases in the test data did this model correctly predict as benign? (Enter only a number, no rounding) (1 point) Answer:

11. How many cases in the test data did this model correctly predict as malignant? (Enter only a number, no rounding) (1 point) Answer:

12. How many cases in the test data did this model incorrectly predict as benign (i.e. how many false negatives)? (Enter only a number, no rounding) (1 point) Answer:

13. How many cases in the test data did this model incorrectly predict as malignant (i.e. how many false positives)? (Enter only a number, no rounding) (1 point) Answer:
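The metrics in questions 6 to 9 all follow from the four counts in questions 10 to 13. The sketch below computes them by hand from hypothetical counts (not results from the assignment's data), with 'Malignant' as the positive class. caret's confusionMatrix() reports the same quantities, with recall labelled Sensitivity and precision labelled Pos Pred Value.

```r
# Hypothetical confusion-matrix counts with "Malignant" as the positive class.
tp <- 60   # malignant predicted as malignant
fn <- 5    # malignant predicted as benign (false negatives)
fp <- 3    # benign predicted as malignant (false positives)
tn <- 132  # benign predicted as benign

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)  # also called positive predictive value
recall      <- tp / (tp + fn)  # also called sensitivity
specificity <- tn / (tn + fp)
balanced_accuracy <- (recall + specificity) / 2

print(c(accuracy = accuracy, precision = precision, recall = recall,
        balanced_accuracy = balanced_accuracy))
```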

MODEL 2 - Naive Bayes: Construct a Naive Bayes model to predict the dependent variable identified above from the independent variables you identified (in questions 1 and 2 above). DO NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your model on the training data and test/evaluate its performance using the test data. When evaluating your results, make sure to set the positive class to 'Malignant'. You will use the results of this model evaluation to answer the next 8 questions.
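To show what a Naive Bayes classifier does internally, here is a hand-written Gaussian variant in base R on made-up data. This is a teaching sketch, not the course's method: the helper names fit_gaussian_nb and predict_gaussian_nb are hypothetical, the Gaussian likelihood is an assumption, and the assignment most likely expects a packaged implementation trained through caret.

```r
set.seed(32343)

# Made-up data in the same spirit as the assignment's columns.
n <- 200
is_malignant <- rbinom(n, 1, 0.35)
cells <- data.frame(
  Clump.Thickness = rnorm(n, 3 + 3 * is_malignant, 2),
  Bare.Nuclei     = rnorm(n, 2 + 4 * is_malignant, 2),
  Diagnosis       = factor(ifelse(is_malignant == 1, "Malignant", "Benign"),
                           levels = c("Benign", "Malignant"))
)
train_idx <- sample(n, size = round(0.7 * n))
train_set <- cells[train_idx, ]
test_set  <- cells[-train_idx, ]
features  <- setdiff(names(cells), "Diagnosis")

# "Training": per-class prior plus per-feature mean and standard deviation.
fit_gaussian_nb <- function(x, y) {
  lapply(split(x, y), function(rows) {
    list(prior = nrow(rows) / nrow(x),
         mean  = sapply(rows, mean),
         sd    = sapply(rows, sd))
  })
}

# Prediction: log prior plus summed per-feature log densities; pick the max.
predict_gaussian_nb <- function(model, x) {
  scores <- sapply(model, function(cls) {
    log_lik <- Map(function(col, m, s) dnorm(col, m, s, log = TRUE),
                   x, cls$mean, cls$sd)
    log(cls$prior) + Reduce(`+`, log_lik)
  })
  factor(colnames(scores)[max.col(scores, ties.method = "first")],
         levels = names(model))
}

model <- fit_gaussian_nb(train_set[features], train_set$Diagnosis)
pred  <- predict_gaussian_nb(model, test_set[features])

conf_mat <- table(Predicted = pred, Actual = test_set$Diagnosis)
print(conf_mat)
```

The "naive" part is the sum over features: each feature contributes its log density independently of the others, given the class.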

14. What is the accuracy of this model? (Enter a number only, no rounding) (1 point)

Answer:

15. What is the precision of this model? (Enter a number only, no rounding) (1 point) Answer:

16. What is the recall of this model? (Enter a number only, no rounding) (1 point) Answer:

17. What is the balanced accuracy of this model? (Enter a number only, no rounding) (1 point) Answer:

18. How many cases in the test data did this model correctly predict as benign? (Enter only a number, no rounding) (1 point) Answer:

19. How many cases in the test data did this model correctly predict as malignant? (Enter only a number, no rounding) (1 point) Answer:

20. How many cases in the test data did this model incorrectly predict as benign (i.e. how many false negatives)? (Enter only a number, no rounding) (1 point) Answer:

21. How many cases in the test data did this model incorrectly predict as malignant (i.e. how many false positives)? (Enter only a number, no rounding) (1 point) Answer:

MODEL 3 - Random Forest: Construct a Random Forest model to predict the dependent variable identified above from the independent variables you identified (in questions 1 and 2 above). DO NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your model on the training data and test/evaluate its performance using the test data. When evaluating your results, make sure to set the positive class to 'Malignant'. You will use the results of this model evaluation to answer the next 8 questions.
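A random forest fit could be sketched as below, assuming the randomForest package is installed (the code skips the fit otherwise). The data are made up and ntree = 500 is an illustrative choice; the course may expect the model to be trained through caret's train() instead. Because the method is randomized, the seed must be set before splitting and fitting.

```r
set.seed(32343)

# Made-up data: malignant samples tend to have higher scores.
n <- 200
is_malignant <- rbinom(n, 1, 0.35)
cells <- data.frame(
  Clump.Thickness = rnorm(n, 3 + 3 * is_malignant, 2),
  Bare.Nuclei     = rnorm(n, 2 + 4 * is_malignant, 2),
  Diagnosis       = factor(ifelse(is_malignant == 1, "Malignant", "Benign"),
                           levels = c("Benign", "Malignant"))
)
train_idx <- sample(n, size = round(0.7 * n))
train_set <- cells[train_idx, ]
test_set  <- cells[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # A factor outcome makes randomForest() fit a classification forest.
  fit  <- randomForest::randomForest(Diagnosis ~ ., data = train_set,
                                     ntree = 500)
  pred <- predict(fit, newdata = test_set)  # predicted class labels
  print(table(Predicted = pred, Actual = test_set$Diagnosis))
} else {
  message("randomForest is not installed; skipping the fit.")
}
```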

22. What is the accuracy of this model? (Enter a number only, no rounding) (1 point) Answer:

23. What is the precision of this model? (Enter a number only, no rounding) (1 point) Answer:

24. What is the recall of this model? (Enter a number only, no rounding) (1 point) Answer:

25. What is the balanced accuracy of this model? (Enter a number only, no rounding) (1 point)

Answer:

26. How many cases in the test data did this model correctly predict as benign? (Enter only a number, no rounding) (1 point)

Answer:

27. How many cases in the test data did this model correctly predict as malignant? (Enter only a number, no rounding) (1 point)

Answer:

28. How many cases in the test data did this model incorrectly predict as benign (i.e. how many false negatives)? (Enter only a number, no rounding) (1 point)

Answer:

29. How many cases in the test data did this model incorrectly predict as malignant (i.e. how many false positives)? (Enter only a number, no rounding) (1 point)

Answer:

Model Comparison: Use the results from your three models above to answer the remaining questions.

30. Based on the three models you constructed and the results, can we make good predictions about a tissue sample being breast cancer or normal? (1 point)

No, accuracy was below 60% in all models
No, balanced accuracy was below 60% for all models
No, accuracy, precision, recall, and balanced accuracy were below 60% for all models
This depends on the model, in some cases accuracy and precision were below 60%
Yes, accuracy was above 90% in all cases
Yes, accuracy, precision, recall, and balanced accuracy were above 90% in all models

31. Suppose we want to select the model with the least chance of missing a case of breast cancer (i.e. missing a positive instance). Which metric should we use to compare models? (1 point)

Prevalence
Specificity
Detection Prevalence
Sensitivity
Balanced Accuracy
Neg Pred Value
Pos Pred Value
Detection Rate
Kappa

32. Suppose we want to select the model in which we can have the greatest certainty that if the predicted outcome is positive (i.e. predicted as malignant), this is in fact correct. Which metric should we use to compare models? (1 point)

Sensitivity
Prevalence
Detection Rate
Detection Prevalence
Pos Pred Value
Balanced Accuracy
Kappa
Specificity
Neg Pred Value

33. By the criteria chosen above, which model would be the worst? (1 point)

Logistic Regression
Naive Bayes
Random Forest
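Once each model has a confusion matrix, a comparison like the one in questions 31 to 33 can be tabulated as follows. The counts are hypothetical and are not results from the assignment's data. The sketch assumes the two criteria are read as sensitivity (the share of truly malignant cases that were caught) and positive predictive value (the share of malignant predictions that were right).

```r
# Hypothetical counts per model, with "Malignant" as the positive class.
results <- data.frame(
  model = c("Logistic Regression", "Naive Bayes", "Random Forest"),
  tp = c(60, 63, 62),  # true positives
  fn = c(5, 2, 3),     # false negatives (missed malignant cases)
  fp = c(3, 8, 4)      # false positives (benign flagged as malignant)
)

# Sensitivity: share of truly malignant cases the model caught.
results$sensitivity <- results$tp / (results$tp + results$fn)

# Pos Pred Value: share of malignant predictions that were right.
results$pos_pred_value <- results$tp / (results$tp + results$fp)

print(results)

# The worst model under each criterion is the one with the lowest value.
worst_by_sensitivity <- results$model[which.min(results$sensitivity)]
worst_by_ppv         <- results$model[which.min(results$pos_pred_value)]
print(c(sensitivity = as.character(worst_by_sensitivity),
        pos_pred_value = as.character(worst_by_ppv)))
```

With these made-up counts the two criteria disagree about the worst model, which is why the metric has to be chosen before the models are ranked.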