Question

covid_19_dataset.csv

test_date: The date on which the person received the COVID-19 test.

cough: binary variable which equals 1 if the person has a cough.

fever: binary variable which equals 1 if the person has a fever.

sore_throat: binary variable which equals 1 if the person has a sore throat.

shortness_of_breath: binary variable which equals 1 if the person has stated that they are having shortness of breath.

corona_result: variable which equals positive if the test came back positive, negative if the test came back negative, and other if the result was inconclusive.

age_60_and_above: binary variable which equals No or Yes.

gender: The dataset includes a self-reported value of male or female.

Please answer the following questions:

1. Initial Questions:

(a) How many people in this dataset tested positive for COVID-19? How many tested negative for COVID-19? Offer a possible explanation for the large difference between these numbers.

(b) In preparation for our analysis, create a new dataset which removes any observations which satisfy corona_result = other. For the remaining observations, convert corona_result into a numeric variable that equals 1 if the person tested positive and 0 otherwise. Finally, remove any observations with missing values for age_60_and_above and gender.

(c) Randomly split the data into a train and test set, with approximately 90% of the data in the train set. Make sure that the train and test sets preserve the relative ratio of positive to negative cases. Hint: Use the sample.split() function from the caTools library.
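To illustrate what "preserving the relative ratio of positive to negative cases" means in part (c), here is a minimal Python sketch of a stratified split using only the standard library. It is not the sample.split() function the hint refers to; the function name stratified_split, the seed, and the toy labels are all hypothetical and only show the idea of splitting each class separately.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.9, seed=0):
    """Return (train_idx, test_idx), splitting each class at ~train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])           # first ~90% of this class
        test_idx.extend(indices[cut:])            # remaining ~10% of this class
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels (not the real dataset): 100 positives, 900 negatives.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)

train_pos_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_pos_share, test_pos_share)
```

Because each class is shuffled and cut separately, both sets end up with roughly the same share of positive cases as the full data.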

2. Logistic Regression:

(a) Build a logistic regression model from the training set using the glm() function to predict whether a person is positive for COVID-19.

(b) Report the confusion matrix of your logistic regression model on the train set when the threshold is set to 0.5. Compute the accuracy, true positive rate, and false positive rate for the model.

(c) Report the confusion matrix of your logistic regression model on the test set when the threshold is set to 0.5. Compute the accuracy, true positive rate, and false positive rate for the model.

(d) In general, we say that a model is overfitting if the accuracy of the model on the train set is significantly higher than the accuracy of the model on the test set. Based on your answers to parts (b) and (c), do you feel that the logistic regression model is overfitting the data?

(e) Do you believe that this model would be useful in real life? Answer this question by considering your estimates of the true positive rate and false positive rates of the model on unseen data (i.e., the true positive rate and false positive rate that you computed on the test set in part (c)).

(f) Plot the ROC curve of your logistic regression model on the test set using the ROCR library.

(g) Propose a threshold value which would ensure that your logistic regression model has a true positive rate of at least 50% on the test set. Hint: This can be found by experimenting with the threshold parameter.

(h) Using the coefficients of your logistic regression model, answer the following questions:

(i) Holding all other independent variables constant, how much would having shortness of breath multiply the odds of testing positive for COVID-19?

(ii) Holding all other independent variables constant, how much would having a headache multiply the odds of testing positive for COVID-19?

(iii) Holding all other independent variables constant, how much would being over the age of 60 multiply the odds of testing positive for COVID-19? Based on this number, what can we conclude about the relevance of a person's age in predicting whether or not they have COVID-19?
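Part (h) relies on a standard property of logistic regression: with all other variables held constant, switching a binary variable from 0 to 1 multiplies the odds by exp(coefficient). The Python sketch below checks that property numerically. The intercept and coefficient values are hypothetical and are not estimates from this dataset.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_multiplier(coefficient):
    """Factor by which the odds change when a binary variable goes 0 -> 1."""
    return math.exp(coefficient)

# Hypothetical values for illustration only.
intercept = -3.0
beta = 0.7

odds_without = odds(logistic(intercept))          # variable equals 0
odds_with = odds(logistic(intercept + beta))      # variable equals 1
print(odds_with / odds_without, odds_multiplier(beta))
```

The two printed numbers agree up to floating-point error, which is why the answer to each sub-question can be read off by exponentiating the corresponding fitted coefficient. A multiplier close to 1 means the variable barely changes the odds.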

3. Decision Tree:

(a) Use the rpart and rpart.plot libraries to train a classification tree using the train set to predict whether a person has COVID-19. Provide the plot of the decision tree.

(b) We recall one of the benefits of classification trees is that they are interpretable. Based on your plot, explain (in words) how the tree determines whether someone has COVID-19. What independent variables does the tree reveal are most important in accurately predicting whether someone has COVID-19?

(c) Report the confusion matrix of your classification tree on the test set when the threshold is set to 0.5. Compute the accuracy, true positive rate, and false positive rate for the tree.

(d) Use 5-fold cross-validation to find a choice for the complexity parameter (cp) that maximizes the accuracy of the model. This can be performed using the caret library. Report the best choice of the cp parameter.
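Parts 2(b), 2(c), 2(g) and 3(c) all ask for a confusion matrix at a given threshold, along with accuracy, true positive rate, and false positive rate. The Python sketch below shows that arithmetic on made-up data. The labels and predicted probabilities are hypothetical, and treating a probability equal to the threshold as a positive prediction is an assumed convention.

```python
def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tp, fp, tn, fn); predict positive when prob >= threshold (assumed)."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (accuracy, true positive rate, false positive rate)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return accuracy, tpr, fpr

def largest_threshold_with_tpr(y_true, probs, min_tpr=0.5):
    """Scan observed probabilities from high to low; return the first meeting min_tpr."""
    for t in sorted(set(probs), reverse=True):
        _, tpr, _ = rates(*confusion_counts(y_true, probs, t))
        if tpr >= min_tpr:
            return t
    return None

# Hypothetical labels and predicted probabilities, not model output.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.3, 0.2]

print(confusion_counts(y_true, probs, 0.5), rates(*confusion_counts(y_true, probs, 0.5)))
print(largest_threshold_with_tpr(y_true, probs, 0.5))
```

On this toy data the threshold of 0.5 catches only one of the three positives, while lowering the threshold raises the true positive rate, which is the kind of experiment part 2(g) describes.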

4. Concluding Questions:

(a) All else being equal, we generally would like to train binary classification models that have a high true positive rate (close to 1) and a low false positive rate (close to 0). When evaluating your models in this assignment, however, you likely found that the true positive rates of your models were typically quite poor when using a threshold of 0.5. Why do you think this was the case?

(b) (Extra credit) The binary classification models you built predicted the probabilities that a person has COVID-19. However, the predicted probability of someone having COVID-19 ought to take into account the proportion of people in the community that actually have COVID-19. For example, the models that you created would be very inaccurate if applied to Australia (where the COVID-19 rate is essentially zero). Propose a modification to your model/analysis which can be used to form accurate predictions in a community in which the COVID-19 rate is currently equal to 10%.
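One common way to approach part (b), offered here only as a sketch and not as the expected answer, is to rescale each predicted probability by the ratio of the target prevalence to the training prevalence. This assumes that symptoms are distributed the same way among infected and uninfected people in both communities, so that only the base rate differs. The function name and the training prevalence of 5% below are hypothetical.

```python
def adjust_for_prevalence(p, train_prevalence, target_prevalence):
    """Re-weight a predicted probability p for a community with a different base rate.

    Assumes the distribution of symptoms given infection status is unchanged.
    """
    pos = p * target_prevalence / train_prevalence
    neg = (1.0 - p) * (1.0 - target_prevalence) / (1.0 - train_prevalence)
    return pos / (pos + neg)

# Hypothetical numbers: model trained where 5% tested positive,
# applied in a community where the rate is 10%.
train_prevalence = 0.05
target_prevalence = 0.10

for p in (0.05, 0.30, 0.80):
    print(p, adjust_for_prevalence(p, train_prevalence, target_prevalence))
```

A prediction equal to the training base rate maps to the target base rate, and every prediction moves upward when the target prevalence is higher. For a logistic regression model this is equivalent to shifting the intercept by the difference in log-odds between the two prevalences.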
