Question:
In Exercise 9.21, we encountered a data set where we applied logistic regression to aid in spam classification for individual emails. In this exercise, we've taken a small set of these variables and fit a formal model with the following output:
(a) Write down the model using the coefficients from the model fit.
(b) Suppose we have an observation where to_multiple = 0, winner = 1, format = 0, and re_subj = 0. What is the predicted probability that this message is spam?
(c) Put yourself in the shoes of a data scientist working on a spam filter. For a given message, how high must the probability a message is spam be before you think it would be reasonable to put it in a spambox (which the user is unlikely to check)? What tradeoffs might you consider? Any ideas about how you might make your spam-filtering system even better from the perspective of someone using your email service?
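For part (b), the predicted probability comes from inverting the logit of the fitted model. The sketch below shows only the arithmetic. The model output is not reproduced here, so the intercept and coefficients are hypothetical placeholders, and the function name is an illustrative choice rather than anything from the textbook.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + sum of b_j * x_j)))."""
    eta = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, NOT the values from the exercise's model output.
intercept = -1.0
coefs = {"to_multiple": -2.0, "winner": 1.5, "format": -1.0, "re_subj": -3.0}

# The observation described in part (b).
x = {"to_multiple": 0, "winner": 1, "format": 0, "re_subj": 0}

p = predicted_probability(intercept, coefs, x)
print(round(p, 3))
```

With the real coefficients substituted in, the same function gives the probability asked for in part (b). Because only winner is nonzero in this observation, the linear predictor reduces to the intercept plus the winner coefficient.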
Data from Exercise 9.21
Spam filters are built on principles similar to those used in logistic regression. We fit a probability that each message is spam or not spam. We have several email variables for this problem: to_multiple, cc, attach, dollar, winner, inherit, password, format, re_subj, exclaim_subj, and sent_email. We won't describe what each variable means here for the sake of brevity, but each is either a numerical or indicator variable.
For variable selection, we fit the full model, which includes all variables, and then we also fit each model where we've dropped exactly one of the variables. In each of these reduced models, the AIC value for the model is reported below. Based on these results, which variable, if any, should we drop as part of model selection? Explain.
Consider the following model selection stage. Here again we've computed the AIC for each leave-one-variable-out model. Based on the results, which variable, if any, should we drop as part of model selection? Explain.
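One common way to read a leave-one-variable-out AIC table is to drop the variable whose removal gives the lowest AIC, provided that AIC is below the full model's. The sketch below assumes that rule. The AIC values are hypothetical, since the tables are not reproduced here, and the function name is illustrative.

```python
def variable_to_drop(full_aic, aic_without):
    """Return the variable whose removal yields the lowest AIC,
    or None if no reduced model has a lower AIC than the full model."""
    best_var = min(aic_without, key=aic_without.get)
    return best_var if aic_without[best_var] < full_aic else None

# Hypothetical AIC values, NOT the ones reported in the exercise.
full_aic = 1000.0
aic_without = {"cc": 998.5, "dollar": 1004.2, "winner": 1012.9}

print(variable_to_drop(full_aic, aic_without))
```

If every reduced model has a higher AIC than the full model, the function returns None, which corresponds to keeping all variables at that stage.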
Step by Step Answer:
OpenIntro Statistics
ISBN: 9781943450077
4th Edition
Authors: David Diez, Mine Çetinkaya-Rundel, Christopher Barr