Question:

Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett–Packard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the purpose of finding a classifier that can separate e-mail messages that are spam vs. nonspam (a.k.a. “ham”). The spam concept is diverse: it includes advertisements for products or websites, “make money fast” schemes, chain letters, pornography, and so on. The definition used here is “unsolicited commercial e-mail.” The file Spambase.csv contains information on 4601 e-mail messages, among which 1813 are tagged “spam.” The predictors include 57 attributes; most of them are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the e-mail. A few predictors are related to the number and length of capitalized words. (Tip: While importing the data in RapidMiner, uncheck the box for “Skip Comments,” so attributes following any attribute name ending with the # sign will also be imported correctly.)

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam e-mails by comparing the spam-class average and nonspam-class average. Which are the 11 predictors that appear to vary the most between spam and nonspam e-mails? From these 11, which words or signs occur more often in spam? (Hint: Convert the target attribute to binominal type. Then, consider using the Aggregate and Transpose operators to compute class averages grouped by Spam. For the transposed data, rename the auto-generated attribute names as spam and nonspam to be meaningful, and then filter out the top extra row added by RapidMiner consisting of the class labels. Then, use the Guess Types operator to parse the averages as numeric data type. Lastly, use the Generate Attributes operator to compute the absolute difference between the class averages.)
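For readers working outside RapidMiner, the class-average comparison in part (a) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `rank_by_class_gap` is hypothetical, the target column is assumed to be called `Spam` (as the hint suggests) and coded 1 for spam and 0 for nonspam, and the sample rows are made up rather than taken from Spambase.csv.

```python
from collections import defaultdict


def rank_by_class_gap(rows, target="Spam", top_n=11):
    """Rank predictors by |spam-class mean - nonspam-class mean|.

    rows: iterable of dicts (for example from csv.DictReader).
    Assumption: the target column holds 1 for spam and 0 for nonspam.
    Returns tuples of (name, spam_mean, nonspam_mean, abs_difference).
    """
    sums = {0: defaultdict(float), 1: defaultdict(float)}
    counts = {0: 0, 1: 0}
    for row in rows:
        label = int(float(row[target]))
        counts[label] += 1
        for name, value in row.items():
            if name != target:
                sums[label][name] += float(value)

    gaps = []
    for name in sums[1]:
        spam_mean = sums[1][name] / counts[1]
        nonspam_mean = sums[0][name] / counts[0]
        gaps.append((name, spam_mean, nonspam_mean,
                     abs(spam_mean - nonspam_mean)))

    # Largest absolute difference first.
    gaps.sort(key=lambda g: g[3], reverse=True)
    return gaps[:top_n]


# Made-up rows, only to show the call (not real Spambase values).
sample = [
    {"free": "2.0", "george": "0.0", "Spam": "1"},
    {"free": "1.0", "george": "0.0", "Spam": "1"},
    {"free": "0.0", "george": "3.0", "Spam": "0"},
    {"free": "0.5", "george": "1.0", "Spam": "0"},
]
top = rank_by_class_gap(sample, top_n=2)
for name, spam_mean, nonspam_mean, gap in top:
    print(name, spam_mean, nonspam_mean, gap)
```

Among the returned predictors, those whose spam mean exceeds the nonspam mean are the words or signs that occur more often in spam.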

b. Partition the data into training and holdout sets and then perform a discriminant analysis on the training data using only the 11 predictors.

c. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix and lift chart for the holdout set for the evaluation.
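The two evaluation tools named in part (c) can be sketched as follows. This is a minimal illustration, not RapidMiner's implementation: the helper names `confusion_counts` and `cumulative_lift` are hypothetical, spam is assumed to be coded 1 and nonspam 0, and the holdout labels and scores below are invented for the example.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN), treating spam (1) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def cumulative_lift(actual, scores):
    """Cumulative lift after each record, sorted by descending spam score.

    Lift at rank k = (spam rate among the top k records) / (overall spam rate).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    base_rate = sum(actual) / len(actual)
    lifts, hits = [], 0
    for rank, i in enumerate(order, start=1):
        hits += actual[i]
        lifts.append((hits / rank) / base_rate)
    return lifts


# Invented holdout labels and model scores, only to show the calls.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

tp, fp, fn, tn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # share of spam messages that were caught
lifts = cumulative_lift(actual, scores)
```

When the main interest is detecting spam, sensitivity and the lift among the highest-scored messages are the quantities to look at; the lift always returns to 1 once every record has been included.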

d. In the sample, almost 40% of the e-mail messages were tagged as spam. However, suppose that the actual proportion of spam messages in these e-mail accounts is 10%. Compute the intercept of the decision function to account for this information, assuming the original decision function intercept to be −2.124.
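One common way to account for a different prior in a two-class discriminant model is to shift the intercept by the change in log prior odds. The sketch below assumes that a positive decision score means "spam" and that natural logarithms are used; the function name `adjust_intercept_for_priors` is hypothetical. Whether the original intercept of −2.124 was fitted with equal priors or with the sample proportion depends on the software settings, so both cases are shown and neither should be read as the definitive answer.

```python
import math


def adjust_intercept_for_priors(intercept, new_prior, baked_in_prior=0.5):
    """Shift a two-class decision-function intercept to a new spam prior.

    Assumptions: score > 0 classifies as spam, and the original intercept
    already contains log(baked_in_prior / (1 - baked_in_prior)).
    With baked_in_prior=0.5 that term is zero (equal priors).
    """
    def log_odds(p):
        return math.log(p / (1.0 - p))

    return intercept - log_odds(baked_in_prior) + log_odds(new_prior)


ORIGINAL_INTERCEPT = -2.124      # given in the exercise
SAMPLE_SPAM_RATE = 1813 / 4601   # full-sample proportion from the exercise
ASSUMED_TRUE_RATE = 0.10         # proportion stated in part (d)

# Case 1: the original intercept was fitted with equal priors.
from_equal = adjust_intercept_for_priors(ORIGINAL_INTERCEPT, ASSUMED_TRUE_RATE)

# Case 2: the original intercept already reflects the sample proportion.
from_sample = adjust_intercept_for_priors(
    ORIGINAL_INTERCEPT, ASSUMED_TRUE_RATE, baked_in_prior=SAMPLE_SPAM_RATE
)

print(round(from_equal, 3), round(from_sample, 3))
```

In either case the intercept decreases, which makes the model less willing to label a message as spam, as expected when spam is rarer than the sample suggests. If the model was fitted on a training partition, its spam proportion may differ slightly from the full-sample figure used in case 2.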

e. A spam filter that is based on your model is used, so that only messages that are classified as nonspam are delivered, while messages that are classified as spam are quarantined. In this case, misclassifying a nonspam e-mail (as spam) has much heftier consequences. Suppose that the cost of quarantining a nonspam e-mail is 20 times that of not detecting a spam message. Compute the intercept of the decision function to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion and the original decision function has the intercept −2.124).
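Unequal misclassification costs can be handled in a similar way, by adding the log of the cost ratio to the intercept. The sketch below assumes that a positive decision score means "spam", that natural logarithms are used, and that the priors need no further change (as part (e) states). The function name `adjust_intercept_for_costs` is hypothetical, and only the 20:1 cost ratio comes from the exercise; the individual cost values are arbitrary units.

```python
import math


def adjust_intercept_for_costs(intercept, cost_missed_spam, cost_blocked_nonspam):
    """Shift a two-class decision-function intercept for unequal costs.

    Assumption: score > 0 classifies as spam. Making it more expensive to
    quarantine a nonspam message lowers the intercept, so a message needs
    stronger evidence before it is labeled spam.
    """
    return intercept + math.log(cost_missed_spam / cost_blocked_nonspam)


ORIGINAL_INTERCEPT = -2.124  # given in the exercise

# Quarantining a nonspam message costs 20 times as much as missing a spam.
adjusted = adjust_intercept_for_costs(
    ORIGINAL_INTERCEPT, cost_missed_spam=1.0, cost_blocked_nonspam=20.0
)
print(round(adjusted, 3))
```

Only the ratio of the two costs matters here, so scaling both costs by the same factor leaves the adjusted intercept unchanged.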

Related Book:

Machine Learning For Business Analytics

ISBN: 9781119828792

1st Edition

Authors: Galit Shmueli, Peter C. Bruce, Amit V. Deokar, Nitin R. Patel
