Answered step by step
Verified Expert Solution
Question
1 Approved Answer
The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54
- The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc), counts for capitalized characters, etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (spam column is 1 for spam, 0 for important emails).
- You have to predict the probability that a message is spam or not.
- 1) Partition the data into a training set (with 70% of the observations), and a testing set (with 30% of the observations) using the random state of 12345 for cross-validation.
- 2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?)
- 3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers.
- 4) Based on the results of k-nearest neighbor, and logistic regression, what is the best model to classify the data? Provide an explanation to support your argument
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started