Answered step by step
Verified Expert Solution
Question
1 Approved Answer
The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54
The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc), counts for capitalized characters etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (spam column is 1 for spam, O for important emails). You have to predict the probability that a message is spam or not. 1) Partition the data into a training set (with 70% of the observations), and testing set (with 30% of the observations) using the random state of 12345 for cross validation. (10 points) 2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?) (15 points) 3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers. (15 points) 4) Based on the results of k-nearest neighbor, and logistic regression, what is the best model to classify the data? Provide explanation to support your argument. (10 points) A
Step by Step Solution
There are 3 Steps involved in it
Step: 1
1 Partitioning the data into a training set and testing set python import pandas as pd from sklearnm...Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started