Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords

The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc), counts for capitalized characters etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (spam column is 1 for spam, 0 for important emails). You have to predict the probability that a message is spam or not.

1) Partition the data into a training set (with 70% of the observations), and testing set (with 30% of the observations) using the random state of 12345 for cross validation.

2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the best k?)

3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers.

4) Based on the results of k-nearest neighbor, and logistic regression, what is the best model to classify the data? Provide explanation to support your argument.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Spomenik Monument Database

Authors: Donald Niebyl, FUEL, Damon Murray, Stephen Sorrell

1st Edition

0995745536, 978-0995745537

More Books

Students also viewed these Databases questions

Question

3 How supply and demand together determine market equilibrium.

Answered: 1 week ago

Question

Explain the function and purpose of the Job Level Table.

Answered: 1 week ago