
Question



Naive Bayes Classifier Overview:

In this assignment we will classify an email as spam or legit. Specifically, emails from Enron that have been made publicly available have been recoded into a term-document matrix that shows each email and the words that appear in it (something we'll learn more about in text mining). I have provided you code to get started. Please follow the directions below (and the code) to create a naive Bayes classifier model that predicts whether an email is legit or spam.

1. Download email.csv from Canvas and import it into a dataset named email.
2. Review the term-document matrix. How many columns does it have?
3. Run the following code to see the top words used in the spam emails:

    spam_df = email_df.loc[email_df['message_label'] == 'spam']
    spam_totals = spam_df.groupby('message_label').sum()
    spam_totals = spam_totals.drop('message_index', axis=1)
    spam_totals.T.sort_values(by='spam', ascending=False).head(10)

4. Modify the code from step 3 to identify the top words in legit emails. What is the top appearing word in a spam email? What is the top appearing word in a legit email?
5. Convert the dependent variable message_label to a 1/0 categorical outcome.
6. Run the following code to transform the binary classification into categorical variables:

    word_list = email_df.columns
    for col in [word_list]:
        email_df[col] = email_df[col].astype('category')

7. Split the data into 75% training and 25% validation using random_state = 2 and stratify = y.
8. Is there a proportion imbalance?
9. Create a naive Bayes classifier to predict message_label on the training set.
10. Using the predict function and the confusion matrix/classification report, what is the overall accuracy of the model on the training and validation sets?
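The steps above can be sketched roughly as follows. This is a minimal illustration, not the assignment's reference solution: email.csv is not available here, so the block builds a small synthetic term-document matrix (the word columns `free`, `meeting`, and `money` and all values are made up). It also assumes the word columns are 0/1 presence indicators and uses scikit-learn's `BernoulliNB`, which is one reasonable choice for such features; the course may intend a different naive Bayes variant such as `CategoricalNB`. The category cast from step 6 is skipped because `BernoulliNB` accepts the 0/1 integers directly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for email.csv (hypothetical data, NOT the Enron file).
rng = np.random.default_rng(0)
n_emails = 200
labels = np.where(rng.random(n_emails) < 0.3, "spam", "legit")
is_spam = labels == "spam"
email_df = pd.DataFrame({
    "message_index": np.arange(n_emails),
    "message_label": labels,
    # Each word column is 1 if the word appears in the email, else 0.
    "free": (rng.random(n_emails) < np.where(is_spam, 0.8, 0.1)).astype(int),
    "meeting": (rng.random(n_emails) < np.where(is_spam, 0.1, 0.7)).astype(int),
    "money": (rng.random(n_emails) < np.where(is_spam, 0.6, 0.2)).astype(int),
})

# Steps 3-4: word totals per class, sorted here by the legit column.
word_totals = email_df.drop(columns="message_index").groupby("message_label").sum()
print(word_totals.T.sort_values(by="legit", ascending=False).head(10))

# Step 5: encode the label as 1 (spam) / 0 (legit).
y = (email_df["message_label"] == "spam").astype(int)
X = email_df.drop(columns=["message_index", "message_label"])

# Step 7: 75/25 split, stratified on y so both parts keep a similar spam share.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=2, stratify=y
)

# Step 8: compare class proportions to check for imbalance.
print(y_train.value_counts(normalize=True))
print(y_valid.value_counts(normalize=True))

# Step 9: fit a naive Bayes model for binary word-presence features.
model = BernoulliNB()
model.fit(X_train, y_train)

# Step 10: accuracy on both sets, plus the validation confusion matrix.
train_acc = accuracy_score(y_train, model.predict(X_train))
valid_acc = accuracy_score(y_valid, model.predict(X_valid))
print(confusion_matrix(y_valid, model.predict(X_valid)))
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {valid_acc:.3f}")
```

With the real data, the synthetic block would be replaced by something like `pd.read_csv("email.csv")`, and `sklearn.metrics.classification_report` can be printed alongside the confusion matrix. The accuracies printed here describe only the made-up data and say nothing about the Enron results.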

