
Question

Your code will read in an email message in some standard format (we will determine that standard) and will classify whether that email is spam or non-spam. There are databases containing example spam and non-spam emails that you can download and use to train your machine learning filter to make this determination. More about that database will be provided.

The basic steps in determining whether a text file containing an email is spam or non-spam are as follows:

1. Prepare the text data by reading in the file.
2. Create the word dictionary to count the words and their frequency.
3. Extract the features from the dictionary and prepare the data for input to the machine learning function (whether for training or for testing).
4. Pick a machine learning algorithm to train on your features from step 3.
5. Finally, test your code using sample emails to see how well your software works.

Step 0: Download the Ling-spam corpus data

http://www2.aueb.gr/users/ion/data/enron-spam/

Step 1: Prepare the data

Divide the data into a training set and a test set (perhaps a 4:1 ratio is good), where the training data contains an equal number of spam and non-spam emails and the testing data contains an equal number of spam and non-spam emails.

In any text mining problem, text cleaning is the first step: we remove words from the document that may not contribute to the information we want to extract. Emails may contain many undesirable tokens such as punctuation marks, stop words, and digits, which may not be helpful in detecting spam. The emails in the Ling-spam corpus have already been preprocessed in the following ways:

a) Removal of stop words

Stop words like "and", "the", and "of" are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have been removed from the emails.
b) Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes", and "included" would all be represented as "include". Lemmatization also takes the context of the sentence into account, as opposed to stemming (another term in text mining), which does not consider the meaning of the sentence.

We still need to remove non-words such as punctuation marks and special characters from the mail documents. There are several ways to do it. Here, we will remove such words after creating a dictionary, which is convenient because each such word then needs to be removed only once. This has already been done for you!

Step 2: Create the dictionary

A sample email in the data set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

The first line of the mail is the subject and the third line contains the body of the email. We will perform text analytics only on the body to detect spam. As a first step, we need to create a dictionary of words and their frequencies. Once the dictionary is created, we can add just a few lines of code to remove the non-words discussed in Step 1.

Step 3: Extract the features

Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each email in the training set. Each word count vector contains the frequencies of the 3000 dictionary words in that training file. Most of them will be zero. Let us take an example. Suppose we have 500 words in our dictionary. Each word count vector then contains the frequencies of the 500 dictionary words in the training file.
Suppose the text in the training file was "Get the work done, work done"; then it will be encoded as [0,0,0,0,0,...,0,0,2,0,0,0,...,0,0,1,0,0,0,...,0,1,0,0,...,2,0,0,0,0,0]. Here, the nonzero word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count vector and the rest are zero.

Step 4: Train the classifiers

There are many different machine learning algorithms. We will use an industry-standard machine learning library called PyTorch: https://pytorch.org/

For this lab assignment, you will use a method called Support Vector Machines (SVM). SVMs are supervised binary classifiers that are very effective when you have a large number of features. An SVM finds a hyperplane that separates the two classes; the subset of training points that lie closest to this boundary and determine it are called the support vectors. The decision function of the SVM model, which predicts the class of the test data, is based on the support vectors and makes use of a kernel trick.

Once the classifier is trained, we can check the performance of the model on the test set. We extract the word count vector for each mail in the test set and predict its class (non-spam or spam) with the trained SVM model.
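
Steps 2 through 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the assignment's reference solution. The function names (tokenize_body, build_dictionary, extract_features, train_linear_svm, predict) and the toy emails are hypothetical. The sketch assumes the subject is any line starting with "Subject:" and that non-words are tokens that are not purely alphabetic or are a single character. The classifier is a simple linear SVM trained by subgradient descent on the hinge loss, swapped in so the example has no dependencies; it uses no kernel and is not PyTorch code.

```python
from collections import Counter


def tokenize_body(email_text):
    """Split an email into whitespace-separated tokens, skipping the subject.

    Assumption: the subject is any line that starts with "Subject:".
    """
    body = [ln for ln in email_text.splitlines() if not ln.startswith("Subject:")]
    return " ".join(body).split()


def build_dictionary(emails, max_words=3000):
    """Step 2: count words over all training emails, keep the most frequent."""
    counts = Counter()
    for text in emails:
        counts.update(tokenize_body(text))
    # Remove non-words (punctuation, digits) and single-character tokens.
    for token in list(counts):
        if not token.isalpha() or len(token) == 1:
            del counts[token]
    return [word for word, _ in counts.most_common(max_words)]


def extract_features(email_text, dictionary):
    """Step 3: word count vector with one entry per dictionary word."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokenize_body(email_text):
        if token in index:
            vector[index[token]] += 1
    return vector


def train_linear_svm(X, y, epochs=20, lr=0.1, lam=0.01):
    """Step 4 (simplified): linear SVM via subgradient descent on hinge loss.

    Labels in y must be +1 (spam) or -1 (non-spam).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1:  # inside the margin or misclassified: hinge update
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def predict(x, w, b):
    """Return +1 (spam) or -1 (non-spam) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy data invented for illustration; a real run would read the corpus files.
train_emails = [
    "Subject: a\n\nfree money offer free",
    "Subject: b\n\nmoney offer click !",
    "Subject: c\n\nproject meeting book",
    "Subject: d\n\nbook article project work",
]
train_labels = [1, 1, -1, -1]

dictionary = build_dictionary(train_emails)
X_train = [extract_features(e, dictionary) for e in train_emails]
w, b = train_linear_svm(X_train, train_labels)

test_vec = extract_features("Subject: test\n\nfree offer click", dictionary)
print(predict(test_vec, w, b))
```

On the real corpus, the same functions would be applied to the training files to build the dictionary and fit the weights, and then to the held-out test files to measure how many emails are classified correctly.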
