Question
Your code will read in an email message in some standard format (we will determine that standard) and will classify whether that email is spam or non-spam. There are databases containing example spam and non-spam emails that you can download and use to train your machine learning filter to make this determination. More about that database will be provided.

The basic steps in determining whether a text file containing an email is spam or non-spam are as follows:
1. Prepare the text data by reading in the file.
2. Create the word dictionary to count the words and their frequencies.
3. Extract the features from the dictionary and prepare the data as input to the machine learning function (whether for training or for testing).
4. Pick a machine learning algorithm to train on your features from step 3.
5. Finally, test your code using sample emails to see how well your software works.

Step 0: Download the Ling-spam corpus data
http://www2.aueb.gr/users/ion/data/enron-spam/

Step 1: Prepare the data
Divide the data into a training set and a test set (perhaps a 4:1 ratio is good), where the training data contains equal numbers of spam and non-spam emails and the test data also contains equal numbers of spam and non-spam emails.

In any text mining problem, text cleaning is the first step: we remove words from the document that may not contribute to the information we want to extract. Emails may contain a lot of undesirable content such as punctuation marks, stop words, and digits, which may not be helpful in detecting spam. The emails in the Ling-spam corpus have already been preprocessed in the following ways:

a) Removal of stop words
Stop words like "and", "the", and "of" are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have been removed from the emails.
b) Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes", and "included" would all be represented as "include". Unlike stemming (another common term in text mining), which does not consider the meaning of the sentence, lemmatization preserves the context of the sentence.

We still need to remove non-words such as punctuation marks and special characters from the mail documents. There are several ways to do this. Here, we will remove such tokens after creating a dictionary, which is convenient because each such token then needs to be removed only once. This has already been done for you!

Step 2: Create the dictionary
A sample email in the data set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

It can be seen that the first line of the mail is the subject and the third line contains the body of the email. We will perform text analytics only on the body to detect spam. As a first step, we need to create a dictionary of words and their frequencies. Once the dictionary is created, we can add just a few lines of code to remove the non-words discussed in Step 1.

Step 3: Extract the Features
Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each email in the training set. Each word count vector contains the frequencies of the 3000 dictionary words in that training file. Most of them will be zero. Let us take an example. Suppose we have 500 words in our dictionary. Each word count vector then contains the frequencies of those 500 dictionary words in the training file.
Suppose the text in the training file was "Get the work done, work done". It will then be encoded as [0,0,0,0,0,...,0,0,2,0,0,0,...,0,0,1,0,0,...,0,0,1,0,0,...,2,0,0,0,0,0]. Here, the nonzero word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count vector, and the rest are zero.

Step 4: Train the Classifiers
There are many different machine learning algorithms. We will use an industry standard machine learning library called PyTorch: https://pytorch.org/

For this lab assignment, you will use a method called Support Vector Machines (SVM). SVMs are supervised binary classifiers that are very effective when you have a large number of features. An SVM finds a hyperplane that separates the two classes of training data; the training points closest to that hyperplane, which determine its position, are called the support vectors. The decision function of the SVM model, which predicts the class of the test data, is based on the support vectors and makes use of the kernel trick.

Once the classifier is trained, we can check the performance of the model on the test set. We extract the word count vector for each mail in the test set and predict its class (non-spam or spam) with the trained SVM model.
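
To make Steps 2 to 4 concrete, here is a minimal plain-Python sketch. It is not the assignment's solution and it does not use PyTorch: the classifier is a linear SVM trained by subgradient descent on the hinge loss, swapped in so the example runs with the standard library alone. All function names, the toy emails, and the hyperparameters (epochs, lr, reg) are illustrative assumptions. Non-words are filtered during tokenization rather than after the dictionary is built, which is a simplification.

```python
import re
from collections import Counter


def email_body(raw_text):
    """Return the third line of a raw email, where the sample format puts the body."""
    lines = raw_text.splitlines()
    return lines[2] if len(lines) > 2 else ""


def tokenize(text):
    """Lowercase the text and keep alphabetic runs longer than one character."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 1]


def build_dictionary(bodies, size=3000):
    """Step 2: return up to `size` of the most frequent words across all emails."""
    counts = Counter()
    for body in bodies:
        counts.update(tokenize(body))
    return [word for word, _ in counts.most_common(size)]


def extract_features(body, vocabulary):
    """Step 3: encode one email as a word count vector aligned with `vocabulary`."""
    counts = Counter(tokenize(body))
    return [counts[word] for word in vocabulary]


def train_linear_svm(X, y, epochs=200, lr=0.01, reg=0.01):
    """Step 4 (stand-in): subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (spam) or -1 (non-spam). Returns the weights and the bias.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            violated = margin < 1  # inside the margin or misclassified
            for j in range(len(w)):
                grad = reg * w[j] - (yi * xi[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def predict(x, w, b):
    """Return +1 (spam) or -1 (non-spam) from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy data, invented for illustration only; real runs would use the corpus files.
train_emails = [
    "win free money now claim free prize",
    "free prize click now win cash",
    "meeting agenda project report attached",
    "please review project report before meeting",
]
train_labels = [1, 1, -1, -1]  # +1 = spam, -1 = non-spam

vocabulary = build_dictionary(train_emails, size=3000)
X_train = [extract_features(e, vocabulary) for e in train_emails]
w, b = train_linear_svm(X_train, train_labels)

test_emails = ["claim free cash prize now", "project meeting report"]
predictions = [predict(extract_features(e, vocabulary), w, b) for e in test_emails]
print(predictions)
```

The same three stages carry over to a PyTorch version: the word count vectors become tensors, and the hand-written update loop is replaced by an optimizer minimizing a hinge-style loss.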