Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Jul 04, 2024

1 Ob jective Construct a na ve Bayes classifier to classify email as spam or not spam (ham). A Bayesian decision rule chooses the hypothesis

1 Ob jective

Construct a na ve Bayes classifier to classify email as spam or not spam ("ham"). A Bayesian decision rule chooses the hypothesis that maximizesP(Spam|x) vsP(Spam|x) for emailx.

Use any computer language you like. I recommend using Python as it includes ready access to a number of powerful packages and libraries for natural language processing (NLP). I include a few Python 3.6 excerpts to get you started. I also describe a several tools from the NLTK (natural language toolkit) topre-processdata to improve classification.

2 Na ve(conditionallyindependent)classification

Suppose that you have a dataset{xN}. Eachxk {xN}is a separate email for this assignment. Each of theNdata pointsxk= (f1, f2, . . . , fn)Pattern space =Xwheref1, f2, . . .are calledfeatures. You extract features from each data point. Features in an email may include the list of "words" (tokens) in the message body. The goal in Bayesian classification is to compute two probabilitiesP(Spam|x) andP(Spam|x) for each email. It classifies each email as "spam" or "not spam" by choosing the hypothesis with higher probability. Na ve Bayes assumes that features forxare independent given its class.

P(Spam|x) is difficult to compute in general. Expand with the definition of conditional probabilityP(Spam|x) =P(Spamx).(1)

P(x)

Look at the denominatorP(x).P(x) equals the probability of a particular email given the universe of all possible emails. This isverydifficult to calculate. But it is just a number between 0 and 1 since it is a probability. It just "normalizes"P(Spamx). Now look at the numeratorP(Spamx). First expandxinto its features{fn}. Each feature is an event that can occur or not (i.e.the word is in an email or not). So

P(Spamx)P(Spamx)=P(Spamf1fn) (2)P(x)

=P(Spam)P(f1fn|Spam) (3) Apply the multiplication theorem (HW2, 1.c) to the second term to give

P(f1fn|Spam)=P(Spam)P(f1|Spam)P(f2|Spamf1)P(fn|Spamfn1f2f1).(4) But now you are still stuck computing a big product of complicated conditional probabilities. Na ve Bayes

classification makes anassumptionthat features are conditionally independent. This means that

P(fj|fkSpam) =P(fj|Spam) (5)

ifj=k. This means that the probability you observe one feature (i.e.word) is independent of observing another word given the email is spam. This is ana veassumption and weakens your model. But you can now simplify the above to

k=1

(6)

Therefore

and similarly

P(Spam|x) =P(Spam)?nk=1P(fk|Spam)(7)P(x)

P(Spam|x) =P(Spam)?nk=1P(fk|Spam)(8)P(x)

You can ignore theP(x) normalizing term since you only care which probability is larger and it is the same in both cases. This leads to the na ve Bayesian rule (called themaximum a posteriori (MAP) estimator) between the two hypotheses{Hk}(e.g.{Spam,Spam}):

HMAP=argmaxP(Hk)?P(fk|Hk).(9)

k=1

This says to calculate the conditional probabilities for each word in your email assuming the email is spam

and then again assuming it is not spam. Then choose the hypothesis with higher conditional probability.

3 Implementation

There is a upside to corporate scandal. During prosecution for the Enron accounting scandal (https: //en.wikipedia.org/wiki/Enron_scandal) plaintiffs were compelled to disclose a massive historical trove of all emails they sent and received. Researchers read through and categorized a subset of these messages as spam and ham. Download one of the curated "pre-processed" files from:http://nlp.cs.aueb.gr/ software_and_datasets/Enron-Spam/. Each archive contains two folders: "spam" and "ham". Each of folder contains thousands emails each stored in its own file. Open several files in the directories and familiarize yourself with the data.

The next sections step you through one possible way to implement your na ve Bayes classifier. You may use the code blocks as provided or supplement them with your own improvements.

3.1 Read the Emails

The following code block read the raw emails from the extracted Enron archive. It prepares a single list with all the emails and a "tag" to track your classifiers accuracy. It requires the path to the root Enron data folder (BASE DATA DIR).

defload_data(dir):

list= []

for file inos.listdir(dir):

withopen(dir+ '/' +file, 'rb') as f:

body = f.read().decode('utf-8', errors='ignore').splitlines()list.append(' '.join(body))

return list

BASE_DATA_DIR=''# fill me in

 # load and tag data

ham = [(text, 'ham')fortextinload_data(BASE_DATA_DIR + '/ham')] spam = [(text, 'spam')fortextinload_data(BASE_DATA_DIR +

'/spam')]

all= ham + spam'''

 when complete: all = [ (ham_email_1_str, 'ham'), (ham_email_2_str, 'ham'), ... (spam_email_1_str, 'spam'), (spam_email_2_str, 'spam') ...

] '''

3.2 Pre-Processing

The raw emails are not ideal for classification. Consider applying each of the following operations in a "pre-processing" step to prepare the data. You can wrap all the functions in a single function as shown below.

Operations

defpreprocess(text):

# add all preprocessing here, # ...

# ...

 # then return a list of tokens (words)

returntokens

all= [(preprocess(text), label)for(text,label)in all]

normalizePython is case-sensitive. This means that your classifier will treat differing cases (e.g.Tomorrowandtomorrow) as different words. Try pre-processing your entire message to lower case (text = text.lower()) and check if that improves accuracy.

tokenizeThe python natural language toolkit (NLTK) provides useful string processing utilities. Tokeniz- ing splits a long string into tokens (often called "words"). You can also tell the tokenizer to remove tokens that don't look like words (such as numbers and punctuation). To get a list of tokens from a text string you can use the NLTK tokenizer like

fromnltk.tokenizeimportRegexpTokenizer# only keep alphabet words

tokenizer = RegexpTokenizer(r'[a-z]+')

# get list of list of tokens

 tokens = tokenizer.tokenize(text)

lemmatizeLemmatizing is the process of removing common word suffixes such as "-ing" and "-ed", "- s" (plural). Otherwise your classifier will treat different forms of the same word (e.g."vacation" vs

"vacations") as different words. To clean this list of tokens you can use the NLTK lemmatizer like

fromnltk.stemimportWordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# map list of tokens

tokens = [lemmatizer.lemmatize(t)fortintokens]

stopwordsStopwords are common words that do not convey much of information (e.g."the", "and", and "as"). And these words can wash out signals from more important words. To filter the stopwords from your list of tokens you can apply the NLTK list of standard English stop words like

fromnltk.corpusimportstopwords

stoplist = stopwords.words('english')

# filter list of list of tokens

tokens = [tfortintokensif nottinstoplist]

otherOther filtering? You may be able to identify a common pattern that leads to a number of misclassification. For instance each of the message files includes a "Subject" line as the first line with the email body following. Also some messages are in html instead of plain text format and you may want to augment the list of stopwords with common html tag names.

3.3 PreparetrainandtestSets

It is bad statistical practice to build and test your model with the same data. Instead you should holdout some fraction of your data (often 20%) and use the remaining data (80%) to build (train) your model. Thentestyour model with the holdout data. The following shows how to achieve an 80/20 split intotrainandtestsets.

 # shuffle (may disable for debug)

random.shuffle(all)

# split train/test

splitp = 0.80# 80/20 split

train =all[:int(splitp*len(all))] test =all[int(splitp*len(all)):]

3.4 EstimateP(fk|Hk)-i.e.P(word|Spam)

You should construct two dictionaries (one spam dictionary and one ham dictionary) to estimateP(fk|A) andP(fk|AC). The following code shows how you can iterate through thetrainingset to build the dictionaries. Each time you encounter a word setDict[word] = Dict[word] + 1. When you finish you can estimate of the probability of words in spam messages by looking up the count in your dictionary and dividing by the total number of words across all spam messages.

 SpamDict = {} HamDict{}

deffeaturizeTokens(tokens, is_spam):'''

check each word in the email count into the correct dictionary (use an if statement) '''

for(tokens,label)intrain: featurizeTokens(tokens, label == 'spam')

3.5 CompareP(Spam|x)andP(Spam|x)

You are ready to calculateP(Spam|x) for a given email! Iterate over yourtestemails. Compute the following product:P(Spam)?nk=1P(fk|Spam).P(Spam) is the probability of a message being spam and depends on the number of spam messages in your training set (how do you compute it?). Use yourSpamDictto estimateP(fk|Spam) for each word in the message. Repeat the calculation to computeP(Spam)?nk=1P(fk|Spam). Then assign the email to the category with the highest probability.

Two issues

1.P(fk|Spam)<<1.

P(fk|Spam) will be very small since the probability that the wordfkoccurs in any email is small. Therefore?P(fk|Spam) will beextremelysmall. This causes computational precision problems. Avoid this issue by using the fact that you can turn a product into a sum of logs,i.e.

N?N?P(Spam)?P(fk|Spam)=P(Spam)exp?lnP(fk|Spam) (10)

k=1k=12.Test emails contain words that are not in your dictionary.

If a word never appears in any training emails thenP(fk|Spam) = 0 sinceSpamDict[word] = 0. Therefore?P(fk|Spam) = 0. You can solve this in a few ways. One method is to "skip" these words by puttingP(fk|Spam) to some value that will not alter the probability when multiplied (what value would you choose)? The other is to use Laplace smoothing:

P(word|Spam) =SpamDict[word]+ 1(11)sum(SpamDict.values())+len(SpamDict)+ 1

The idea here is that you assume that every word (even words not in your training set) has at least some very small probability of occurring (called a "uniform prior"). This resolves the issue by givingP(word|Spam)>0 even ifSpamDict[word] = 0.