Question

Implement a Naive Bayes classification naiveBayes_classify(word_probs, message) for classifying an email message into spam or non-spam

Implement a Naive Bayes classifier, naiveBayes_classify(word_probs, message), that classifies an email message as spam or non-spam using the word probability distributions, word_probs, learned from a set of training data.
In this question, you are asked to implement the Naive Bayes method from scratch by implementing the following functions. To simplify the implementation, we assume that any message is equally likely to be spam or non-spam.
tokenize(message): extracts the set of unique words from the given text message.
count_words(training_set): creates a dictionary mapping each unique word to its frequencies in the spam and non-spam messages of the training set.
word_probabilities(counts, total_spams, total_non_spams, k=0.5): turns the word counts into a list of triples (w, p(w | spam), p(w | ~spam)).
spam_probability(word_probs, message, total_spams, total_non_spams, k=0.5): computes the probability that the given message is spam.
naiveBayes_classify(word_probs, message, total_spams, total_non_spams, k): classifies the message as spam or ham.
Use the data set spam.csv to evaluate the classifier in terms of accuracy, recall, precision, and F1-score.
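The four evaluation metrics named above can be computed directly from the true and predicted labels. The following is a minimal sketch, assuming 'spam' is treated as the positive class; the helper name `evaluate` and the sample labels are hypothetical and do not come from the question or from spam.csv.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for one positive class.

    `evaluate` is a hypothetical helper, not part of the assignment.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, only to show the call.
y_true = ["spam", "spam", "ham", "ham"]
y_guess = ["spam", "ham", "ham", "spam"]
metrics = evaluate(y_true, y_guess)
print(metrics)
```

Precision asks how many messages flagged as spam really were spam, while recall asks how many of the real spam messages were caught; F1 is their harmonic mean.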
from collections import Counter, defaultdict
import math, re

def tokenize(message):
    """
    extracts the set of unique words from the given text message
    INPUT:
    message: a piece of text
    OUTPUT:
    a set of unique words
    """
    message = message.lower()  # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)  # extract the words
    return set(all_words)  # remove duplicates

def count_words(training_set):
    """
    creates a dictionary containing the mappings from unique words to the frequencies of the words in
    spam and non-spam messages in the training set
    INPUT:
    training_set: training set consists of pairs (message, is_spam)
    OUTPUT:
    a map from unique words to their frequencies in spam and non-spam messages
    """
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

# example: index 0 holds the spam count, index 1 the non-spam count
counts = defaultdict(lambda: [0, 0])
counts["wins"][0] = 50
counts["wins"][1] = 500
counts["wins"]

def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """
    turns the word counts into a list of triples (w, p(w | spam), p(w | ~spam))
    INPUT:
    counts: a map from unique words to their frequencies in spam and non-spam messages
    total_spams: the total number of spam messages
    total_non_spams: the total number of non-spam messages
    k=0.5: the smoothing parameter, default 0.5
    OUTPUT:
    a list of triples (w, p(w|spam), p(w|non-spam))
    """
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]
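
To make the smoothing concrete, here is a small worked sketch of word_probabilities. The counts for "wins" are the ones from the example above; the totals (100 spam and 1000 non-spam messages) and the zero-count word "unseen" are hypothetical values chosen for illustration. One way to read the 2 * k in the denominator is that k pseudo-counts are added to each of the two outcomes, word present and word absent.

```python
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    # Same formula as in the question: add k to each count and 2k to each total.
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]


# Hypothetical totals; "unseen" is a made-up word with zero counts.
counts = {"wins": [50, 500], "unseen": [0, 0]}
probs = word_probabilities(counts, total_spams=100, total_non_spams=1000)

for w, p_spam, p_ham in probs:
    print(w, p_spam, p_ham)
# "wins" gives 0.5 and 0.5: (50 + 0.5) / 101 and (500 + 0.5) / 1001
```

Although "wins" appears ten times more often in non-spam messages in absolute terms, under these assumed totals it is equally likely in both classes, so it carries no evidence either way. The zero-count word still gets a small non-zero probability, which keeps the later log computation defined.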

def spam_probability(word_probs, message, total_spams, total_non_spams, k=0.5):
    """
    computes the probability of spam for the given message
    INPUT:
    word_probs: a list of triples (w, p(w|spam), p(w|non-spam))
    message: the message under classification
    OUTPUT:
    the probability that the message is spam
    HINTS:
    First, get the set of unique words in the message.
    Second, sum up the log probabilities of the unique words in the message (once for spam, once for non-spam).
    Third, recover the two probabilities by taking exponentials of the summed log probabilities.
    Finally, return the ratio of the probability of spam to the sum of the probability of spam and the
    probability of non-spam.
    """
    ############ YOUR CODE HERE ##################
    return prob_spam / (prob_spam + prob_ham)
    ################
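
One possible way to fill in the body, following the hints, is sketched below. Two choices here are assumptions rather than part of the question: a word that never appeared in training is given the smoothed probability of a zero count, k / (total + 2k), which is one plausible reason the totals and k are passed in; and the final ratio is computed as a logistic function of the log difference, which is algebraically the same as prob_spam / (prob_spam + prob_ham) but avoids underflow on long messages.

```python
import math
import re


def tokenize(message):
    # Same tokenizer as in the question: lowercase, then unique word tokens.
    return set(re.findall("[a-z0-9']+", message.lower()))


def spam_probability(word_probs, message, total_spams, total_non_spams, k=0.5):
    # Index the triples by word for constant-time lookup.
    lookup = {w: (p_spam, p_ham) for w, p_spam, p_ham in word_probs}

    # ASSUMPTION: unseen words get the smoothed probability of a zero count.
    # This needs k > 0, otherwise math.log(0) would raise an error.
    unseen = (k / (total_spams + 2 * k), k / (total_non_spams + 2 * k))

    log_spam = log_ham = 0.0
    for word in tokenize(message):
        p_spam, p_ham = lookup.get(word, unseen)
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)

    # exp(log_spam) / (exp(log_spam) + exp(log_ham)), rewritten as
    # 1 / (1 + exp(log_ham - log_spam)) for numerical stability.
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; the ratio is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Made-up probabilities, only to show the call.
p = spam_probability([("wins", 0.9, 0.1)], "Wins wins", 10, 10)
print(p)
```

Because the two classes are assumed equally likely, no prior term appears in either log sum; an empty message therefore comes out at exactly 0.5.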

def naiveBayes_classify(word_probs, message, total_spams, total_non_spams, k):
    """
    classifies the message as spam or ham
    INPUT:
    word_probs: a list of triples (w, p(w|spam), p(w|non-spam))
    message: the message under classification
    OUTPUT:
    'spam' or 'ham' indicating the classification of the message.
    """

# MUST WORK WITH THE FOLLOWING STATEMENTS
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
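
The pieces can be tied together as in the sketch below, which trains on a few messages, classifies two more, and builds the y_test and y_pred lists that the statements above expect. Several things are assumptions: the 0.5 decision threshold (with ties going to 'ham'), the handling of unseen words in spam_probability, and the toy messages, which are made up and stand in for the contents of spam.csv.

```python
import math
import re
from collections import defaultdict


def tokenize(message):
    return set(re.findall("[a-z0-9']+", message.lower()))


def count_words(training_set):
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts


def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]


def spam_probability(word_probs, message, total_spams, total_non_spams, k=0.5):
    # One possible body; unseen words fall back to a smoothed zero count
    # (an assumption, not stated in the question).
    lookup = {w: (ps, ph) for w, ps, ph in word_probs}
    unseen = (k / (total_spams + 2 * k), k / (total_non_spams + 2 * k))
    log_spam = log_ham = 0.0
    for word in tokenize(message):
        ps, ph = lookup.get(word, unseen)
        log_spam += math.log(ps)
        log_ham += math.log(ph)
    diff = log_ham - log_spam
    return 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))


def naiveBayes_classify(word_probs, message, total_spams, total_non_spams, k):
    # ASSUMPTION: 0.5 decision threshold; a tie is labeled 'ham'.
    p = spam_probability(word_probs, message, total_spams, total_non_spams, k)
    return "spam" if p > 0.5 else "ham"


# Made-up toy data, only to show the call sequence (not from spam.csv).
train_set = [("win cash now", True), ("cash prize win", True),
             ("lunch at noon", False), ("see you at lunch", False)]
test_set = [("win a prize", "spam"), ("lunch now", "ham")]

total_spams = sum(1 for _, is_spam in train_set if is_spam)
total_non_spams = len(train_set) - total_spams
word_probs = word_probabilities(count_words(train_set),
                                total_spams, total_non_spams)

y_test = [label for _, label in test_set]
y_pred = [naiveBayes_classify(word_probs, msg, total_spams, total_non_spams, 0.5)
          for msg, _ in test_set]
print(y_test, y_pred)
```

With y_test and y_pred built as lists of 'spam' and 'ham' strings, the classification_report call from the question can be applied to them unchanged.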
