
Question


Assignment 3: Naive Bayes Classifier for Spam Email Prediction

Procedure
1) Follow the steps in the given Jupyter Notebook file, named Spam Classification Using Naive Bayes.ipynb, to go through the text data pre-processing steps and display all results by running all cells.
2) Implement the Naive Bayes Classifier (NBC) in Section 2.2 of the Jupyter Notebook. (Note that you must implement it from scratch; you are not allowed to use any NBC-related Python library.)
3) Apply the implemented NBC to the given dataset (i.e., spam.csv).
4) Compute the classification accuracy for the test set and display it.


The given notebook, Spam Classification Using Naive Bayes.ipynb, contains the following.

1. Contents of this notebook
- Text Analysis: explore the data, develop insights
- Text Transformation: data cleaning (removing unimportant data / stopwords / stemming), converting the data into a format usable by the model (Bag of Words model)
- Naive Bayes Model for Spam Classification

TEXT ANALYSIS

In [ ]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Text preprocessing
import nltk
nltk.download('all')  # you will need to download it if you have not done so
from nltk.corpus import stopwords
import string
from nltk.tokenize import word_tokenize

Load dataset. We will use the Pandas library to load the dataset. More information regarding Pandas can be found at pandas.pydata.org.

In [ ]:
messages = pd.read_csv("spam.csv", encoding='latin-1')

# Drop the extra columns and rename the remaining columns
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["category", "text"]

In [ ]:
display(messages.head(n=20))

Check the overall information of the dataset.

In [ ]:
messages.info()

In [ ]:
messages["category"].value_counts().plot(kind='pie', figsize=(5, 6), fontsize=14, autopct='%1.1f%%', shadow=True)
plt.ylabel("Spam vs Ham")
plt.legend(["Ham", "Spam"])
plt.show()

From the pie chart above, it can be seen that about 85% of our dataset consists of non-spam (ham) messages.
- When we split our dataset into train and test sets, stratified sampling is recommended; otherwise our training model has a chance of being skewed towards normal messages. If the sample we choose to train our model consists mostly of normal messages, it may end up predicting everything as ham, and we might not be able to figure this out, since most of the messages we receive are actually ham and the model will still show a pretty good accuracy.

Now let us look at the individual spam/ham words.

In [ ]:
spam_messages = messages[messages["category"] == "spam"]["text"]
ham_messages = messages[messages["category"] == "ham"]["text"]
spam_words = []
ham_words = []

Since this is just classifying the message as spam or ham, we can use isalpha(). This will also remove the "not" in something like "can't". In a sentiment-analysis setting, it would be better to use sentence.translate(string.maketrans("", ""), chars_to_remove).

In [ ]:
def extractSpamWords(spamMessage):
    global spam_words
    words = [word.lower() for word in word_tokenize(spamMessage)
             if word.lower() not in stopwords.words("english") and word.lower().isalpha()]
    spam_words = spam_words + words

def extractHamWords(hamMessage):
    global ham_words
    words = [word.lower() for word in word_tokenize(hamMessage)
             if word.lower() not in stopwords.words("english") and word.lower().isalpha()]
    ham_words = ham_words + words

spam_messages.apply(extractSpamWords)
ham_messages.apply(extractHamWords)

In [ ]:
print("Total Messages:", len(ham_messages) + len(spam_messages))

In [ ]:
# Top 10 spam words
spam_words = np.array(spam_words)
print("Top 10 spam words are:\n")
pd.Series(spam_words).value_counts().head(n=10)

In [ ]:
# Top 10 ham words
ham_words = np.array(ham_words)
print("Top 10 ham words are:\n")
pd.Series(ham_words).value_counts().head(n=10)

Does the length of the message tell us anything?

In [ ]:
messages["messageLength"] = messages["text"].apply(len)
messages["messageLength"].describe()

In [ ]:
f, ax = plt.subplots(2, 1, figsize=(6, 10))
sns.distplot(messages[messages["category"] == "spam"]["messageLength"], bins=20, ax=ax[0])
ax[0].set_xlabel("Spam Message Word Length")
sns.distplot(messages[messages["category"] == "ham"]["messageLength"], bins=20, ax=ax[1])

In [ ]:
from nltk.stem import SnowballStemmer

def stemmer(text):
    words = ""
    for i in text.split():
        stemmer = SnowballStemmer("english")
        words += stemmer.stem(i) + " "
    return words

def puncStopw(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = [word for word in text.split() if word.lower() not in stopwords.words("english")]
    return " ".join(text)

messages["text"] = messages["text"].apply(stemmer)
messages["text"] = messages["text"].apply(puncStopw)
messages.head(n=10)

You may compare the original text with the filtered one to see the difference.

Convert the clean text into a feature representation.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
features_np = vec.fit_transform(messages["text"]).toarray()  # converting to array
print(features_np.shape)

2. MODEL APPLICATION

In this section, you will implement the Naive Bayes Classifier on the input data and predict whether a given email is spam or ham.

2.1 First, convert the category of SPAM and HAM messages into 1 and 0, respectively, and then split the data into a training set and a test set.

In [ ]:
print(messages["category"])

def encodeCategory(cat):
    if cat == "spam":
        return 1
    else:
        return 0

messages["category"] = messages["category"].apply(encodeCategory)

# convert data to a numpy array
messages_np = messages["category"].to_numpy()
print(messages_np)

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_np, messages_np,
                                                     stratify=messages_np,
                                                     test_size=0.3, random_state=...)

In [ ]:
print("The size of ham messages:", len(ham_messages))
print("The size of spam messages:", len(spam_messages))
print("The size of total samples:", len(messages["category"]))
print("The size of training samples:", X_train.shape)
print("The size of testing samples:", X_test.shape)
print("The size of spam messages in training samples:", len(y_train[y_train == 1]))
print("The size of ham messages in training samples:", len(y_train[y_train == 0]))
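
Section 2.2, where the classifier itself must be written from scratch, is not included in the transcription above. As a rough illustration only (not the notebook's or an expert's solution), the sketch below shows one way to implement a multinomial Naive Bayes classifier with Laplace smoothing on top of the bag-of-words arrays produced in Section 2.1. The names fit_naive_bayes, predict_naive_bayes, log_prior, and log_likelihood are placeholders introduced here, and the smoothing parameter alpha=1.0 is an assumed default.

In [ ]:
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    # Estimate log class priors and Laplace-smoothed log word likelihoods
    # from a bag-of-words count matrix X and a label vector y (0 = ham, 1 = spam).
    n_samples, n_features = X.shape
    classes = np.unique(y)
    log_prior = np.zeros(len(classes))
    log_likelihood = np.zeros((len(classes), n_features))
    for idx, c in enumerate(classes):
        X_c = X[y == c]
        # log P(c) = log(number of class-c messages / total messages)
        log_prior[idx] = np.log(X_c.shape[0] / n_samples)
        # log P(word | c) with add-alpha smoothing over the class word counts
        smoothed_counts = X_c.sum(axis=0) + alpha
        log_likelihood[idx] = np.log(smoothed_counts / smoothed_counts.sum())
    return classes, log_prior, log_likelihood

def predict_naive_bayes(X, classes, log_prior, log_likelihood):
    # Posterior score (up to a constant): log P(c) + sum over words of count * log P(word | c)
    joint_log_likelihood = X @ log_likelihood.T + log_prior
    return classes[np.argmax(joint_log_likelihood, axis=1)]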
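Under the same assumptions, steps 3 and 4 of the procedure (applying the classifier to the split from Section 2.1 and reporting test-set accuracy) could then look like the following; no NBC library is involved, only NumPy. Working in log space here avoids numerical underflow when many small word probabilities are combined.

In [ ]:
# Train on the training split, predict the test split, and report accuracy.
classes, log_prior, log_likelihood = fit_naive_bayes(X_train, y_train)
y_pred = predict_naive_bayes(X_test, classes, log_prior, log_likelihood)
accuracy = np.mean(y_pred == y_test)
print("Test set classification accuracy:", accuracy)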

