

We can use the NLTK stopwords as below:

[ ]
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
sw_list = stopwords.words('english')
sw_list[:10]  # show some examples

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

TODO -- Complete the following function: it takes a string and returns a list of lowercase words that are not in the stopword list.

[6]
def text_prep(s):
    # tokenize, remove stopwords, lowercase
    output = []
    for word in s.split():
        if word in sw_list:
            continue
        output.append(word.lower())
    return output
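As a quick sanity check on a made-up string (note that the stopword test runs before lowercasing, so capitalized stopwords such as 'This' survive):

[ ]
text_prep("This is a simple Test sentence")
['this', 'simple', 'test', 'sentence']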

TODO -- Apply the function text_prep on data_tr.data and append the outputs to data_tr_prep. data_tr_prep will be a list of documents where each document is a list of preprocessed words.

[ ]
data_tr_prep = []
# TODO -- write your code here
100%|| 11314/11314 [00:07<00:00, 1590.17it/s] 
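One possible completion is sketched below; the progress bar above suggests the loop was wrapped in tqdm, which is an assumption here:

[ ]
from tqdm import tqdm  # assumption: tqdm produces the progress bar shown above

data_tr_prep = []
for doc in tqdm(data_tr.data):
    data_tr_prep.append(text_prep(doc))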

data_tr_prep[0] should print a list of words similar to the following list:

['from:', 'lerxst@wam.umd.edu', "(where's", 'thing)', 'subject:', 'car', 'this!?', 'nntp-posting-host:', 'rac3.wam.umd.edu', 'organization:', 'university', 'maryland,', ... ]

Build a Vocabulary

Now we need to build a vocabulary which contains a fixed number of unique words. Only the words in the vocabulary will be used in the prediction process.

Let's set a reasonable vocabulary size (i.e., V = 10000).

We will use a Python Counter to count all the words that appear in the entire training dataset. This counter is a dictionary of key-value (word-frequency) pairs.

[ ]
from collections import Counter, defaultdict

V = 10000
C = len(data_tr.target_names)

cnt_words = Counter()
for d in data_tr_prep:
    cnt_words.update(d)

cnt_words.most_common(n) returns the n most frequent words in the counter, as shown below. Let's not worry about the punctuation tokens for now.

[ ]
cnt_words.most_common(20)
[('>', 27843), ('subject:', 11644), ('from:', 11590), ('lines:', 11337), ('organization:', 10881), ('|', 10072), ('-', 9662), ('would', 8654), ('re:', 7857), ('--', 7639), ('writes:', 7505), ('one', 7481), ('|>', 6521), ('article', 6466), ('like', 5345), ('x', 4982), ('people', 4859), ('get', 4785), ('nntp-posting-host:', 4777), (':', 4710)]

TODO -- Build mappings between tokens (words) and their index numbers.

We create a data structure for the vocabulary of V words. You can use cnt_words.most_common(V) to get the top V most frequent words.

tok2idx should map a token to its index number and idx2tok should be a list of words in the frequency order.

[ ]
idx2tok = list()
tok2idx = dict()
# TODO -- write your code here to populate the idx2tok and tok2idx

You should see results like below:

> idx2tok[:10]
['>', 'subject:', 'from:', 'lines:', 'organization:', '|', '-', 'would', 're:', '--']
> tok2idx['would']
7
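One way to populate the two mappings from the top-V counts (a sketch consistent with the expected results above):

[ ]
idx2tok = list()
tok2idx = dict()
for idx, (tok, freq) in enumerate(cnt_words.most_common(V)):
    idx2tok.append(tok)   # index -> token, in frequency order
    tok2idx[tok] = idx    # token -> index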

Training a NB Classifier

The Naive Bayes classifier is a simple conditional probability model based on applying Bayes' theorem with a strong feature-independence assumption. For more details, you should carefully read the lecture slides.

In essence, we need to build a classifier that computes the following:

\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{w \in d} P(w \mid c)

That is, for each class c, we compute the product of the class prior P(c) and the conditional probabilities P(w|c) of the words w that appear in a document d.

To do this, we need to estimate the prior class probabilities P(c) and the conditional probabilities P(w|c). We will use the normalized frequencies to estimate these probabilities.

For example, P(c=rec.autos) can be estimated by the number of documents that belong to the class divided by the total number of documents.

Likewise, P(w=car|c=rec.autos) can be estimated as the fraction of times the word w appears among all words in documents of class c.

To handle the zero probability issue, you should also apply the 'add-1' smoothing. See the lecture slides.
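Concretely, the standard add-1 estimates consistent with the description above are:

P(c) = \frac{N_c}{N},
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + V}

where N_c is the number of training documents of class c, N is the total number of training documents, count(w, c) is the number of times word w occurs in documents of class c, and V is the vocabulary size.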

Now, the following NumPy arrays (i.e., cond_prob and prior_prob) will contain the estimated probabilities.

[ ]
import numpy as np

cond_prob = np.zeros((V, C))
prior_prob = np.zeros((C))

TODO -- Increment the counts and normalize them properly so that they can be used as probabilities.

[ ]
for d, c in zip(data_tr_prep, data_tr.target):
    # TODO -- Finish this for loop block.
    for t in d:
        if t in tok2idx:
            cond_prob[tok2idx[t], c] += 1
    prior_prob[c] += 1
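One possible way to finish this step is to normalize the counts with the add-1 formula above (a sketch; the exact normalization in the reference solution may differ slightly):

[ ]
# add-1 smoothing, then turn each class column into a probability distribution
cond_prob = (cond_prob + 1) / (np.sum(cond_prob, axis=0) + V)
# turn per-class document counts into prior probabilities
prior_prob = prior_prob / np.sum(prior_prob)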

prior_prob should look something like this:

array([0.04242531, 0.05161747, 0.05223617, 0.05214778, 0.05108715, 0.05241294, 0.05170585, 0.05250133, 0.05285487, 0.05276648, 0.05303164, 0.05258971, 0.05223617, 0.05250133, 0.05241294, 0.05294326, 0.04825879, 0.04984974, 0.04109952, 0.03332155])

cond_prob[10] should look something like this:

array([0.00802263, 0.00404768, 0.00520794, 0.00410638, 0.00516728, 0.00250812, 0.00143359, 0.0081197 , 0.00944117, 0.00747272, 0.00482113, 0.00474687, 0.0053405 , 0.00616861, 0.00579096, 0.00451822, 0.00591574, 0.00497174, 0.00676319, 0.00629697])

Inference

You will test your classifier with unseen examples (test dataset).

TODO -- Apply text_prep on data_ts in the same way as you did earlier.

[ ]
data_ts = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
data_ts_prep = []
# TODO -- Apply text_prep on data_ts and fill in data_ts_prep
100%|| 7532/7532 [00:04<00:00, 1718.21it/s] 
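As before, one possible completion (again assuming tqdm for the progress bar):

[ ]
data_ts_prep = []
for doc in tqdm(data_ts.data):
    data_ts_prep.append(text_prep(doc))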

Now, make a prediction.

For each test document, compute the "argmax" formula shown earlier. The argmax should tell you the class that maximizes the product of the prior/conditional probabilities.

You should take the log of the product for numerical stability (multiplying many small probabilities underflows) and for cheaper computation: summing log-probabilities replaces multiplying probabilities, and computers prefer addition to multiplication.

[ ]
import math

math.log(2)
0.6931471805599453

[ ]
import numpy as np

V = 3
cond_prob = np.zeros((10, 2))
cond_prob = cond_prob + 1                     # add-1 smoothing: add 1 to every count
cond_prob /= (np.sum(cond_prob, axis=0) + V)  # normalize each class column
cond_prob

array([[0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308],
       [0.07692308, 0.07692308]])

[ ]
import math

pred = []
for d, c in zip(data_ts_prep, data_ts.target):
    best_c = 0
    best_logprob = -math.inf
    for c_i in range(len(data_ts.target_names)):
        p = math.log(prior_prob[c_i])                         # log prior
        for t in d:
            if t in tok2idx:                                  # skip out-of-vocabulary words
                p = p + math.log(cond_prob[tok2idx[t], c_i])  # sum up the logprob
        # keep the class with the highest log-probability (the argmax)
        if p > best_logprob:
            best_logprob = p
            best_c = c_i
    pred.append(best_c)

Once you have made all the predictions for the test examples, you can run an evaluation metric to compute the accuracy.

If everything is correct, you should get around 70-77% accuracy.

[ ]
from sklearn.metrics import accuracy_score

accuracy_score(data_ts.target, pred)
