Question
We can use the NLTK stopwords as below:
[ ]
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
sw_list = stopwords.words('english')
sw_list[:10]  # show some examples
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
TODO -- Complete the following function; this function takes a string and returns a list of words in lowercase which are not in the stopwords list.
[6]
def text_prep(s):
    # tokenize, lowercase, and remove stopwords
    output = []
    for word in s.split():
        word = word.lower()  # lowercase first so capitalized stopwords are removed too
        if word in sw_list:
            continue
        output.append(word)
    return output
TODO -- Apply the function text_prep on data_tr.data and append the outputs to data_tr_prep. data_tr_prep will be a list of documents where each document is a list of preprocessed words.
[ ]
data_tr_prep = [] # TODO -- write your code here
100%|| 11314/11314 [00:07<00:00, 1590.17it/s]
data_tr_prep[0] should print a list of words similar to the following list:
['from:', 'lerxst@wam.umd.edu', "(where's", 'thing)', 'subject:', 'car', 'this!?', 'nntp-posting-host:', 'rac3.wam.umd.edu', 'organization:', 'university', 'maryland,', ... ]
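One possible completion of the cell above is a simple loop over data_tr.data. The tqdm wrapper is an assumption here, included only because the progress bar in the output suggests it was used:

from tqdm import tqdm  # assumed: used only to reproduce the progress bar shown above

data_tr_prep = []
for doc in tqdm(data_tr.data):
    data_tr_prep.append(text_prep(doc))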
Build a Vocabulary
Now we need to build a vocabulary which contains a fixed number of unique words. Only the words in the vocabulary will be used in the prediction process.
Let's set a reasonable vocabulary size (e.g., V = 10000).
We will use a Python Counter to count all the words that appear in the entire training dataset. This counter is a dictionary of key-value (word-frequency) pairs.
[ ]
from collections import Counter, defaultdict

V = 10000
C = len(data_tr.target_names)

cnt_words = Counter()
for d in data_tr_prep:
    cnt_words.update(d)
.most_common(n) will return the n most frequent words in the counter, as shown below. Let's not worry about the punctuation tokens for now.
[ ]
cnt_words.most_common(20)
[('>', 27843), ('subject:', 11644), ('from:', 11590), ('lines:', 11337), ('organization:', 10881), ('|', 10072), ('-', 9662), ('would', 8654), ('re:', 7857), ('--', 7639), ('writes:', 7505), ('one', 7481), ('|>', 6521), ('article', 6466), ('like', 5345), ('x', 4982), ('people', 4859), ('get', 4785), ('nntp-posting-host:', 4777), (':', 4710)]
TODO -- Build mappings between tokens (words) and their index numbers.
We create a data structure for the vocabulary of V words. You can use cnt_words.most_common(V) to get the top V most frequent words.
tok2idx should map a token to its index number and idx2tok should be a list of words in the frequency order.
[ ]
idx2tok = list()
tok2idx = dict()

# TODO -- write your code here to populate the idx2tok and tok2idx
You should see results like below:
> idx2tok[:10]
['>', 'subject:', 'from:', 'lines:', 'organization:', '|', '-', 'would', 're:', '--']
> tok2idx['would']
7
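One possible way to populate them from the counter, as a sketch:

# take the V most frequent words; idx2tok keeps them in frequency order,
# and tok2idx maps each word back to its position
for idx, (tok, freq) in enumerate(cnt_words.most_common(V)):
    idx2tok.append(tok)
    tok2idx[tok] = idx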
Training an NB Classifier
A Naive Bayes classifier is a simple conditional probability model based on applying Bayes' theorem with a strong feature-independence assumption. For more details, you should carefully read the lecture slides.
In essence, we need to build a classifier that computes the following:
$\hat{c} = \arg\max_{c \in C} P(c) \prod_{w \in d} P(w \mid c)$
That is, for each class c, we compute the product of the class prior P(c) and the conditional probabilities P(w|c) of the words w appearing in a document d.
To do this, we need to estimate the prior class probabilities P(c) and the conditional probabilities P(w|c). We will use the normalized frequencies to estimate these probabilities.
For example, P(c=rec.autos) can be estimated as the number of documents that belong to the class divided by the total number of documents.
Likewise, P(w=car|c=rec.autos) can be estimated as the fraction of times the word w appears among all the words in documents of class c.
To handle the zero probability issue, you should also apply 'add-1' smoothing. See the lecture slides.
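Concretely, these estimates take the following standard form (with the add-1 smoothing in the denominator of the conditional; check the exact form against the lecture slides):

$\hat{P}(c) = \frac{N_c}{N}, \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + V}$

where N_c is the number of training documents in class c, N is the total number of training documents, count(w, c) is the number of times w occurs in documents of class c, and V is the vocabulary size.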
Now, the following NumPy arrays (i.e., cond_prob and prior_prob) will hold the estimated probabilities.
[ ]
import numpy as np

cond_prob = np.zeros((V, C))
prior_prob = np.zeros((C))
TODO -- Increment the counts and normalize them properly so that they can be used as probabilities.
[ ]
for d, c in zip(data_tr_prep, data_tr.target):
    # TODO -- Finish this for loop block.
    for t in d:
        if t in tok2idx:
            cond_prob[tok2idx[t], c] += 1
    prior_prob[c] += 1  # one count per document for the class prior
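One possible way to finish the cell is to normalize the counts after the loop. This is a sketch that assumes the standard add-1 smoothing for cond_prob (as in the formula above) and the document-count normalization for prior_prob; the exact form should follow the lecture slides.

# add-1 smoothing, then normalize each class column into a distribution over the V words
cond_prob = (cond_prob + 1) / (cond_prob.sum(axis=0) + V)

# class prior: fraction of training documents belonging to each class
prior_prob = prior_prob / prior_prob.sum()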
prior_prob should look something like this:
array([0.04242531, 0.05161747, 0.05223617, 0.05214778, 0.05108715, 0.05241294, 0.05170585, 0.05250133, 0.05285487, 0.05276648, 0.05303164, 0.05258971, 0.05223617, 0.05250133, 0.05241294, 0.05294326, 0.04825879, 0.04984974, 0.04109952, 0.03332155])
cond_prob[10] should look something like this:
array([0.00802263, 0.00404768, 0.00520794, 0.00410638, 0.00516728, 0.00250812, 0.00143359, 0.0081197 , 0.00944117, 0.00747272, 0.00482113, 0.00474687, 0.0053405 , 0.00616861, 0.00579096, 0.00451822, 0.00591574, 0.00497174, 0.00676319, 0.00629697])
Inference
You will test your classifier with unseen examples (test dataset).
TODO -- Apply text_prep on data_ts in the same way as you did earlier.
[ ]
from sklearn.datasets import fetch_20newsgroups

data_ts = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
data_ts_prep = []
# TODO -- Apply text_prep on data_ts and fill in data_ts_prep
100%|| 7532/7532 [00:04<00:00, 1718.21it/s]
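A possible completion mirrors the training-set preprocessing (again, tqdm is assumed only to reproduce the progress bar):

for doc in tqdm(data_ts.data):
    data_ts_prep.append(text_prep(doc))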
Now, make a prediction.
For each test document, compute the "argmax" formula shown earlier. The argmax should tell you the class that maximizes the product of the prior/conditional probabilities.
You should apply log to the product for numerical stability and less expensive computation; computers prefer addition to multiplication.
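In log space, the decision rule shown earlier becomes a sum:

$\hat{c} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{w \in d} \log P(w \mid c) \right]$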
[ ]
import math

math.log(2)
0.6931471805599453
[ ]
import numpy as np

# toy demonstration of add-1 smoothing on a small counts matrix
# (note: this cell reuses the names V and cond_prob with toy values)
V = 3
cond_prob = np.zeros((10, 2))
cond_prob = cond_prob + 1                     # add 1 to every count
cond_prob /= (np.sum(cond_prob, axis=0) + V)  # normalize each column
cond_prob
array([[0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308], [0.07692308, 0.07692308]])
[ ]
import math

pred = []
for d, c in zip(data_ts_prep, data_ts.target):
    best_c, best_logp = None, -math.inf
    for c_i in range(len(data_ts.target_names)):
        # start from the log prior, then sum up the log conditional probabilities
        p = math.log(prior_prob[c_i])
        for t in d:
            if t in tok2idx:  # only vocabulary words are used in the prediction
                p = p + math.log(cond_prob[tok2idx[t], c_i])
        # keep the class with the highest log probability (the argmax)
        if p > best_logp:
            best_c, best_logp = c_i, p
    pred.append(best_c)
Once you have made all the predictions for the test examples, you can run an evaluation metric for accuracy.
If everything is correct, you should get around 70-77% accuracy.
[ ]
from sklearn.metrics import accuracy_score

accuracy_score(data_ts.target, pred)