Question
Create a tokenizer Write a function tokenize that takes a string and returns a list of tokens. [ ] deftokenize(doc): return Calculate token scores Calculate
Create a tokenizer
Write a function tokenize that takes a string and returns a list of tokens.
[ ] deftokenize(doc): return
Calculate token scores
Calculate scores for every token in the corpus, using the method discussed in class. Store these scores in a dictionary called token_scores.
[ ] token_scores={}
Create a score message function
Write a function score_message that takes an SMS message doc and returns a SPAM score, using the method discussed in class.
[ ] defscore_message(doc):
return
- What tokens are most predictive of a message being SPAM? (coding)
- What tokens are most predictive of a message being HAM? (coding)
- How many documents are misclassified by the model?(coding)
import urllib.request, json
sms_corpus = []
with urllib.request.urlopen("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url:
for line in url.readlines():
sms_corpus.append(line.decode().split('t'))
# print the text and label of document 10
docid = 16
print(sms_corpus[docid])
# print the label of document 10
docid = 16
print(sms_corpus[docid][0])
# print the text of document 11
docid = 16
print(sms_corpus[docid][1])
- Can you get better results by improving your tokenizer?
- SMS SPAM Collection The SMS SPAM Collection is a corpus of real text messages (SMS messages) that have been classified as either SPAM or HAM (i.e. not SPAM). The corpus contains 5,574 documents, 747 of which are SPAM and 4,827 of which are HAM. You can find the readme for the corpus here. The following code downloads a copy of the SMS SPAM Corpus and saves it in a variable sms_corpus. import urllib.request, json sms_corpus = [] with urllib.request.urlopen ("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url: for line in url.readlines(): sms_corpus.append(line.decode().split('\t')) sms_corpus is a list. Each element of the list is another list which stores a document and its label. [ ] # print the text and label of document 10 docid= 16 print(sms_corpus[docid]) ["ham", "Oh k...i'm watching here:) "] [] # print the label of document 10 docid= 16 print(sms_corpus[docid] [0]) ham [] # print the text of document 11 docid= 16 print (sms_corpus[docid][1]) Oh k...i'm watching here:)
Step by Step Solution
There are 3 Steps involved in it
Step: 1
To complete the tasks youve outlined lets start by creating a tokenizer function Then well calculate token scores based on the method discussed in cla...Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started