Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Create a tokenizer Write a function tokenize that takes a string and returns a list of tokens. [ ] deftokenize(doc): return Calculate token scores Calculate

Create a tokenizer


Write a function tokenize that takes a string and returns a list of tokens.


[ ] deftokenize(doc): return

Calculate token scores


Calculate scores for every token in the corpus, using the method discussed in class. Store these scores in a dictionary called token_scores.


[ ] token_scores={}

Create a score message function


Write a function score_message that takes an SMS message doc and returns a SPAM score, using the method discussed in class.


[ ] defscore_message(doc):

return


  1. What tokens are most predictive of a message being SPAM? (coding)
  2. What tokens are most predictive of a message being HAM? (coding)
  3. How many documents are misclassified by the model?(coding)

import urllib.request, json 

sms_corpus = []

with urllib.request.urlopen("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url:

  for line in url.readlines():

    sms_corpus.append(line.decode().split('t'))

 

# print the text and label of document 10

docid = 16

print(sms_corpus[docid])

 

# print the label of document 10

docid = 16

print(sms_corpus[docid][0])

 

# print the text of document 11

docid = 16

print(sms_corpus[docid][1])

image
  1. Can you get better results by improving your tokenizer?

- SMS SPAM Collection The SMS SPAM Collection is a corpus of real text messages (SMS messages) that have been classified as either SPAM or HAM (i.e. not SPAM). The corpus contains 5,574 documents, 747 of which are SPAM and 4,827 of which are HAM. You can find the readme for the corpus here. The following code downloads a copy of the SMS SPAM Corpus and saves it in a variable sms_corpus. import urllib.request, json sms_corpus = [] with urllib.request.urlopen ("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url: for line in url.readlines(): sms_corpus.append(line.decode().split('\t')) sms_corpus is a list. Each element of the list is another list which stores a document and its label. [ ] # print the text and label of document 10 docid= 16 print(sms_corpus[docid]) ["ham", "Oh k...i'm watching here:) "] [] # print the label of document 10 docid= 16 print(sms_corpus[docid] [0]) ham [] # print the text of document 11 docid= 16 print (sms_corpus[docid][1]) Oh k...i'm watching here:)

Step by Step Solution

There are 3 Steps involved in it

Step: 1

To complete the tasks youve outlined lets start by creating a tokenizer function Then well calculate token scores based on the method discussed in cla... blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Artificial Intelligence A Modern Approach

Authors: Stuart Russell, Peter Norvig

4th Edition

0134610997, 978-0134610993

More Books

Students also viewed these Programming questions