
Question


Help me figure out what's wrong with my Python code.

Here is the code:

import nltk
import re
import pickle

raw = open('tom_sawyer_shrt.txt').read()

### this is how the basic Punkt sentence tokenizer works
#sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
#sents = sent_tokenizer.tokenize(raw)

### train & tokenize text using text
sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(sent_trainer)
# break in to sentences
sents = sent_tokenizer.tokenize(raw)
# get sentence start/stop indexes
sentspan = sent_tokenizer.span_tokenize(raw)

### Remove in the middle of sentences, due to fixed-width formatting
for i in range(0,len(sents)-1):
    sents[i] = re.sub('(?

for i in range(1,len(sents)):
    if (sents[i][0:3] == '" '):
        sents[i-1] = sents[i-1]+'" '
        sents[i] = sents[i][3:]

### Loop thru each sentence, fix to 140char
i=0
tweet=[]
while (i 140):
    ntwt = int(len(sents[i])/140) + 1
    words = sents[i].split(' ')
    nwords = len(words)
    for k in range(0,ntwt):
        tweet = tweet + [
            re.sub('\A\s|\s\Z', '', ' '.join(
                words[int(k*nwords/float(ntwt)):
                      int((k+1)*nwords/float(ntwt))]
            ))]
    i=i+1
else:
    if (i

### A last pass to clean up leading/trailing newlines/spaces.
for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A\s|\s\Z','',tweet[i])

for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A" ','',tweet[i])

### Save tweets to pickle file for easy reading later
output = open('tweet_list.pkl','wb')
pickle.dump(tweet,output,-1)
output.close()

listout = open('tweet_lis.txt','w')
for i in range(0,len(tweet)):
    listout.write(tweet[i])
    listout.write(' ----------------- ')

listout.close()

And here is the error message:

Traceback (most recent call last):
  File "twain_prep.py", line 13, in <module>
    sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1227, in train
    token_cls=self._Token).get_params()
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 649, in __init__
    self.train(train_text, verbose, finalize=True)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 713, in train
    self._train_tokens(self._tokenize_words(text), verbose)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 729, in _train_tokens
    tokens = list(tokens)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split(' '):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
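
A likely reading of this traceback, offered as a sketch rather than a confirmed diagnosis: the paths show Python 2.7, where open(...).read() returns raw bytes. When those bytes are mixed with a unicode string inside punkt.py, Python 2 tries to decode them as ASCII and fails on the first non-ASCII byte. Byte 0xe2 is the first byte of the UTF-8 encoding of curly quotes and dashes, so the text file is probably UTF-8 with typographic punctuation. Assuming that is the case, decoding the file when reading it should avoid the error. The helper name read_text and the sample string below are made up for illustration; only io.open and its encoding parameter come from the standard library.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a file and return decoded text instead of raw bytes.

    Assumption: the file is UTF-8 encoded. Pass another encoding if not.
    """
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Demonstration on a throwaway file containing curly quotes (hypothetical text).
sample = u"\u201cTom!\u201d No answer."
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(sample)

raw = read_text(demo_path)
os.remove(demo_path)

# The left curly quote encodes to three bytes in UTF-8, starting with 0xe2,
# the same byte reported in the UnicodeDecodeError.
first_byte = bytearray(sample.encode("utf-8"))[0]
```

In the script above, this would mean replacing raw = open('tom_sawyer_shrt.txt').read() with raw = read_text('tom_sawyer_shrt.txt'). Once raw is unicode, the tweets are unicode too, so under Python 2 the output file would probably need the same treatment, for example io.open('tweet_lis.txt', 'w', encoding='utf-8'), to avoid a matching UnicodeEncodeError on write.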
