
Question


A spoiler for 2(a): there is a deficiency in how we implemented the get_words function. When we are counting words, we probably don't care whether the word was adjacent to a punctuation mark. For example, the word hatter appears in the text 57 times, but if we queried the count_words dictionary, we would see it appear only 24 times. The remaining instances appeared adjacent to a punctuation mark, so they got counted separately:

>>> word_freq = words_by_frequency(words)
>>> for (word, freq) in word_freq:
...     if 'hatter' in word:
...         print('{:10} {:3d}'.format(word, freq))
hatter      24
hatter.     13
hatter,     10
hatter:      6
hatters      1
hatter's     1
hatter;      1
hatter.'     1
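As a reminder, count_words and words_by_frequency were written in earlier parts of the exercise. Here is a minimal sketch of the behavior assumed above (your versions may differ in detail):

def count_words(words):
    # Map each word to the number of times it occurs.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def words_by_frequency(words):
    # Return (word, frequency) pairs, most frequent first.
    return sorted(count_words(words).items(),
                  key=lambda pair: pair[1], reverse=True)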

Our get_words function would be better if it separated punctuation from words. We can accomplish this with the re.split function (be sure to import re before calling re.split()). Below is a small example that shows how str.split behaves on a short text and compares it with re.split:

text = '"Oh no, no," said the little Fly, "to ask me is in vain."' text.split() ['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is', 'in', 'vain."'] re.split(r'(\W)', text) ['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the', ' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is', ' ', 'in', ' ', 'vain', '.', '', '"', '']

Note that this is not exactly what we want, but it is a lot closer. In the resulting list, we find empty strings and spaces, but we have also successfully separated the punctuation from the words.

Using the above example as a guide, write and test a function called tokenize that takes a string as input and returns a list of words and punctuation, but no extraneous spaces or empty strings. Like get_words, tokenize should take an optional argument do_lower that determines whether the string should be case-normalized before the words are separated. You don't need to modify the re.split() line: just remove the empty strings and spaces. The expected behavior:

>>> tokenize(text, do_lower=True)
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little', 'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(' '.join(tokenize(text, do_lower=True)))
" oh no , no , " said the little fly , " to ask me is in vain . "
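One possible implementation, following the hint above (a sketch: the re.split() line is unchanged, and only the filtering step is added):

import re

def tokenize(text, do_lower=False):
    # Optionally case-normalize first, mirroring get_words.
    if do_lower:
        text = text.lower()
    # Split on every non-word character; the capture group keeps
    # the separators (the punctuation) in the result.
    tokens = re.split(r'(\W)', text)
    # Keep only tokens that contain something other than whitespace,
    # dropping the empty strings and bare spaces.
    return [tok for tok in tokens if tok.strip()]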

Checking In

Use your tokenize function in conjunction with your count_words function to list the top 5 most frequent words in carroll-alice.txt. You should get this:

'     2871   <-- single quote
,     2418   <-- comma
the   1642
.      988   <-- period
and    872
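One way to produce that list (a sketch, assuming carroll-alice.txt is in the working directory and words_by_frequency sorts most frequent first):

with open('carroll-alice.txt') as f:
    text = f.read()

# Tokenize, then print the five most frequent tokens with their counts.
for (word, freq) in words_by_frequency(tokenize(text, do_lower=True))[:5]:
    print('{:5} {:4d}'.format(word, freq))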
