Question
A spoiler for 2(a): there is a deficiency in how we implemented the get_words function. When we are counting words, we probably don't care whether the word was adjacent to a punctuation mark. For example, the word hatter appears in the text 57 times, but if we queried the count_words dictionary, we would see it only appeared 24 times. The remaining occurrences were adjacent to a punctuation mark, so those instances got counted separately:
```python
>>> word_freq = words_by_frequency(words)
>>> for (word, freq) in word_freq:
...     if 'hatter' in word:
...         print('{:10} {:3d}'.format(word, freq))
hatter      24
hatter.     13
hatter,     10
hatter:      6
hatters      1
hatter's     1
hatter;      1
hatter.'     1
```
Our get_words function would be better if it separated punctuation from words. We can accomplish this by using the re.split function (be sure to import re first so that re.split() works). Below is a small example that demonstrates how str.split behaves on a short text and compares it to re.split:
```python
>>> import re
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> text.split()
['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is', 'in', 'vain."']
>>> re.split(r'(\W)', text)
['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the', ' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is', ' ', 'in', ' ', 'vain', '.', '', '"', '']
```
Note that this is not exactly what we want, but it is a lot closer. In the resulting list, we find empty strings and spaces, but we have also successfully separated the punctuation from the words.
Using the above example as a guide, write and test a function called tokenize that takes a string as an input and returns a list of words and punctuation, but not extraneous spaces and empty strings. Like get_words, tokenize should take an optional argument do_lower that determines whether the string should be case-normalized before the words are separated. You don't need to modify the re.split() line: just remove the empty strings and spaces.
```python
>>> tokenize(text, do_lower=True)
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little', 'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(' '.join(tokenize(text, do_lower=True)))
" oh no , no , " said the little fly , " to ask me is in vain . "
```
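If you get stuck, here is one possible sketch of tokenize. The name do_lower and the re.split() call come from the exercise; filtering with str.strip() is just one way to drop the unwanted tokens.

```python
import re

def tokenize(text, do_lower=False):
    """Split text into words and punctuation, dropping the spaces and
    empty strings that re.split leaves behind. One possible solution."""
    if do_lower:
        text = text.lower()               # case-normalize before splitting
    tokens = re.split(r'(\W)', text)      # capture group keeps the delimiters
    # tok.strip() is '' for both empty strings and whitespace-only tokens,
    # so one falsiness check removes both kinds of unwanted entries
    return [tok for tok in tokens if tok.strip()]
```

The capture group in the pattern is what makes re.split return the punctuation marks instead of silently discarding them.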
Checking In
Use your tokenize function in conjunction with your count_words function to list the top 5 most frequent words in carroll-alice.txt. You should get this:
```
'          2871    <-- single quote
,          2418    <-- comma
the        1642
.           988    <-- period
and         872
```
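A minimal sketch of the full pipeline, using collections.Counter as a stand-in for the count_words dictionary built earlier in the exercise (the small Fly sentence stands in for the file here; reading carroll-alice.txt instead should reproduce the counts above):

```python
import re
from collections import Counter

def tokenize(text, do_lower=False):
    # sketch from the exercise: split on non-word chars, drop blanks
    if do_lower:
        text = text.lower()
    return [tok for tok in re.split(r'(\W)', text) if tok.strip()]

# Counter behaves like a token -> frequency dict; to check against the
# numbers above, replace `sample` with the contents of carroll-alice.txt.
sample = '"Oh no, no," said the little Fly, "to ask me is in vain."'
counts = Counter(tokenize(sample, do_lower=True))

for word, freq in counts.most_common(5):
    print('{:10} {:5d}'.format(word, freq))
```

Counter.most_common(5) handles the "top 5" step directly, which is why no explicit sort is needed.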