Question
Write a program that reports the five most frequent two-word sequences in a text file downloaded from Project Gutenberg. The program shall:
- Find the beginning and the end of the text (look for the markers "*** START OF THE PROJECT GUTENBERG EBOOK..." and "*** END OF THE PROJECT GUTENBERG EBOOK...") and discard everything before the beginning and after the end, including the markers.
- Break the text into words using spaces as separators.
- Convert each word to lowercase and remove any punctuation. If a "word" consists only of punctuation, discard it entirely. Thus, "Huck Finn is drawn from life ; Tom Sawyer also, but" shall become "huck finn is drawn from life tom sawyer also but".
- Count all combinations of two consecutive words (they are known as bigrams -- e.g., "huck finn," "finn is," "is drawn," "drawn from") and report the five most frequent of them.
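The four requirements above can be sketched roughly as follows. This is a minimal sketch, not a definitive solution: the function names (`extract_body`, `clean_words`, `top_bigrams`) are made up for illustration, "punctuation" is assumed to mean the ASCII characters in `string.punctuation` (curly quotes or long dashes in a UTF-8 Gutenberg file would need extra handling), and `str.split()` is used so that line breaks count as separators along with spaces.

```python
import string
from collections import Counter

# Marker prefixes quoted in the assignment; the rest of each marker line varies by book.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"

# Translation table that deletes every ASCII punctuation character (assumption:
# ASCII punctuation is what the assignment means by "punctuation").
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def extract_body(raw):
    """Return the text between the two marker lines, excluding the markers."""
    start = raw.find(START_MARKER)
    if start == -1:
        raise ValueError("start marker not found")
    end = raw.find(END_MARKER, start)
    if end == -1:
        raise ValueError("end marker not found")
    # Skip to the end of the start-marker line so the whole marker is dropped.
    newline = raw.find("\n", start)
    body_start = newline + 1 if newline != -1 else len(raw)
    return raw[body_start:end]


def clean_words(text):
    """Split on whitespace, lowercase, strip punctuation, drop empty results."""
    words = []
    for token in text.split():
        word = token.lower().translate(_PUNCT_TABLE)
        if word:  # a token made only of punctuation becomes "" and is discarded
            words.append(word)
    return words


def top_bigrams(words, n=5):
    """Count pairs of consecutive words and return the n most frequent."""
    counts = Counter(zip(words, words[1:]))
    return counts.most_common(n)
```

Calling `top_bigrams(clean_words(extract_body(raw)))` on the contents of the downloaded file would then give a list of `((first, second), count)` pairs. Note that bigrams are counted across sentence and line boundaries here, since the assignment does not say otherwise.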
Test your program by counting bigrams in The Adventures of Tom Sawyer, by Mark Twain. Do not write code for downloading the file.
Deliverables: the Python file and the program's output as a text file listing the bigrams and their counts, one result per line, in decreasing order of count (the most frequent bigram at the top).
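One possible way to produce the output file is sketched below. It assumes the results arrive as `((first_word, second_word), count)` pairs; the helper names and the "word word count" line layout are illustrative choices, since the assignment only asks for one result per line with the most frequent bigram first.

```python
def format_results(ranked):
    """Turn ((first, second), count) pairs into text lines, highest count first."""
    # sorted() is stable, so bigrams with equal counts keep their incoming order.
    ordered = sorted(ranked, key=lambda item: item[1], reverse=True)
    return [f"{first} {second} {count}" for (first, second), count in ordered]


def write_results(ranked, path):
    """Write one 'first second count' line per bigram to the given file."""
    with open(path, "w", encoding="utf-8") as out:
        for line in format_results(ranked):
            out.write(line + "\n")
```

For instance, `write_results(pairs, "bigrams.txt")` would create the text deliverable, where `pairs` is the list of the five most frequent bigrams and `bigrams.txt` is a placeholder file name.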
I WILL GIVE YOU UPVOTE ONLY IF YOU DELIVER EXACTLY HOW IT WANTS ABOVE. *if not you get downvote, please read the problem carefully*