Question

Expected Behavior

Write a program, in a file ngrams.py, that behaves as follows:

Use input() (without any argument) to read the name of an input file. Read the contents of this file into a list of words, splitting at whitespace.

For each word in this list of words, strip away any punctuation at the beginning or end of the word (see below). Discard any empty strings that may result from stripping punctuation characters.

Use input() (without any argument) to read in an integer value n.

Construct n-grams as described here and count the number of occurrences of each distinct n-gram. These counts should be maintained in a case-insensitive way, i.e., "hello", "HELLO", and "hElLo" should all be considered to be the same word.

Find the maximum number of times M any n-gram occurs. Print out all n-grams that occur M times, one per line, as described under Output below. Some examples are given here.
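The counting step can be sketched as below. This is only an illustration, not the required class-based design: it assumes an n-gram is a run of n consecutive words (the linked definition is not reproduced here), and the helper names count_ngrams and most_frequent are hypothetical. Rejecting n < 1 with an assert is also an assumption, since the behavior for n = 0 is not specified.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count n-grams (runs of n consecutive words), ignoring case."""
    # Assumption: n < 1 is treated as a violated pre-condition.
    assert isinstance(n, int) and n >= 1, "n must be a positive integer"
    lowered = [w.lower() for w in words]
    counts = Counter()
    # The range is empty when there are fewer than n words,
    # so such inputs simply produce no n-grams.
    for i in range(len(lowered) - n + 1):
        counts[tuple(lowered[i:i + n])] += 1
    return counts


def most_frequent(counts):
    """Return (M, ngrams): the highest count and every n-gram that reaches it."""
    if not counts:
        return 0, []
    m = max(counts.values())
    return m, [ngram for ngram, c in counts.items() if c == m]


words = ["the", "cat", "The", "CAT", "sat"]
print(most_frequent(count_ngrams(words, 2)))  # (2, [('the', 'cat')])
```

Storing each n-gram as a tuple of lower-cased words keeps the keys hashable and makes "hello", "HELLO", and "hElLo" collapse to the same entry.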

Before computing n-grams, the list of words obtained from the input file should be "cleaned". This can be done as follows:

Remove any punctuation characters from either end of each word (punctuation characters that are not at either end of a word need not be removed). For example, the list of words

['"Uh-oh!"', 'she', 'thought.', '"Trouble!"']

would give the following list after cleaning:

['Uh-oh', 'she', 'thought', 'Trouble']

Discard any empty strings, which can result from "cleaning" strings consisting entirely of punctuation characters.
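One way to do this cleaning is sketched below. The function name clean_words is hypothetical, and using string.punctuation (the ASCII punctuation characters) is an assumption, since the exact set of punctuation characters is not listed here.

```python
import string


def clean_words(words):
    """Strip punctuation from both ends of each word and drop empty results."""
    # Assumption: "punctuation" means the ASCII set in string.punctuation.
    # str.strip only removes characters at the ends, so the hyphen in
    # "Uh-oh" and the apostrophe in "doesn't" are left alone.
    stripped = [w.strip(string.punctuation) for w in words]
    return [w for w in stripped if w != ""]


print(clean_words(['"Uh-oh!"', 'she', 'thought.', '"Trouble!"', '--']))
# ['Uh-oh', 'she', 'thought', 'Trouble']
```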

Input

The input will be a file consisting of ordinary text and punctuation. An example is given here.

Output

The output of your program should be a sequence of those n-grams that occur the largest number of times in the input file. N-grams should be printed out one per line according to the following format:

"{:d} -- {}".format(M, ngram_string)

where:

M is the maximum number of times any n-gram occurs, and

ngram_string is a string obtained by concatenating the words in the n-gram, with a single blank character separating each pair of adjacent words, and no additional blank characters at the beginning or end of the string.

Some examples are given here.
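As a small illustration of the output format, the sketch below builds one output line. It assumes each n-gram is held as a sequence of already lower-cased words; the helper name format_ngram is hypothetical.

```python
def format_ngram(m, ngram):
    """Build one output line for an n-gram that occurs m times."""
    # " ".join puts exactly one blank between adjacent words and
    # none at the beginning or end of the string.
    ngram_string = " ".join(ngram)
    return "{:d} -- {}".format(m, ngram_string)


print(format_ngram(7, ("andy", "biggs")))  # 7 -- andy biggs
```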

Assertions

Your program should use assert statements to check the following (however, see below for replacements for asserts in situations where asserts are difficult to state).

For each method, any pre-conditions on the arguments for that method. The assert should be placed at or very soon after the beginning of the method.

For each loop that either (i) computes a value; or (ii) transforms some data, at least one assert that holds in the body of the loop. You can choose what the invariant is, but it must be something that:

reflects the computation of the loop; and

is not simply a statement of the iteration condition (or some expression whose value follows directly from the iteration condition).

Asserts are not necessary for loops that neither compute values nor transform data, e.g., loops that simply read in data or print out results.

This level of assertion-checking is almost certainly overkill, but we'll do this for a little while in order to get more experience with pre-conditions and loop invariants and to practise working with assert statements.

Try to make your asserts as precise and specific as you can.

Replacements for asserts

In some situations, it may be difficult or impossible to write an assert that captures what you want to capture. In such situations, in place of an assert you can write a comment giving the invariant or assumption you want to state. Such a comment should be written as follows:

### INVARIANT: ...your invariant in English and/or Python

or

### ASSUMPTION: ...your assumption in English and/or Python
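For example, a loop that computes the largest count might carry a pre-condition and a loop invariant along the following lines. This is a sketch only: the function name max_count and the choice of invariant are illustrative, and other invariants would satisfy the requirement equally well.

```python
def max_count(counts):
    """Return the largest value in a dict of n-gram counts (0 if it is empty)."""
    # Pre-condition: every stored count is a positive integer.
    assert all(isinstance(c, int) and c >= 1 for c in counts.values())
    values = list(counts.values())
    best = 0
    for i, c in enumerate(values):
        if c > best:
            best = c
        # Invariant: best is the maximum of the values examined so far.
        # (Recomputing max() on every pass is slow; it is here only to
        # state the invariant precisely.)
        assert best == max(values[:i + 1])
    ### ASSUMPTION: counts is not modified while this loop is running
    return best


print(max_count({("the",): 3, ("cat",): 1}))  # 3
```

The assert in the loop describes what the loop has computed so far rather than restating the iteration condition, which is the distinction drawn above.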

Programming Requirements

Your program should implement (at least) the following classes along with the methods specified.

class Input

An object of this class represents information about the input data. This class should implement (at least) the following methods:

__init__(self) : uses input() to read in the name of an input file, opens the file, and initializes an attribute of the object with the file object returned by open().

wordlist(self) : reads in the contents of the input file, splits it into a list of words, removes any punctuation at either end of the word (punctuation characters inside the word should not be removed), discards any empty strings that may result, and returns the resulting list.

class Ngrams

An object of this class represents n-gram information about the input list of words. This class should implement (at least) the following methods:

__init__(self) : uses input() to read in the value of n for computing n-grams and initializes an attribute of the object with this value. It also initializes the data structure that will be used to keep track of the number of times each n-gram occurs in the input.

update(self, ngram) : updates the count for the n-gram ngram.

process_wordlist(self, wordlist) : processes the list of words wordlist read from the input file to compute the number of times each n-gram occurs in wordlist.

print_max_ngrams(self) : prints out the n-grams that have the highest number of occurrences according to the output format specified above (see Output).
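A possible skeleton for the Ngrams class is sketched below. The attribute names _n and _counts, the use of a plain dict keyed by tuples, the pre-condition that n is at least 1, and printing nothing when there are no n-grams are all assumptions, not requirements stated above. The order in which tied n-grams are printed is likewise not specified; this sketch prints them in dictionary insertion order.

```python
class Ngrams:
    def __init__(self):
        # Read n with a bare input() call, as the specification requires.
        self._n = int(input())
        # Maps each n-gram (a tuple of lower-cased words) to its count.
        self._counts = {}

    def update(self, ngram):
        # Pre-condition: ngram is a tuple of exactly n words.
        assert isinstance(ngram, tuple) and len(ngram) == self._n
        self._counts[ngram] = self._counts.get(ngram, 0) + 1

    def process_wordlist(self, wordlist):
        # Pre-conditions: words are cleaned, non-empty strings.
        assert all(isinstance(w, str) and w != "" for w in wordlist)
        # Assumption: n < 1 is rejected rather than handled specially.
        assert self._n >= 1
        words = [w.lower() for w in wordlist]
        before = sum(self._counts.values())
        for i in range(len(words) - self._n + 1):
            self.update(tuple(words[i:i + self._n]))
            # Invariant: one occurrence has been recorded per window so far.
            assert sum(self._counts.values()) == before + i + 1

    def print_max_ngrams(self):
        if not self._counts:
            return  # Assumption: print nothing when no n-grams exist.
        m = max(self._counts.values())
        for ngram, count in self._counts.items():
            if count == m:
                print("{:d} -- {}".format(m, " ".join(ngram)))
```

With n entered as 2 and the word list ["the", "cat", "The", "CAT", "sat"], print_max_ngrams would print the single line "2 -- the cat".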

Examples

Several examples of this data analysis are given here.

Testing

You may want to consider some of the following for black-box testing purposes:

empty input file

input file has fewer words than the value of n specified for n-grams

n = 0

input words on just a single line

input words one per line on multiple lines

n-grams where some of the words have punctuation on either side

n-grams where some of the words have upper/lower-case differences

In addition, you should examine your code to identify inputs for white-box testing.

nyt02.txt

All most people want from American Family Publishers is a big check delivered by Ed McMahon , but Geraldine Biggs would like an apology as well .While Biggs awaits word of a dying sister in Illinois , she is being besieged by phone calls from people across the country who think she is connected to the Andy Biggs of Phoenix who won $ 10 million from the sweepstakes .As part of their advertising campaign , American Family sent out notices encouraging people to call Andy Biggs of Phoenix to learn how he got the big payoff .Trouble is , Andy Biggs is not listed in the phone book but G.A. Biggs Geraldine is ." I 've got people calling me all the time .I tell them that I 'm not that person but they still call me , " she said .The Phoenix woman said she had a sprinkling of calls for a couple of weeks but the deluge started last week .Sweepstakes operators usually start marketing their contests heavily after Christmas ." That doesn't always sweeten my disposition , " she said .Biggs has been called from New Hampshire , Ohio , South Carolina , Georgia , Kentucky , Arkansas , Florida , California and Oregon .The calls have ranged from the touching to the conniving , including one fellow who tried to convince her he was Andy Biggs ' cousin ." He didn't know anything about the family , " she said ." He was very young and very stupid and hadn't prepared his answers very well . "Other calls were more heartwarming .One came from a little girl whose mother had cancer and whose father couldn't work because he had injured his back .All the girl wanted was some advice on how to win the sweepstakes .Biggs , who doesn't really believe in the power of the sweepstakes , told her to fill out the form carefully and make sure she put a stamp on it .But the callers that really frost Briggs are those who don't believe her when she says she doesn't know Andy Biggs ." They act like I 'm hiding something from them , but I 'm not .But if I knew him , would I be here ? 
I 'd own my own Lear jet and not be here . "Sure Biggs would like Andy Biggs ' money but she also covets his anonymity .She has called the Better Business Bureau , the Arizona Attorney General 's Office , US West and direct-mail associations to get the phones to stop ringing ." They all tell me there 's nothing to do . "American Family Publishers didn't return phone calls from the media .By now , Biggs isn't so so sure there 's an Andy Biggs out there and she 's not too keen on the sweepstakes , either .On Saturday , she got her own " Wait , you could already be a winner " invitation to the American Family 's sweepstakes .She threw it away ." I want to be left alone , " she said ." No , I want a letter of apology from those two ugly baboons on television McMahon and Dig Clark , American Family spokesmen .If it 's not bad enough , with the phone ringing I 've got to see them on television . "Biggs realizes she sounds cross but she is frustrated by the calls .There 's only one phone call she is interested in : news of her sister who is expected to die soon . 

Expected output:

Input: nyt02.txt

n = 1
26 -- the

n = 2
7 -- andy biggs

n = 3
2 -- american family publishers
2 -- biggs would like
2 -- i m not
2 -- phone calls from
2 -- biggs of phoenix
2 -- andy biggs of
2 -- i ve got
