Question
The notion of n-grams (where n is a number, e.g., for n = 3, we have 3-grams, for n = 4, we have 4-grams, etc.)
The notion of n-grams (where n is a number, e.g., for n = 3, we have 3-grams, for n = 4, we have 4-grams, etc.) is a simple idea that has many applications, including computational linguistics, speech recognition, machine translation, plagiarism detection, genetic sequence analysis, and so on (see this Wikipedia page). This program involves writing a program to compute n-grams and count the number of times each n-gram is encountered.
An illustrative example explaining the notion of n-grams, together with an algorithm for computing n-grams, are given here.
Expected Behavior
Write a program, in a file ngrams.py, that behaves as follows:
Use input() (without any argument) to read the name of an input file. Read the contents of this file into a list of words, splitting at whitespace.
For each word in this list of words, strip away any punctuation at the beginning or end of the word (see below). Discard any empty strings that may result from stripping out of punctuation characters.
Use input() (without any argument) to read in an integer value n.
Construct n-grams as described here and count the number of occurrences of each distinct n-gram. These counts should be maintained in a case-insensitive way, i.e., "hello", "HELLO", and "hElLo" should all be considered to be the same word.
Find the maximum number of times M any n-gram occurrs. Print out all n-grams that occur M times, one per line, as described under Output below. Some examples are given here.
Before computing n-grams, the list of words obtained from the input files should be "cleaned". This can be done as follows:
Remove any punctuation characters from either end of each word (punctuation characters that are not at the end of a word need not be removed. For example, the list of words
['"Uh-oh!"', 'she', 'thought.', '"Trouble!"']
would give the following list after cleaning:
['Uh-oh', 'she', 'thought', 'Trouble']
Discard any empty strings, which can result from "cleaning" strings consisting entirely of punctuation characters.
Input
The input will be a file consisting of ordinary text and punctuation. An example is given here.
Output
The output of your program should be a sequence of those n-grams that occur the largest number of times in the input file. N-grams should be printed out one per line according to the following format:
"{:d} -- {}".format(M, ngram_string)
where:
M is the maximum number of times any n-gram occurrs, and
ngram_string is a string onbained by concatenating together the words in the n-gram, with a single blank letter separating each pair of adjacent words, and no additional blank characters at the beginning or end of the string.
Some examples are given here.
Assertions
Your program should use assert statements to check the following (however, see below for replacements for asserts in situations where asserts are difficult to state).
For each method, any pre-conditions on the arguments for that method. The assert should be placed at or very soon after the beginning of the method.
For each loop that either (i) computes a value; or (ii) transforms some data, at least one assert that holds in the body of the loop. You can choose what the invariant is, but it must be something that:
reflects the computation of the loop; and
is not simply a statement of the iteration condition (or some expression whose value follows directly from the iteration condition).
Asserts are not necessary for loops that neither compute values nor transform data, e.g., loops that simply read in data or print out results.
This level of assertion-checking is almost certainly overkill, but we'll do this for a little while in order to get more experience with pre-conditions and loop invariants and to practise working with assert statements.
Try to make your asserts as precise and specific as you can.
Replacements for asserts
In some situations, it may be difficult or impossible to write an assert that captures what you want to capture. In such situations, in place of an assert you can write a comment giving the invariant or assumption you want to state. Such a comment should be written as follows:
### INVARIANT: ...your invariant in English and/or Python
or
### ASSUMPTION: ...your assumption in English and/or Python
Programming Requirements
Your program should implement (at least) the following classes along with the methods specified.
class Input
An object of this class represents information about the input data. This class should implement (at least) the following methods:
__init__(self) : uses input() to read in the name of an input file, opens the file, and initializes an attribute of the object with the file object returned by open().
wordlist(self) : reads in the contents of the input file, splits it into a list of words, removes any punctuation at either end of the word (punctuation characters inside the word should not be removed), discards any empty strings that may result, and returns the resulting list.
class Ngrams
An object of this class represents n-gram information about the input list of words. This class should implement (at least) the following methods:
__init__(self) : uses input() to read in the value of n for computing n-grams and initializes an attribute of the object with this value. It also initializes the data structure that will be used to keep track of the number of times each n-gram occurs in the input.
update(self, ngram) : updates the count for the n-gram ngram.
process_wordlist(self, wordlist) : processes the list of words wordlist read from the input file to compute the number of times each n-gram occurs in wordlist.
print_max_ngrams(self) : prints out the n-grams that have the highest number of occurrences according to the output format specified above (see Output).
Examples
Several examples of this data analysis are given here.
Testing
You may want to consider some of the following for black-box testing purposes:
empty input file;
input file has fewer words than the value of n specied for n-grams
n = 0
input words on just a single line
input words one per line on multiple lines
n-grams where some of the words have punctuation on either side
n-grams where some of the words have upper/lower-case differences
In addition, you should examine your code to identify inputs for white-box testing.
Examples
Input: in0.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
2 -- bb 2 -- cc 2 -- aa | 2 -- bb cc 2 -- aa bb | 2 -- aa bb cc |
Input: in1.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
6 -- ccc | 4 -- aaa bbb 4 -- bbb ccc | 4 -- aaa bbb ccc |
Input: in2.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
11 -- that 11 -- the | 3 -- the people 3 -- to the 3 -- we cannot 3 -- it is | 2 -- dedicated to the |
Input: in3.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
58 -- the | 7 -- of the | 2 -- cause of the 2 -- war rather than 2 -- rather than let 2 -- by whom the 2 -- this interest was 2 -- whom the offense 2 -- the cause of 2 -- drawn with the |
Input: nyt01.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
15 -- the | 3 -- in the | 2 -- in december 1995 |
Input: nyt02.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
26 -- the | 7 -- andy biggs | 2 -- american family publishers 2 -- biggs would like 2 -- i m not 2 -- phone calls from 2 -- biggs of phoenix 2 -- andy biggs of 2 -- i ve got |
Input: nyt21.txt Output:
n = 1 | n = 2 | n = 3 |
---|---|---|
381 -- the | 38 -- in the | 7 -- the rose bowl |
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started