Question
An n-gram, in the context of parsing natural languages such as English, is a sequence of n consecutive tokens (which we might define as characters separated by whitespace) from some passage of text. Based on the following passage:

'I really really like cake.'

we have the following 2-grams:

('I', 'really')
('really', 'really')
('really', 'like')
('like', 'cake.')

And the following 3-grams:

('I', 'really', 'really')
('really', 'really', 'like')
('really', 'like', 'cake.')

Note that we omit a 1-gram listing, because it would merely be a list of all the tokens in the original text.
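The listings above can be produced mechanically by zipping the token list against shifted copies of itself. A minimal sketch (the helper name list_ngrams is ours, not part of the assignment):

```python
def list_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as tuples."""
    # zip stops at the shortest iterable, so the shifted copies
    # naturally cut the listing off at the last full n-gram.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'I really really like cake.'.split()

assert list_ngrams(tokens, 2) == [
    ('I', 'really'), ('really', 'really'),
    ('really', 'like'), ('like', 'cake.')]

assert list_ngrams(tokens, 3) == [
    ('I', 'really', 'really'),
    ('really', 'really', 'like'),
    ('really', 'like', 'cake.')]
```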
Among other things, n-grams are useful for describing the vocabulary of, and the statistical correlation between, tokens in a sample body of text (e.g., as taken from a book). We can use an n-gram model to determine the likelihood of finding a particular sequence of words after another. This information, in turn, can be used to generate passages of text that statistically mimic the sample.
We can convert the 3-gram list above into the following lookup structure (i.e., a dictionary mapping strings to lists of tuples), where the first token of each n-gram maps to all the sequences that follow it in the text:

'I': [('really', 'really')]
'really': [('really', 'like'), ('like', 'cake.')]

We can now generate passages of text using the following method:
1. Select a random key and use it as the start token of the passage. It will also serve as the current token for the next step.
2. Select a random tuple from the list associated with the current token and append the sequence to the passage. The last token of the selected sequence will be the new current token.
3. If the current token is a key in the dictionary, then simply repeat step 2; otherwise, select another random key from the dictionary as the current token and append it to the passage before repeating step 2.
E.g., we might start by selecting 'I' in step 1, which gives us ('really', 'really') as our only choice in step 2. The second 'really' in that tuple is the new current token (which is a valid key), which takes us back to step 2 and gives us a choice between two tuples. If we choose ('like', 'cake.'), then we have 'cake.' as our new current token; it is not a key in the map, however, so we'd have to choose a new random key if we wanted to generate a longer passage. Either way, the passage we've generated thus far is 'I really really like cake.' (which also happens to be the original passage).
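The three steps above can be sketched as a loop over the toy dictionary. The helper name demo_generate and its parameters are ours, chosen for illustration; the assignment's own gen_passage comes later:

```python
import random

# Toy 3-gram dictionary built from "I really really like cake."
ngram_dict = {'I': [('really', 'really')],
              'really': [('really', 'like'), ('like', 'cake.')]}

def demo_generate(d, min_len, rng):
    curr = rng.choice(sorted(d))          # step 1: random start key
    passage = [curr]
    while len(passage) < min_len:
        seq = rng.choice(d[curr])         # step 2: random tuple for current token
        passage.extend(seq)
        curr = seq[-1]
        if curr not in d:                 # step 3: dead end, pick a fresh key
            curr = rng.choice(sorted(d))
            passage.append(curr)
    return ' '.join(passage)

print(demo_generate(ngram_dict, 5, random.Random(0)))
```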
Here's a lengthier passage that could be generated from the 3-gram dictionary above (a new random key is selected every time the previous token isn't a key in the dictionary):

really like cake. I really really really like really like cake. I really really really like really

This gets more interesting when we build n-gram dictionaries from lengthier bodies of text. For instance, the following text was generated (with a little programmed embellishment for prettier capitalization and punctuation) from a 3-gram dictionary extracted from Romeo's famous balcony monologue:
Lamp her eyes were there they in their spheres till they in her eyes in all the fairest stars in all the heaven having some business do wear it is my love! O it is envious her cheek would through the heaven having some business do entreat her eyes were there they in their spheres till they in her eyes to
For reference, here is the dictionary entry for the token 'her' used to generate the above:

'her': [('maid', 'art'),
        ('maid', 'since'),
        ('vestal', 'livery'),
        ('eyes', 'to'),
        ('eyes', 'were'),
        ('head', 'The'),
        ('cheek', 'would'),
        ('eyes', 'in'),
        ('cheek', 'upon'),
        ('hand', 'O')]
Your assignment is to implement a function that constructs an n-gram dictionary from a list of strings (tokens), and another that returns a passage of text generated from a given n-gram dictionary. All your code will go into the ngrams.py source file.
Your first task is to implement compute_ngrams, which will take a list of tokens and a value n indicating the n-gram length (e.g., n=2 for 2-grams), and return an n-gram dictionary. The keys in the returned dictionary should all be strings, whose values will be lists of one or more tuples. Note that even in the case of n=2 (which would be the minimum value), the dictionary should map strings to lists of 1-tuples, i.e., instead of to lists of individual tokens.
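One way compute_ngrams might be sketched (the argument order and the use of setdefault are our choices; the spec above fixes only the behavior):

```python
def compute_ngrams(tokens, n):
    """Map each n-gram's first token to a list of (n-1)-tuples
    of the tokens that follow it in the text."""
    d = {}
    for gram in zip(*(tokens[i:] for i in range(n))):
        # gram[0] is the key; the remaining n-1 tokens form the value tuple
        d.setdefault(gram[0], []).append(gram[1:])
    return d

toks = 'I really really like cake.'.split()
assert compute_ngrams(toks, 3) == {
    'I': [('really', 'really')],
    'really': [('really', 'like'), ('like', 'cake.')]}

# Even for n=2, values are lists of 1-tuples, not bare tokens:
assert compute_ngrams(toks, 2)['like'] == [('cake.',)]
```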
Next, you will implement gen_passage, which will take an n-gram dictionary and a length for the passage to generate (as a token count). It will apply the method described earlier:

1. Select a random key from the dictionary and use it as the start token of the passage. It will also serve as the current token for the next step.
2. Select a random tuple from the list associated with the current token and append the sequence to the passage. The last token of the selected sequence will be the new current token.
3. If the current token is a key in the dictionary, then simply repeat step 2; otherwise, select another random key from the dictionary as the current token and append it to the passage before repeating step 2.
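A possible sketch of gen_passage, under the assumption that the requested length is an exact token count (the spec doesn't say whether to truncate or stop short, so this version truncates):

```python
import random

def gen_passage(ngram_dict, length):
    """Generate a passage of exactly `length` tokens from an n-gram dict."""
    curr = random.choice(list(ngram_dict))      # step 1: random start key
    passage = [curr]
    while len(passage) < length:
        seq = random.choice(ngram_dict[curr])   # step 2: a tuple following curr
        passage.extend(seq)
        curr = seq[-1]
        if curr not in ngram_dict:              # step 3: re-seed with a new key
            curr = random.choice(list(ngram_dict))
            passage.append(curr)
    return ' '.join(passage[:length])           # trim any overshoot
```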