Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 23, 2024

*PLEASE ANSWER IN PYTHON ONLY*** Learner Objectives ----------------- At the conclusion of this programming assignment, participants should be able to: Implement hash tables and hash

*******PLEASE ANSWER IN PYTHON ONLY*********

Learner Objectives -----------------

At the conclusion of this programming assignment, participants should be able to:

Implement hash tables and hash functions

Linear probing

Chaining

Implement maps

Understand a unigram language model

Prerequisites--------------------------

Before starting this programming assignment, participants should be able to:

Write object-oriented code in Python

Work with lists

Acknowledgments---------------------------

Content used in this assignment is based upon information in the following sources:

Predictive keyboard assignment from Montana Tech

Overview and Requirements-------------------------------

Natural language processing (NLP) refers to computational technique involving language. It is a broad field. For this assignment, we will learn a bit about NLP to build a predictive keyboard. The keyboard learns to make predictions by text given at startup and by things the user types into the program. As users type, the program predicts the most likely word given the currently entered word prefix. So for example, if the user has typed "th", the keyboard will probably predict "the".

We will use a hash map to implement the keyboard, which is crucial to making the keyboard's learning and predictions fast and efficient. For a hash map, we will use Python's built in dictionary object and our own Map class.

Note: for this assignment, do not use Jupyter Notebook to code your solution. Use standard .py files.

Program Details---------------------------------

Background: Language Modeling 101

Our predictive keyboard will make use of a very simple model of language, a unigram language model. Our keyboard will keep track of every unique word it has seen. For each unique word, it will count how many times that word has occurred. A simple maximum likelihood estimate for the unigram probability of a word is to take the number of times you've seen that word occur divided by the total number of words you've encountered. For example, here is a small bit of text:

The cat is in the corner of their garage.

There are 9 total words in our text. Ignoring case and punctuation, the word the appears twice. The unigram probability of "the" is thus P(the)=2/9P(the)=2/9. The unigram probability of "cat" is P(cat)=1/9P(cat)=1/9. The other six words: is, in, corner, of, their, garage all also have a probability of 1/9. If a word has never been seen, its probability is zero: P(zebra)=0/9=0.0P(zebra)=0/9=0.0.

Given the data, if the user were to type in "c", both "cat" and "corner" are equally likely. But if they type "ca", we would predict "cat", while if they typed "co" we'd predict "corner". If the user typed "ci", we would make not prediction (since none of our words start with "ci").

Datasets----------------------

Download the following text files (keyboard_files.zip):

moby_start.txt: The first paragraph from Moby-Dick (202 words). -listed at bottom

moby_end.txt: The last two paragraphs from Moby-Dick (232 words). - listed at bottom

mobydick.txt: The entire book Moby-Dick (209K words). - too long to include in chegg question

wiki_200k.txt: 200,000 sentences from Wikipedia (3.8M words). -too long to include in chegg question

Classes to Define---------------

Map

Implement a Map class that inherits from a HashTable class (see the class notes for much of this code). Your HashTable class should implement chaining for collision handling.

Note: I recommend first solving this assignment using the built in Python dictionary object. Once that is working, then replace the dictionaries with your Map objects.

DictEntry

We need a class to represent a word and its unigram probability. Here is the API to DictEntry:

# create a new entry given a word and probability __init__(word, prob) # string, float  # getter for the word  get_word() # returns string # getter for the probability get_prob() # returns float # does this word match the given pattern? match_pattern(pattern) # returns string

WordPredictor

The brains of this whole operation is the class WordPredictor. This class learns how to make word predictions based on being shown training data. It can learn from all the words in a file (via the train() method) or from a single individual word (via the train_word() method). After new training data is added, the build() method must be called so the class can recompute the most likely word for all possible prefixes. Here is the API for the WordPredictor class:

# train the unigram model on all the words in the given file train(training_file) # string # train on just a single word train_word(word) # string # get the number of total words we've trained on get_training_count() # returns integer # get the number of times we've seen this word (0 if never) get_word_count(word) # string. returns integer # recompute the word probabilities and prefix mapping build() # return the most likely DictEntry object given a word prefix get_best(prefix) # string. returns DictEntry

Training the Model-----------------------------

Model training occurs in the train() and train_word() methods. train() should parse out each word in the specified file on disk. If the file cannot be read, it should print out an error, "Could not open training file: file.txt". All training words should be converted to lowercase and stripped of any characters that are not a-z or the single apostrophe. During training, you need to update the instance variables:

word_to_count: Map of string (word key) to integer (count value) pairs

total: Integer

The word_to_count instance variable is a Map where the keys are the unique words encountered in the training data. The values are the integer count of how many times we've seen each word in the data. Only words seen in the training data will have an entry in the word_to_count Map. The total instance variable tracks how many words of training data we have seen. That is, total should equal the sum of all integer counts stored in your map. Training is cumulative for repeated calls to train() and train_word(), so you just keep increasing the counts stored in your word_to_count map and your total count of training words.

Making Predictions----------------------------------

The class needs to be able to predict the most probable word for a given word prefix give the statistics found during training. For example, after training on mobydick.txt, if the user types the prefix "wh", "wha", "whal" or "whale", the most likely word is "whale". The job of mapping word prefix strings to the most probable dictionary entry falls to the third instance variable:

prefix_to_entry: Map of string (prefix key) to DictEntry (value) pairs

A Map can map multiple keys to the same value. This is exactly what we need in this case since a set of strings (such as "wh", "wha", "whal" and "whale") may all need to map to a single DictEntry object. You need to ensure that each prefix maps to its most likely word. So even if the word "thespian" was encountered first in the training data, for any sensible training text, "th" should probably map to "the" and not "thespian".

Every time the build() method is called, you need to re-initialize the contents of prefix_to_entry. This ensures that you take into account any new words or updated counts present in word_to_count. Here is an overview of the algorithm you should be aiming for in build():

Loop over all possible (key, value) pairs in word_to_count.

For each pair: Calculate the probability of that word given its count and the number of training words. Create a new DictEntry object representing this word and its probability.

For each pair: Loop over all possible prefixes of the word. That is from a prefix of only the first letter to a "prefix" that is the entire word.

For each prefix, if the current key (word) has a probability strictly greater than what is currently stored for that prefix, replace the old value (a DictEntry object) with the new object.

This results in our map containing a mapping from every valid prefix of every training word to the most probable complete word under the unigram language model. Your get_best() method can be a one-liner!

Testing the Prediction Class---------------------------------------

The first part of main() tests out your training routines, while the second half focuses on the prediction part. Here is the output you can use to compare with for correctness (not all output will match exactly, think about this):

bad1 = null training words = 202 bad2 = null Could not open training file: thisfiledoesnotexist.txt training words = 202 count, the = 10 count, me = 5 count, zebra = 0 count, ishmael = 1 count, savage = 0 count, the = 32 count, me = 5 count, zebra = 0 count, ishmael = 1 count, savage = 1 a -> and 0.03917050691244239 ab -> about 0.004608294930875576 b -> bird 0.004608294930875576 be -> before 0.002304147465437788 t -> the 0.07373271889400922 th -> the 0.07373271889400922 archang -> archangelic 0.002304147465437788 training words = 434 before, b -> bird 0.004608294930875576 before, pn -> null after, b -> bird 0.004576659038901602 after, pn -> pneumonoultramicroscopicsilicovolcanoconiosis 0.002288329519450801 training words = 437 a -> and 0.030079417288752873 ab -> about 0.001472985727769356 b -> but 0.008580499385064212 be -> be 0.0048908846494866 t -> the 0.06757143265738066 th -> the 0.06757143265738066 archang -> archangel 3.3368608719694154E-5 training words = 209778 elapsed (s) : 0.459 Random load test: elapsed (s) : 3.447 Hit % = 30.0537

Grading Guidelines -----------------------------------------------------------------

This assignment is worth 100 points + 10 points bonus. Your assignment will be evaluated based on a successful compilation and adherence to the program requirements. We will grade according to the following criteria:

15 pts for correct Map implementation

15 pts for correct chaining collision handling

10 pts for correct DictEntry implementation

5 pts for correct get_word() and get_prob()

5 pts for correct math_pattern()

40 pts for correct WordPredictor implementation

10 pts for correct train()

10 pts for correct train_word()

10 pts for correct build()

5 pts for correct get_best()

5 pts for correct get_word_count() and get_training_count()

10 pts for correct definition and use of the word_to_count and prefix_to_entry maps

5 pts for defining additional functions/methods beyond those required in the assignment specifications.

5 pts for adherence to proper programming style and comments established for the class

moby_end.txt ------------------

But as the last whelmings intermixingly poured themselves over the sunken head of the Indian at the mainmast, leaving a few inches of the erect spar yet visible, together with long streaming yards of the flag, which calmly undulated, with ironical coincidings, over the destroying billows they almost touched;- at that instant, a red arm and a hammer hovered backwardly uplifted in the open air, in the act of nailing the flag faster and yet faster to the subsiding spar. A sky-hawk that tauntingly had followed the main-truck downwards from its natural home among the stars, pecking at the flag, and incommoding Tashtego there; this bird now chanced to intercept its broad fluttering wing between the hammer and the wood; and simultaneously feeling that etherial thrill, the submerged savage beneath, in his death-gasp, kept his hammer frozen there; and so the bird of heaven, with archangelic shrieks, and his imperial beak thrust upwards, and his whole captive form folded in the flag of Ahab, went down with his ship, which, like Satan, would not sink to hell till she had dragged a living part of heaven along with her, and helmeted herself with it.

Now small fowls flew screaming over the yet yawning gulf; a sullen white surf beat against its steep sides; then all collapsed, and the great shroud of the sea rolled on as it rolled five thousand years ago.

moby_start.txt-------------------------

Loomings

Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off- then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.

mobydick.txt and wiki_200k too long to include