Question
This problem gets us started using a dictionary to hold a document index. Remember the keys are search terms and the values are a list of documents containing that term. If we have a corpus, we can normalize and tokenize a document to get the tokens/search terms it contains. If we know a document's id, the logic of building an index is something like:
initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        add current document id to token's index entry
For this problem, a corpus of documents is stored in a list. Each element is a string containing a document's text, and a document's id is its index in the list. The next two cells contain code that will populate the list in your Jupyter notebook when you evaluate them.
import pickle
corpus = pickle.load(open("/usr/local/share/i427_dictionary_hw_corpus.p","rb"))
corpus
['I427 Search Informatics', 'I308 Information Representation', 'I101 Introduction to Informatics', 'Information Systems']
The only issue with the pseudocode above is that a new token needs to be added to the index if it's not already there. Let's update the pseudocode to handle that case:
initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        if token is not in the index:
            initialize token's entry in the index to an empty collection
        add current document id to token's index entry
For this problem, create a dictionary document_index that has the vocabulary of corpus as keys; for each vocabulary word, the value is the list of ids of the documents containing that word. The final answer (the contents of document_index) is shown at the end to help you visualize the data structure and check whether your code worked.
Some other hints/tips:
- split on whitespace, as we've seen for tokenization
- convert to lowercase, as we've seen for normalization
- use a document's index in the corpus as its id: 0 for the first document, 1 for the second, and so on
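If you want to factor the first two hints into a helper, a minimal sketch (the name tokenize is my own choice, matching the skeleton below; it is not given by the assignment):

```python
def tokenize(document):
    """Lowercase a document and split it on whitespace (per the hints above)."""
    return document.lower().split()

tokenize("I427 Search Informatics")  # → ['i427', 'search', 'informatics']
```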
document_index = {}
for doc_id, document in enumerate(corpus):
    normalized_tokens = document.lower().split()  # normalize and tokenize; gives a list
    for token in normalized_tokens:
        if token not in document_index:
            document_index[token] = []            # empty list for the value
        document_index[token].append(doc_id)      # add current document id to token's entry
print(document_index)
{'i427': [0], 'search': [0], 'informatics': [0, 2], 'i308': [1], 'information': [1, 3], 'representation': [1], 'i101': [2], 'introduction': [2], 'to': [2], 'systems': [3]}
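As an aside, the membership check can be avoided with collections.defaultdict, which creates the empty list automatically the first time a key is touched. A sketch using the small corpus from this problem:

```python
from collections import defaultdict

corpus = ['I427 Search Informatics', 'I308 Information Representation',
          'I101 Introduction to Informatics', 'Information Systems']

document_index = defaultdict(list)              # missing keys start as empty lists
for doc_id, document in enumerate(corpus):
    for token in document.lower().split():      # normalize, then tokenize
        document_index[token].append(doc_id)

print(dict(document_index))                     # same contents as shown above
```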