
Question



This problem gets us started using a dictionary to hold a document index. Remember that the keys are search terms and each value is a list of the documents containing that term. If we have a corpus, we can normalize and tokenize a document to get the tokens/search terms it contains. If we know a document's id, the logic of building an index is something like:

initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        add current document id to token's index entry

For this problem, a corpus of documents is stored in a list: each element is a string containing one document's text, and a document's position in the list serves as its id. The next two cells contain code that will populate the corpus in your Jupyter notebook when you evaluate them.

import pickle

corpus = pickle.load(open("/usr/local/share/i427_dictionary_hw_corpus.p","rb"))

corpus

['I427 Search Informatics', 'I308 Information Representation', 'I101 Introduction to Informatics', 'Information Systems']
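Since a document's id is just its position in this list, Python's built-in enumerate can pair each document with its id. As a quick illustration (not part of the required code):

for doc_id, text in enumerate(corpus):
    print(doc_id, text)

0 I427 Search Informatics
1 I308 Information Representation
2 I101 Introduction to Informatics
3 Information Systems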

The only issue with the pseudocode above is that a new token needs to be added to the index if it isn't already there. Let's update the pseudocode to handle that case:

initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        if token is not in the index:
            initialize the token's entry in the index to an empty collection
        add current document id to token's index entry
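To see why the membership check is needed, here is a small Python illustration (a toy example with one hypothetical token, not the assignment code itself):

index = {}
# index['informatics'].append(0)  # would raise KeyError: no entry for 'informatics' yet
if 'informatics' not in index:
    index['informatics'] = []     # initialize the token's entry to an empty list
index['informatics'].append(0)    # now the append succeeds; index is {'informatics': [0]}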

For this problem, create a dictionary document_index that has the vocabulary in corpus as keys; for each vocabulary word, the value will be the list of ids of the documents containing that word. The final answer (the contents of document_index) is shown at the end to help you visualize the data structure and determine whether your code worked.

Some other hints/tips:

- split on whitespace like we've seen for tokenization (see the example after this list)

- convert to lowercase like we've seen for normalization

- use the document's index in the corpus as its id: 0 for the first doc, 1 for the second, and so on
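For example, applying both steps to the first document in the corpus yields its normalized tokens (a quick sanity check, not part of the required code):

corpus[0].lower().split()

['i427', 'search', 'informatics']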

document_index = {}
for document in corpus:
    normalized_tokens = tokenize(document)  # is a list
    for token in normalized_tokens:
        if token not in document_index:
            document_index[token] = # empty collection for the value
        add current document id to token's index entry

print(document_index)

{'i427': [0], 'search': [0], 'informatics': [0, 2], 'i308': [1], 'information': [1, 3], 'representation': [1], 'i101': [2], 'introduction': [2], 'to': [2], 'systems': [3]}
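For reference, here is one way the skeleton could be completed, assuming corpus is the list loaded earlier and that normalization/tokenization are just lowercasing and whitespace splitting as the hints suggest. This is a sketch of a working approach, not necessarily the official solution:

document_index = {}
for doc_id, document in enumerate(corpus):        # doc_id is 0 for the first doc, 1 for the second, ...
    normalized_tokens = document.lower().split()  # normalize (lowercase), then tokenize (split on whitespace)
    for token in normalized_tokens:
        if token not in document_index:
            document_index[token] = []            # new token: start its entry as an empty list
        document_index[token].append(doc_id)      # record that this document contains the token

Running this against the corpus above reproduces the document_index shown earlier.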

