Question


PYTHON PROBLEM

This problem gets us started using a dictionary to hold a document index. Remember, the keys are search terms and the values are lists of the documents containing each term. Given a corpus, we can normalize and tokenize a document to get the tokens/search terms it contains. If we know a document's id, the logic of building an index is something like:

initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        add current document id to token's index entry
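Translated directly into Python, that pseudocode might look like the sketch below. This is only a sketch under assumptions spelled out later in the problem: the corpus loads as a list of strings (so a document's position serves as its id), and normalizing/tokenizing means lowercasing and splitting on whitespace. It also has a deliberate flaw:

index = {}
for doc_id, document in enumerate(corpus):
    for token in document.lower().split():   # normalize, then tokenize
        index[token].append(doc_id)          # KeyError! no list exists yet for a new token

The last line assumes a list already exists for token, which is false the first time a token is seen. That is exactly the gap addressed a little further down.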

For this problem, the corpus of documents has been pickled to a file. The next two cells contain code that will load it into your Jupyter notebook when you evaluate them. As the output shows, the corpus loads as a list of strings, one per document; a document's position in the list serves as its document id (see the hints below).

import pickle

# load the pickled corpus from disk
corpus = pickle.load(open("/usr/local/share/i427_dictionary_hw_corpus.p", "rb"))

corpus

['I427 Search Informatics', 'I308 Information Representation', 'I101 Introduction to Informatics', 'Information Systems']
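A quick sanity check, assuming the corpus loaded as the list shown above: documents can be pulled out by position, which is the id we'll use.

corpus[0]      # 'I427 Search Informatics'
len(corpus)    # 4 documents, ids 0 through 3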

The only issue with the pseudocode above is that a new token's entry needs to be created in the index if it's not already there; otherwise the append fails with a KeyError, as sketched earlier. Let's update the pseudocode to handle that:

initialize index
for each document in corpus:
    get a list of normalized tokens from the document
    for each token:
        if token is not in the index:
            initialize token's entry in the index to an empty collection
        add current document id to token's index entry
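As an aside, Python has an idiomatic shortcut for the "if not in the index, initialize" step: dict.setdefault returns the existing entry for a key, or inserts (and returns) a default if the key is missing. A minimal sketch under the same assumptions as before:

index = {}
for doc_id, document in enumerate(corpus):
    for token in document.lower().split():
        # setdefault returns token's existing list, or inserts [] first
        index.setdefault(token, []).append(doc_id)

The explicit membership test in the pseudocode above does the same thing and is clearer for this exercise; setdefault (or collections.defaultdict) just collapses it into one call.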

For this problem, create a dictionary document_index that has the vocabulary in corpus as keys; for each vocabulary word, the value will be the list of ids of the documents containing that word. The final answer (the contents of document_index) is shown at the end to help you visualize the data structure and determine whether your code worked.

Some other information or hints/tips:

- split on whitespace, as we've seen for tokenization

- convert to lowercase, as we've seen for normalization

- use the document's index in the corpus as its id: 0 for the first doc, 1 for the second, and so on
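Putting the first two hints together, normalization plus tokenization is a one-liner. For example, on the first document:

'I427 Search Informatics'.lower().split()
# ['i427', 'search', 'informatics']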

document_index = {}
for doc_id, document in enumerate(corpus):        # doc_id: 0 for first doc, 1 for second, ...
    normalized_tokens = document.lower().split()  # lowercase, then split on whitespace
    for token in normalized_tokens:
        if token not in document_index:
            document_index[token] = []            # empty collection for the value
        document_index[token].append(doc_id)      # add current document id to token's entry

print(document_index)

{'i427': [0], 'search': [0], 'informatics': [0, 2], 'i308': [1], 'information': [1, 3], 'representation': [1], 'i101': [2], 'introduction': [2], 'to': [2], 'systems': [3]}
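Once built, the index answers a one-word search with a single dictionary lookup. For example, using the output above:

document_index['informatics']   # [0, 2] -- ids of the documents containing 'informatics'
document_index['information']   # [1, 3]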
