Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Mini Search Engine: Given a file of documents where each document occupies a line of the file, you are to build a data structure (called

  1. Mini Search Engine: Given a file of documents where each document occupies a line of the file, you are to build a data structure (called an inverse index) that allows you to identify those documents containing a given word. We will identify the documents by document number: the document represented by the first line of the file is document number 0, that represented by the second line is document number 1, and so on. Make matching of case-insensitive (e.g. Wall and wall are the same words)

Note that the period is considered part of a substring. To make this easier, we have a file of documents in which punctuation are separated from words by spaces. Often one wants to iterate through the elements of a list while keeping track of the indices of the elements. Python provides enumerate(L) for this purpose.

For example, start with a tiny test file --- doc0.txt which contains only three lines of data:

A bb Ccc , .

D ee A bb

FFF

The inverse_index should be a dictionary with the key storing each word, and the value storing the line containing all occurrence of that word:

{{'a': [0, 1], 'bb': [0, 1], 'ccc': [0], ',': [0], '.': [0], 'd': [1], 'ee': [1],

'fff': [2]}

So, a appears in both line 0 and 1. `fff appears in only line 2. Your assignment is to first write python code to generate such inverse_index dict, and then use it to perform queries in the following.

  1. Write a procedure make_inverse_index(strlist) that, given a list of strings (documents), returns a dictionary that maps each word to the set consisting of the document numbers of documents in which that word appears. This dictionary is called an inverse index. (Hint: use enumerate.)
  2. Write a procedure or_search(inverseIndex, query) which takes an inverse index and a list of words query, and returns the set of document numbers specifying all documents that contain any of the words in query.
  3. Write a procedure and_search(inverseIndex, query) which takes an inverse index and a list of words query, and returns the set of document numbers specifying all documents that contain all of the words in query.
  4. Try out your procedures on stories.txt. Try or_search, and_search with the queries united states and wall street on the documents in stories.txt and save the resulting lists (tw lists of document ids per query) in a file named results.txt.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Sams Teach Yourself Beginning Databases In 24 Hours

Authors: Ryan Stephens, Ron Plew

1st Edition

067232492X, 978-0672324925

More Books

Students also viewed these Databases questions

Question

Explain what scientific methods are and why they are important.

Answered: 1 week ago

Question

a sin(2x) x Let f(x)=2x+1 In(be)

Answered: 1 week ago

Question

My opinions/suggestions are valued.

Answered: 1 week ago