Question
Mini Search Engine: Given a file of documents where each document occupies a line of the file, you are to build a data structure (called
- Mini Search Engine: Given a file of documents where each document occupies a line of the file, you are to build a data structure (called an inverse index) that allows you to identify those documents containing a given word. We will identify the documents by document number: the document represented by the first line of the file is document number 0, that represented by the second line is document number 1, and so on. Make matching of case-insensitive (e.g. Wall and wall are the same words)
Note that the period is considered part of a substring. To make this easier, we have a file of documents in which punctuation are separated from words by spaces. Often one wants to iterate through the elements of a list while keeping track of the indices of the elements. Python provides enumerate(L) for this purpose.
For example, start with a tiny test file --- doc0.txt which contains only three lines of data:
A bb Ccc , .
D ee A bb
FFF
The inverse_index should be a dictionary with the key storing each word, and the value storing the line containing all occurrence of that word:
{{'a': [0, 1], 'bb': [0, 1], 'ccc': [0], ',': [0], '.': [0], 'd': [1], 'ee': [1],
'fff': [2]}
So, a appears in both line 0 and 1. `fff appears in only line 2. Your assignment is to first write python code to generate such inverse_index dict, and then use it to perform queries in the following.
- Write a procedure make_inverse_index(strlist) that, given a list of strings (documents), returns a dictionary that maps each word to the set consisting of the document numbers of documents in which that word appears. This dictionary is called an inverse index. (Hint: use enumerate.)
- Write a procedure or_search(inverseIndex, query) which takes an inverse index and a list of words query, and returns the set of document numbers specifying all documents that contain any of the words in query.
- Write a procedure and_search(inverseIndex, query) which takes an inverse index and a list of words query, and returns the set of document numbers specifying all documents that contain all of the words in query.
- Try out your procedures on stories.txt. Try or_search, and_search with the queries united states and wall street on the documents in stories.txt and save the resulting lists (tw lists of document ids per query) in a file named results.txt.
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started