Question
An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries.
Below is code for an inverted index that supports queries. To support them, the index must be positional: it maps each word to its locations within a set of documents.
This inverted index should have these characteristics:
All words in the index should be lower case.
No punctuation, numbers, or symbols should be represented in the index.
These stopwords should not be included in the index: and, but, is, the, to. You may use any method you want to support this.
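The three characteristics above can be sketched as a single normalization step. This is only one possible approach, not the method the question requires: the name normalize and the STOP_WORDS set are hypothetical, and the choice to keep ASCII letters only (so accented letters are dropped too) is an assumption. Positions count every whitespace-separated token, which matches the indexing code given in the question.

```python
import string

# Stopwords listed in the question.
STOP_WORDS = {"and", "but", "is", "the", "to"}

def normalize(text):
    """Return (position, word) pairs for the words that belong in the index.

    Positions are 1-based and count every whitespace-separated token, so a
    dropped stopword still advances the counter (an assumption).
    """
    pairs = []
    for position, token in enumerate(text.split(), start=1):
        # Keep ASCII letters only: this drops punctuation, digits, and symbols.
        word = "".join(ch for ch in token.lower() if ch in string.ascii_lowercase)
        if word and word not in STOP_WORDS:
            pairs.append((position, word))
    return pairs

print(normalize("The quick, brown fox is 42 and... done!"))
# prints [(2, 'quick'), (3, 'brown'), (4, 'fox'), (8, 'done')]
```

Filtering stopwords while tokenizing avoids a second pass that deletes them from the finished index.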
The code for an inverted index is below. Write a query program that queries the inverted index. It should support the following:
Boolean search queries that return the documents satisfying the specified condition. You should support AND and OR, as well as combinations of the two.
For any word provided by a user, return the files that contain the word and, for each occurrence of the word in a file, its position within that file.
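The word lookup in the last item might look like the minimal sketch below. It assumes the index has the shape {word: {filename: [positions]}}, which is what the indexing code in the question produces. The function name lookup and the sample data are hypothetical and only serve as illustration.

```python
def lookup(index, word):
    """Return {filename: [positions]} for a word, or {} if it is not indexed."""
    # The index stores lowercase words only, so lowercase the user's input.
    return index.get(word.lower(), {})

# Hypothetical sample index, for illustration only.
sample_index = {
    "search": {"filea.txt": [3, 17], "fileb.txt": [5]},
    "engine": {"filea.txt": [4]},
}

for filename, positions in sorted(lookup(sample_index, "Search").items()):
    print(f"{filename}: positions {positions}")
```

Returning an empty dict for unknown words lets the caller treat "no match" the same way as any other result.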
=================================================================================
import json
import pprint
import string

stop_words = ['and', 'but', 'is', 'to', 'the']
inverted_index = {}
files_list = ['filea.txt']  # files in directory

# Translation table that strips punctuation and digits from each token.
strip_table = str.maketrans('', '', string.punctuation + string.digits)

for file in files_list:
    with open(file) as f:
        line = f.read().split()
    # handle punctuation, numbers, and lower case
    line_lower = [word.lower().translate(strip_table) for word in line]
    position = 0
    for word in line_lower:
        position = position + 1  # 1-based; every token advances the counter
        if not word or word in stop_words:
            continue  # skip empty tokens and stopwords
        # make inverted index: word -> {file: [positions]}
        inverted_index.setdefault(word, {}).setdefault(file, []).append(position)

with open('file.json', 'w') as outfile:
    json.dump(inverted_index, outfile)

print('Inverted Index is : ')  # show results
pprint.pprint(inverted_index)
where filea.txt is a text file with any text in it.
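A Boolean query program over this index could be sketched as follows. It is a sketch under stated assumptions, not a definitive solution: AND binds tighter than OR, parentheses are not supported, operators must be written in upper case, and adjacent words with no operator between them are treated as AND. The names load_index, docs_for, and boolean_query are hypothetical, as is the sample data. load_index reads the file.json that the indexing code writes.

```python
import json

def load_index(path="file.json"):
    """Load the index written by the indexing script."""
    with open(path) as f:
        return json.load(f)

def docs_for(index, word):
    """Set of files that contain the word (empty if the word is not indexed)."""
    return set(index.get(word.lower(), {}))

def boolean_query(index, query):
    """Evaluate a query such as 'cat AND dog OR bird'.

    Assumption: AND binds tighter than OR and there are no parentheses, so the
    query is a union (OR) of intersections (AND).
    """
    result = set()
    and_group = []  # words of the AND-group currently being read
    for token in query.split() + ["OR"]:  # trailing OR flushes the last group
        if token == "OR":
            if and_group:
                result |= set.intersection(*(docs_for(index, w) for w in and_group))
            and_group = []
        elif token != "AND":
            and_group.append(token)
    return result

# Hypothetical sample index, for illustration only.
sample_index = {
    "cat": {"a.txt": [1], "b.txt": [2]},
    "dog": {"b.txt": [3], "c.txt": [1]},
    "bird": {"c.txt": [4]},
}

print(sorted(boolean_query(sample_index, "cat AND dog")))          # ['b.txt']
print(sorted(boolean_query(sample_index, "cat OR bird")))          # ['a.txt', 'b.txt', 'c.txt']
print(sorted(boolean_query(sample_index, "cat AND dog OR bird")))  # ['b.txt', 'c.txt']
```

To run it against real data, replace sample_index with load_index() after the indexing code has written file.json.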