Question

1 Approved Answer

Posted on Sep 07, 2024

Python Programming (Just need the Code) Index.py #Python 3.0 import re import os import collections import time #import other modules as needed class index: def

Python Programming (Just need the Code)

image text in transcribed

Index.py

#Python 3.0 import re import os import collections import time #import other modules as needed

class index: def __init__(self,path):

def buildIndex(self): #function to read documents from collection, tokenize and build the index with tokens # implement additional functionality to support methods 1 - 4 #use unique document integer IDs

def exact_query(self, query_terms, k): #function for exact top K retrieval (method 1) #Returns at the minimum the document names of the top K documents ordered in decreasing order of similarity score def inexact_query_champion(self, query_terms, k): #function for exact top K retrieval using champion list (method 2) #Returns at the minimum the document names of the top K documents ordered in decreasing order of similarity score def inexact_query_index_elimination(self, query_terms, k): #function for exact top K retrieval using index elimination (method 3) #Returns at the minimum the document names of the top K documents ordered in decreasing order of similarity score def inexact_query_cluster_pruning(self, query_terms, k): #function for exact top K retrieval using cluster pruning (method 4) #Returns at the minimum the document names of the top K documents ordered in decreasing order of similarity score

def print_dict(self): #function to print the terms and posting list in the index

def print_doc_list(self): # function to print the documents and their document id

Note: This is individual work In this assignment, you will improve on the indexes from assignment 1, and implement the vector space model with cosine similarity for retrieving the top K documents (ranked retrieval) from the collection of document provided as an attachment. You will then compare exact Top K search with methods of inexact top K search. The specifications are: Part A: Exact Top K Retrieval [20 In part A, you will construct a retrieval system for exact top K retrieval (henceforth denoted as Method 1) Use the TF-IDF weighting for the terms to construct the vectors for the documents and queries. Your index will be of the form o Term ID: lidft(ID1,wtjd, [posl,pos2,.]), (ID2, Wt2 dz, [pos1,pos2,..]),...] As in assignment 1, your program will accept the collection, process the documents and construct the index. Additionally, you will use the stop list (provided as an attachment) to weed out words from your dictionary. Note you should not index the document for each query. Simply do this at the beginning Your program will then accept a free text query, generate the vectors for the documents and the query and compute the cosine similarity score for the documents. You can assume that you will not get single term queries. The program will retrieve and display the names of the top K documents for each query in decreasing order of their score. . For each query, you will consider the results for your top K results as the baseline results. You will use this in part B You will also note the time taken to retrieve the results for each query Part B: Comparison of Inexact retrieval methods with inexact retrieval [30 You will also compare the performance of the exact top K with three inexact methods given below Method 2 Champion List: Implement a champion list method which will produce, for each term a list of r documents that is based on the weighted term frequency Wtd. Come up with a formula for r. Discuss this in your report. Remember the champion list is created only once fora collection. Method 3 Index Elimination: Implement index elimination by using only half the queries terms sorted in decreasing order of their IDF values. . Method 4 Simple Cluster pruning: You will randomly pick vN leaders (where N is the number of documents in the collection) and then use them to implement the cluster pruning. Note that you need to select the leaders only once. For each query, you will select a leader closest to the query and then retrieve the top K results. Remember, if you do not get the top K results with the first leader, you will also look at the next best leader and so on. Part C: Comparison of Inexact retrieval methods with inexact retrieval [251 Experimentally compare the performance of your implementations of the exact retrieval with that of inexact retrieval methods. What are your performance measures? How are they measured and compared? In your report, you will provide justifications for these measures. Use graphs, as appropriate, to compare performance and describe insights and conclusion drawn from your results. Also include details for your implementation in the report that may have an impact on performance. Other instructions: Implement in Python Comment your code appropriately