Question


Posted on Sep 08, 2024


There is also a file called collection.zip that contains 423 .txt files, each holding about a paragraph of text.

In this assignment, you will implement an inverted index in Python that can be used with Boolean queries with AND operators to retrieve relevant documents. You will implement the functions in the code stub provided (see attachment). For your implementation, note the following design conditions:

1. Use the code stub provided as an attachment and implement the functions in the stub. You may add additional functions as you see fit.
2. Use only the Python modules used in the stub.
3. Use the document collection provided (see attachment) to generate the index.
4. Tokenize the words in each document. There is no need to normalize the tokens. Do only basic tokenization, i.e., remove all punctuation and numerals. Terms in the index should all be in lowercase.
5. Each document should be given a unique document ID.
6. The posting list, in addition to the document ID, should also contain the positions of the term in the document. The index should be in the form term: [(ID1, [pos1, pos2, ..]), (ID2, [pos1, pos2, ..]), ..]
7. Implement a merge algorithm for processing AND Boolean queries (see additional guidelines below).
8. Print out the time for building the index and for processing the query.
9. Provide appropriate comments in code.

Merge Algorithm (see and_query in the code stub):

Implement an efficient algorithm to merge (AND operation) the posting lists of multiple query terms. The algorithm should take a list of query terms (can be more than 2 terms) as input. In class, we looked at an algorithm for merging the posting lists of two query terms by moving a pointer over all the relevant posting lists simultaneously. Generalize this for n terms. Note that an efficient algorithm looks at each item in the posting lists of the query terms only once.

Sample Output:

    >>> execfile('/home/jkorah/mnt/ir/Assignments/Assignment1/code/index.py')
    >>> a = index('/home/jkorah/collection/')
    Index built in 103.25177598 seconds.
    >>> x = a.and_query(['with', 'without', 'yemen'])
    Results for the Query: with AND without AND yemen
    Total Docs retrieved: 6
    Text-99.txt
    Text-159.txt
    Text-121.txt
    Text-115.txt
    Text-117.txt
    Text-86.txt
    Retrieved in 0.000526905059814 seconds
    >>> x = a.and_query(['with', 'without', 'yemen', 'yemeni'])
    Results for the Query: with AND without AND yemen AND yemeni
    Total Docs retrieved: 2
    Text-99.txt
    Text-121.txt
    Retrieved in 0.00115895271301 seconds
    >>> x = a.print_dict()
    zone [(111, [125]), (122, [82]), (147, [206]), (198, [1739]), (231, [632]), (249, [101]), (293, [88]), (306, [82, 519]), (329, [288]), (350, [335]), (371, [115]), (393, [246])]
    zones [(63, [451]), (261, [522]), (379, [798])]
    zoo [(400, [86]), (401, [196, 393])]
    zoom [(171, [640])]
    zoomed [(410, [830])]
    >>> x = a.print_doc_list()
    Doc ID: 407 ==> Text-49.txt
    Doc ID: 408 ==> Text-72.txt
    Doc ID: 409 ==> Text-36.txt
    Doc ID: 410 ==> Text-349.txt
    Doc ID: 411 ==> Text-77.txt
    Doc ID: 412 ==> Text-183.txt
    Doc ID: 413 ==> Text-55.txt
    Doc ID: 414 ==> Text-201.txt
    Doc ID: 415 ==> Text-242.txt

Code stub:

    #Python 2.7.3
    import re
    import os
    import collections
    import time

    class index:
        def __init__(self, path):
            pass

        def buildIndex(self):
            #function to read documents from collection, tokenize and build the index with tokens
            #index should also contain positional information of the terms in the document
            #term: [(ID1, [pos1, pos2, ..]), (ID2, [pos1, pos2, ..]), ..]
            #use unique document IDs
            pass

        def and_query(self, query_terms):
            #function for identifying relevant docs using the index
            pass

        def print_dict(self):
            #function to print the terms and posting list in the index
            pass

        def print_doc_list(self):
            #function to print the documents and their document id
            pass

3. Output.txt: containing 5 queries and the output generated by your code.
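The tokenization and posting-list conditions in the question can be illustrated with a short sketch. This is only one possible reading, not the course's reference solution: the helper names tokenize and build_positional_index are hypothetical, documents are passed in as (doc_id, text) pairs instead of being read from collection.zip, positions are assumed to be zero-based token offsets, and punctuation and numerals are assumed to be deleted outright (so "well-known" would become "wellknown").

```python
import re
import collections


def tokenize(text):
    """Lowercase the text, delete everything except a-z and whitespace, then split."""
    # Assumption: punctuation and digits are removed rather than treated as separators.
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    return cleaned.split()


def build_positional_index(docs):
    """Build term -> [(doc_id, [pos, ...]), ...] with postings sorted by doc_id.

    docs is an iterable of (doc_id, text) pairs; doc IDs must be unique.
    """
    # term -> {doc_id: [positions]} while building
    by_term = collections.defaultdict(dict)
    for doc_id, text in docs:
        for pos, term in enumerate(tokenize(text)):
            by_term[term].setdefault(doc_id, []).append(pos)
    # Convert each inner dict to a list of tuples ordered by document ID.
    return {term: sorted(postings.items()) for term, postings in by_term.items()}
```

Keeping each posting list sorted by document ID is what later allows AND queries to be answered with a single pass over the lists.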
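The merge the question asks for (one pointer per posting list, each posting examined only once) might look like the following for n terms. This is a sketch under stated assumptions: and_merge is a hypothetical helper that is not part of the provided stub, and every posting list is assumed to be sorted by document ID in the form [(ID, [positions]), ...].

```python
def and_merge(posting_lists):
    """Intersect n posting lists sorted by doc ID and return the common doc IDs."""
    # An empty query, or a term with no postings, can match nothing.
    if not posting_lists or any(not plist for plist in posting_lists):
        return []

    pointers = [0] * len(posting_lists)  # one cursor per posting list
    result = []
    while True:
        # Document ID currently under each cursor.
        current = [plist[i][0] for plist, i in zip(posting_lists, pointers)]
        largest = max(current)
        if all(doc_id == largest for doc_id in current):
            # Every list agrees on this document: record it and move all cursors.
            result.append(largest)
            pointers = [i + 1 for i in pointers]
        else:
            # Cursors behind the largest ID cannot point at a match, so advance them.
            pointers = [i + 1 if doc_id < largest else i
                        for i, doc_id in zip(pointers, current)]
        # Stop as soon as any list is exhausted; no further match is possible.
        if any(i >= len(plist) for plist, i in zip(posting_lists, pointers)):
            return result
```

Inside and_query, one would look up the posting list of each query term, return no results if any term is missing from the index, and pass the lists to a helper like this. Because cursors only move forward, no posting is visited twice.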