Question
Python Programming index.py import re import os import collections import time class index: def __init__(self,path): def buildIndex(self): #function to read documents from collection, tokenize and
Python Programming
index.py
import re import os import collections import time
class index: def __init__(self,path):
def buildIndex(self): #function to read documents from collection, tokenize and build the index with tokens #index should also contain positional information of the terms in the document --- term: [(ID1,[pos1,pos2,..]), (ID2, [pos1,pos2,]),.] #use unique document IDs
def and_query(self, query_terms): #function for identifying relevant docs using the index
def print_dict(self): #function to print the terms and posting list in the index
def print_doc_list(self): # function to print the documents and their document id
In this assignment, you will implement an inverted index in Python that can be used with Boolean queries with AND operators to retrieve relevant documents. You will implement the function in the code stub provided (see attachment). For your implementation, note the following design conditions: Use the code stub provided as an attachment and implement the functions in the stub. You may add additional functions as you see fit. Use only the python modules used in the stub. Use the document collection provided (see attachment) to generate the index. Tokenize the words in each document. There is no need to normalize the tokens. Do the basic tokenization i.e., remove all punctuations, and numerals. Terms in the index should all be in small caps. Each document should be given a unique document id. The posting list, in addition to the document id, should also contain the positions of the term in the document. The index should be in the form term: [(iD1,lposl,pos2,. .1), (ID2, [pos1,pos2, Implement a merge algorithm for processing AND Boolean queries. (see additional guidelines below) Print out the time for building the index and processing the query. Provide appropriate comments in code. 1. 2. 3. 4. 5. 6. 7. 8. 9. Merge Algorithm: (see AND query) in code stub) Implement an efficient algorithm to merge (AND operation) posting lists of multiple query terms. The algorithm should take a list of query terms (can be more than 2 terms) as input. In class, we looked at an algorithm for merging posting lists of two query terms by moving a pointer over all the relevant posting lists simultaneously. Generalize this for n terms. Note that an efficient algorithm looks at each item in the posting lists of the query term only once. Sample Output >>>execfile('/home/jkorah/mnt/ir/Assignments/Assignment1/code/index.py') >>>aindex(/home/jkorah/collection/') Index built in 103.25177598 seconds. >xa.and _query([with', 'without', 'yemen']) Results for the Query: with AND without AND yemen Total Docs retrieved: 6 Text-99.txt Text-159.txt Text-121.txt Text-115.txt Text-117.txt Text-86.txt Retrieved in 0.000526905059814 seconds >>>x-a.and _query([with', 'without', 'yemen', 'yemeni']) Results for the Query: with AND without AND yemen AND yemeni Total Docs retrieved: 2 Text-99.txt Text-121.txt Retrieved in 0.00115895271301 seconds >>> -.print_dict() zone [(111, [125]), (122, [82]), (147, [206]), (198, [1739]), (231, [632]), (249, [101]), (293, [88]), (306, [82, 519]), (329, [288]), (350, [335]), (371, [115]), (393, [246])] zones [(63, [451]), (261, [522]), (379, [798])] zoo [(400, [86]), (401, [196, 393])] zoom [(171, [640])] zoomed [(410, [830])] >>>x-a.print_doc list() Doc ID: 407>Text-49.txt Doc ID: 408 => Text-72.txt Doc ID: 409 > Text-36.txt Doc ID: 410-=> Text-349.txt Doc ID: 411 ==> Text-77.txt Doc ID: 412 > Text-183.txt Doc ID: 413 > Text-55.txt Doc ID: 414 => Text-201.txt Doc ID: 415>Text-242.txt Submission: Submit the following files on blackboard as a zip file 1. index.py 2. README filebriefly describing your merge algorithm and any other comments 3. Output.txt: containing 5 queries queries and the output generated by your code. In this assignment, you will implement an inverted index in Python that can be used with Boolean queries with AND operators to retrieve relevant documents. You will implement the function in the code stub provided (see attachment). For your implementation, note the following design conditions: Use the code stub provided as an attachment and implement the functions in the stub. You may add additional functions as you see fit. Use only the python modules used in the stub. Use the document collection provided (see attachment) to generate the index. Tokenize the words in each document. There is no need to normalize the tokens. Do the basic tokenization i.e., remove all punctuations, and numerals. Terms in the index should all be in small caps. Each document should be given a unique document id. The posting list, in addition to the document id, should also contain the positions of the term in the document. The index should be in the form term: [(iD1,lposl,pos2,. .1), (ID2, [pos1,pos2, Implement a merge algorithm for processing AND Boolean queries. (see additional guidelines below) Print out the time for building the index and processing the query. Provide appropriate comments in code. 1. 2. 3. 4. 5. 6. 7. 8. 9. Merge Algorithm: (see AND query) in code stub) Implement an efficient algorithm to merge (AND operation) posting lists of multiple query terms. The algorithm should take a list of query terms (can be more than 2 terms) as input. In class, we looked at an algorithm for merging posting lists of two query terms by moving a pointer over all the relevant posting lists simultaneously. Generalize this for n terms. Note that an efficient algorithm looks at each item in the posting lists of the query term only once. Sample Output >>>execfile('/home/jkorah/mnt/ir/Assignments/Assignment1/code/index.py') >>>aindex(/home/jkorah/collection/') Index built in 103.25177598 seconds. >xa.and _query([with', 'without', 'yemen']) Results for the Query: with AND without AND yemen Total Docs retrieved: 6 Text-99.txt Text-159.txt Text-121.txt Text-115.txt Text-117.txt Text-86.txt Retrieved in 0.000526905059814 seconds >>>x-a.and _query([with', 'without', 'yemen', 'yemeni']) Results for the Query: with AND without AND yemen AND yemeni Total Docs retrieved: 2 Text-99.txt Text-121.txt Retrieved in 0.00115895271301 seconds >>> -.print_dict() zone [(111, [125]), (122, [82]), (147, [206]), (198, [1739]), (231, [632]), (249, [101]), (293, [88]), (306, [82, 519]), (329, [288]), (350, [335]), (371, [115]), (393, [246])] zones [(63, [451]), (261, [522]), (379, [798])] zoo [(400, [86]), (401, [196, 393])] zoom [(171, [640])] zoomed [(410, [830])] >>>x-a.print_doc list() Doc ID: 407>Text-49.txt Doc ID: 408 => Text-72.txt Doc ID: 409 > Text-36.txt Doc ID: 410-=> Text-349.txt Doc ID: 411 ==> Text-77.txt Doc ID: 412 > Text-183.txt Doc ID: 413 > Text-55.txt Doc ID: 414 => Text-201.txt Doc ID: 415>Text-242.txt Submission: Submit the following files on blackboard as a zip file 1. index.py 2. README filebriefly describing your merge algorithm and any other comments 3. Output.txt: containing 5 queries queries and the output generated by your code
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started