Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Python Programming index.py import re import os import collections import time class index: def __init__(self,path): def buildIndex(self): #function to read documents from collection, tokenize and

Python Programming

image text in transcribed

image text in transcribed

image text in transcribed

index.py

import re import os import collections import time

class index: def __init__(self,path):

def buildIndex(self): #function to read documents from collection, tokenize and build the index with tokens #index should also contain positional information of the terms in the document --- term: [(ID1,[pos1,pos2,..]), (ID2, [pos1,pos2,]),.] #use unique document IDs

def and_query(self, query_terms): #function for identifying relevant docs using the index

def print_dict(self): #function to print the terms and posting list in the index

def print_doc_list(self): # function to print the documents and their document id

In this assignment, you will implement an inverted index in Python that can be used with Boolean queries with AND operators to retrieve relevant documents. You will implement the function in the code stub provided (see attachment). For your implementation, note the following design conditions: Use the code stub provided as an attachment and implement the functions in the stub. You may add additional functions as you see fit. Use only the python modules used in the stub. Use the document collection provided (see attachment) to generate the index. Tokenize the words in each document. There is no need to normalize the tokens. Do the basic tokenization i.e., remove all punctuations, and numerals. Terms in the index should all be in small caps. Each document should be given a unique document id. The posting list, in addition to the document id, should also contain the positions of the term in the document. The index should be in the form term: [(iD1,lposl,pos2,. .1), (ID2, [pos1,pos2, Implement a merge algorithm for processing AND Boolean queries. (see additional guidelines below) Print out the time for building the index and processing the query. Provide appropriate comments in code. 1. 2. 3. 4. 5. 6. 7. 8. 9. Merge Algorithm: (see AND query) in code stub) Implement an efficient algorithm to merge (AND operation) posting lists of multiple query terms. The algorithm should take a list of query terms (can be more than 2 terms) as input. In class, we looked at an algorithm for merging posting lists of two query terms by moving a pointer over all the relevant posting lists simultaneously. Generalize this for n terms. Note that an efficient algorithm looks at each item in the posting lists of the query term only once. Sample Output >>>execfile('/home/jkorah/mnt/ir/Assignments/Assignment1/code/index.py') >>>aindex(/home/jkorah/collection/') Index built in 103.25177598 seconds. >xa.and _query([with', 'without', 'yemen']) Results for the Query: with AND without AND yemen Total Docs retrieved: 6 Text-99.txt Text-159.txt Text-121.txt Text-115.txt Text-117.txt Text-86.txt Retrieved in 0.000526905059814 seconds >>>x-a.and _query([with', 'without', 'yemen', 'yemeni']) Results for the Query: with AND without AND yemen AND yemeni Total Docs retrieved: 2 Text-99.txt Text-121.txt Retrieved in 0.00115895271301 seconds >>> -.print_dict() zone [(111, [125]), (122, [82]), (147, [206]), (198, [1739]), (231, [632]), (249, [101]), (293, [88]), (306, [82, 519]), (329, [288]), (350, [335]), (371, [115]), (393, [246])] zones [(63, [451]), (261, [522]), (379, [798])] zoo [(400, [86]), (401, [196, 393])] zoom [(171, [640])] zoomed [(410, [830])] >>>x-a.print_doc list() Doc ID: 407>Text-49.txt Doc ID: 408 => Text-72.txt Doc ID: 409 > Text-36.txt Doc ID: 410-=> Text-349.txt Doc ID: 411 ==> Text-77.txt Doc ID: 412 > Text-183.txt Doc ID: 413 > Text-55.txt Doc ID: 414 => Text-201.txt Doc ID: 415>Text-242.txt Submission: Submit the following files on blackboard as a zip file 1. index.py 2. README filebriefly describing your merge algorithm and any other comments 3. Output.txt: containing 5 queries queries and the output generated by your code. In this assignment, you will implement an inverted index in Python that can be used with Boolean queries with AND operators to retrieve relevant documents. You will implement the function in the code stub provided (see attachment). For your implementation, note the following design conditions: Use the code stub provided as an attachment and implement the functions in the stub. You may add additional functions as you see fit. Use only the python modules used in the stub. Use the document collection provided (see attachment) to generate the index. Tokenize the words in each document. There is no need to normalize the tokens. Do the basic tokenization i.e., remove all punctuations, and numerals. Terms in the index should all be in small caps. Each document should be given a unique document id. The posting list, in addition to the document id, should also contain the positions of the term in the document. The index should be in the form term: [(iD1,lposl,pos2,. .1), (ID2, [pos1,pos2, Implement a merge algorithm for processing AND Boolean queries. (see additional guidelines below) Print out the time for building the index and processing the query. Provide appropriate comments in code. 1. 2. 3. 4. 5. 6. 7. 8. 9. Merge Algorithm: (see AND query) in code stub) Implement an efficient algorithm to merge (AND operation) posting lists of multiple query terms. The algorithm should take a list of query terms (can be more than 2 terms) as input. In class, we looked at an algorithm for merging posting lists of two query terms by moving a pointer over all the relevant posting lists simultaneously. Generalize this for n terms. Note that an efficient algorithm looks at each item in the posting lists of the query term only once. Sample Output >>>execfile('/home/jkorah/mnt/ir/Assignments/Assignment1/code/index.py') >>>aindex(/home/jkorah/collection/') Index built in 103.25177598 seconds. >xa.and _query([with', 'without', 'yemen']) Results for the Query: with AND without AND yemen Total Docs retrieved: 6 Text-99.txt Text-159.txt Text-121.txt Text-115.txt Text-117.txt Text-86.txt Retrieved in 0.000526905059814 seconds >>>x-a.and _query([with', 'without', 'yemen', 'yemeni']) Results for the Query: with AND without AND yemen AND yemeni Total Docs retrieved: 2 Text-99.txt Text-121.txt Retrieved in 0.00115895271301 seconds >>> -.print_dict() zone [(111, [125]), (122, [82]), (147, [206]), (198, [1739]), (231, [632]), (249, [101]), (293, [88]), (306, [82, 519]), (329, [288]), (350, [335]), (371, [115]), (393, [246])] zones [(63, [451]), (261, [522]), (379, [798])] zoo [(400, [86]), (401, [196, 393])] zoom [(171, [640])] zoomed [(410, [830])] >>>x-a.print_doc list() Doc ID: 407>Text-49.txt Doc ID: 408 => Text-72.txt Doc ID: 409 > Text-36.txt Doc ID: 410-=> Text-349.txt Doc ID: 411 ==> Text-77.txt Doc ID: 412 > Text-183.txt Doc ID: 413 > Text-55.txt Doc ID: 414 => Text-201.txt Doc ID: 415>Text-242.txt Submission: Submit the following files on blackboard as a zip file 1. index.py 2. README filebriefly describing your merge algorithm and any other comments 3. Output.txt: containing 5 queries queries and the output generated by your code

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

DB2 Universal Database V7.1 Application Development Certification Guide

Authors: Steve Sanyal, David Martineau, Kevin Gashyna, Michael Kyprianou

1st Edition

0130913677, 978-0130913678

More Books

Students also viewed these Databases questions

Question

What is inherent risk? What is residual risk?

Answered: 1 week ago