Question

Posted on Sep 25, 2024

(I can't provide the zip file; if there is a way to upload files, let me know. Otherwise, just use any random files.)

PYTHON PROGRAM CODE

Indexer: For this part you need to implement an indexer to build an inverted index for a collection of web pages in a zipped file rhf.zip (you can use a random zip file with stuff in it).

Note that the compressed file contains a collection of 18,808 files and 361 directories (junk files). Do not change the files and directories.
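
One way to respect the "do not change the files" rule is to read each page straight out of the archive instead of extracting it. The sketch below uses the standard zipfile module; the function name iter_zip_documents and the choice of UTF-8 decoding with replacement are assumptions for illustration, not part of the assignment.

```python
import zipfile

def iter_zip_documents(zip_source):
    """Yield (member_name, text) for every regular file in the archive.

    zip_source may be a path such as "rhf.zip" or an open binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, index files only
            raw = archive.read(info)
            # Assumption: undecodable bytes are replaced rather than raising.
            yield info.filename, raw.decode("utf-8", errors="replace")
```

The member name can double as the document URL or be mapped to a unique integer ID later.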

Your indexer shall include the following functionalities:

1.- A tokenizer: to tokenize web pages. This shall include ignoring HTML tags and non-textual contents such as images. For textual content, it shall extract all words. It shall also extract all URLs of hyperlinks contained in the web page.

2.- A stop-words remover: You define a list of stop-words such as a, an, the, of, etc. This part shall remove all stop-words from those obtained from the tokenizer.

3.- A document list: This shall record a document ID, document length, and maybe additional information. You can use the URL of the web page as its ID, or use a unique integer for the ID but also record the URL to refer to the web page.

4.- An inverted index: This index is based on the vector space model. It is a hash map of the following element type:

o Word, document frequency of the word, and a document posting list that records all the web pages containing the word

o A document posting list is organized as a list of records.

o A record is composed of:
   - A document ID for the web page containing the word
   - A term frequency for the frequency of the word in the web page
   - A set of positions of the word appearing in the web page
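
The four indexer parts above can be sketched together as follows. This is a minimal illustration, not a full solution: the names PageParser, tokenize, build_index and the short STOP_WORDS set are all made up here, words are assumed to be lowercase runs of letters and digits, and positions are counted after stop-word removal.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}  # illustrative list

class PageParser(HTMLParser):
    """Collect visible text and hyperlink URLs; skip script/style content."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)

def tokenize(html):
    """Return (words, links) for one page; tags and images contribute no words."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    return words, parser.links

def build_index(pages):
    """pages: iterable of (url, html). Returns (doc_list, index)."""
    doc_list = {}               # doc_id -> {"url", "length", "links"}
    index = defaultdict(dict)   # word -> {doc_id: [positions]}
    for doc_id, (url, html) in enumerate(pages):
        words, links = tokenize(html)
        kept = [w for w in words if w not in STOP_WORDS]
        doc_list[doc_id] = {"url": url, "length": len(kept), "links": links}
        for pos, word in enumerate(kept):
            index[word].setdefault(doc_id, []).append(pos)
    return doc_list, index
```

With this layout, the document frequency of a word is len(index[word]) and the term frequency in a document is len(index[word][doc_id]), so neither needs to be stored separately.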

Query parser: This part asks you to build a parser for processing queries of one or more words. Below is a list of typical queries:

Type 1: Q1 = adventures or tom or sawyer

Type 2: Q2 = adventures and tom and sawyer

Type 3: Q3 = "tom sawyer"

Type 4: Q4 = tom and (not sawyer)

Type 5: Q5 = ((adventures tom sawyer) and huck) and (not huckleberry)

Here, and, or, and not are Boolean operators. We interpret the space between any two nonblank parts of a query as a Boolean or operator. In the context of web search, the meaning of the above is obvious. For example, for Q4, we'd like to search for documents that contain tom but not sawyer. The double-quote operator means a phrase; e.g., query Q3 asks for documents that contain the phrase "tom sawyer".
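
A phrase query can be answered from the positions stored in the posting lists: the words must occur at consecutive positions in the same document. The sketch below assumes an index shaped as word -> {doc_id: [positions]}; the function name phrase_docs is hypothetical.

```python
def phrase_docs(index, words):
    """Return the set of doc IDs where `words` occur at consecutive positions."""
    if not words or any(w not in index for w in words):
        return set()
    # Only documents containing every word can contain the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc in candidates:
        later = [set(index[w][doc]) for w in words[1:]]
        for start in index[words[0]][doc]:
            # The k-th following word must sit exactly k + 1 positions later.
            if all(start + k + 1 in pos for k, pos in enumerate(later)):
                hits.add(doc)
                break
    return hits
```

If positions are recorded after stop-word removal, the same stop-word filter would need to be applied to the phrase before matching.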

Given any query of words, possibly along with Boolean operators, the phrase operators and parentheses, you need to design and implement a parser (some generalization of binary expression trees) to analyze and to calculate the documents that are relevant to the query. In such a tree, the internal nodes store operators and the leaf nodes represent the words appearing in the query.
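
One possible shape for such a parser is a small recursive-descent parser that returns nested tuples as the tree. The node shapes, the function name parse_query, and the precedence (not binds tightest, then and, then or) are assumptions of this sketch; the assignment does not fix them.

```python
import re

# A quoted phrase, a single parenthesis, or a run of other non-blank characters.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def parse_query(query):
    """Parse into ("word", w), ("phrase", [w, ...]), ("not", child),
    ("and", [children]) or ("or", [children])."""
    tokens = TOKEN_RE.findall(query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        children = [parse_and()]
        while peek() is not None and peek() != ")":
            if peek() == "or":
                take()
            # No explicit operator here means the space acts as an implicit "or".
            children.append(parse_and())
        return children[0] if len(children) == 1 else ("or", children)

    def parse_and():
        children = [parse_unary()]
        while peek() == "and":
            take()
            children.append(parse_unary())
        return children[0] if len(children) == 1 else ("and", children)

    def parse_unary():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = take()
        if tok == "not":
            return ("not", parse_unary())
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        if tok in (")", "and", "or"):
            raise ValueError(f"unexpected token {tok!r}")
        if tok.startswith('"'):
            return ("phrase", tok.strip('"').split())
        return ("word", tok)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For example, parse_query("tom and (not sawyer)") returns ("and", [("word", "tom"), ("not", ("word", "sawyer"))]), with the operator at the internal node and the words at the leaves.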

For a collection of documents, any given word shall be linked to the documents that contain the word. To compute the relevant documents for a given query represented by a parser tree, you first need to find the document lists for the words stored at the leaves. For an internal and operator, you shall create a new list of documents that appear in every list of its children nodes. For an internal or operator, you shall create a new list by merging all children lists, where you need to find a way to take care of duplicated documents. Dealing with not is a little bit tricky for a very large document collection, because there are often too many documents that do not contain a given word. Instead, you may record the documents containing the word. For example, to find documents for Mark and (not Twain), you first create a list L1 of documents containing Mark and another list L2 of documents containing Twain. Then, create a new list L3 by taking documents from L1 if they are not inside L2. L3 is the answer.
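
The evaluation described above maps naturally onto set operations: intersection for and, union for or (which removes duplicates automatically), and set difference for a not that appears under an and. The sketch below assumes a tree of nested tuples and an index shaped as word -> {doc_id: positions}; both shapes and the name evaluate are assumptions, and phrase nodes are left out to keep it short.

```python
def evaluate(node, index):
    """Return the set of doc IDs matching a query tree.

    Assumed node shapes: ("word", w), ("not", child),
    ("and", [children]), ("or", [children]).
    """
    kind, arg = node
    if kind == "word":
        return set(index.get(arg, {}))
    if kind == "or":
        result = set()
        for child in arg:
            result |= evaluate(child, index)  # union drops duplicates
        return result
    if kind == "and":
        positive = [c for c in arg if c[0] != "not"]
        negative = [c[1] for c in arg if c[0] == "not"]
        if not positive:
            raise ValueError('"and" needs at least one non-negated operand')
        result = evaluate(positive[0], index)
        for child in positive[1:]:
            result &= evaluate(child, index)
        for child in negative:
            result -= evaluate(child, index)  # L3 = L1 minus L2
        return result
    # A "not" outside an "and" would require the whole collection, so it is rejected.
    raise ValueError(f"cannot evaluate a bare {kind!r} node")
```

Handling not only as a subtraction inside and avoids ever building the list of all documents that lack a word.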

Search Engine: This part asks the user to enter a query and decides whether the query is a Boolean type query or a general type query. If the query is a Boolean type query, then with the help of your query parser, you shall search your index to find all the relevant documents and return these document urls to the user. If the query is a general type query, then you shall use the cosine similarity metric to find a list of top-ranked documents relevant to the query and return the urls of these documents to the user.
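
For the general type query, a common way to apply cosine similarity is to weight terms by TF-IDF and compare the query vector with each document vector. The assignment does not prescribe a weighting scheme, so the raw tf times log(N/df) weighting below, the function name rank, and the index shape word -> {doc_id: [positions]} are all assumptions of this sketch.

```python
import math
from collections import Counter

def rank(query_words, index, num_docs, top_k=10):
    """Return up to top_k doc IDs ordered by cosine similarity to the query."""
    def idf(word):
        return math.log(num_docs / len(index[word]))

    # Squared length of every document vector, accumulated over all terms.
    norms = Counter()
    for word, postings in index.items():
        for doc, positions in postings.items():
            norms[doc] += (len(positions) * idf(word)) ** 2

    query_tf = Counter(w for w in query_words if w in index)
    scores = Counter()      # doc_id -> dot product with the query vector
    query_norm = 0.0
    for word, qtf in query_tf.items():
        q_weight = qtf * idf(word)
        query_norm += q_weight ** 2
        for doc, positions in index[word].items():
            scores[doc] += q_weight * len(positions) * idf(word)

    ranked = []
    for doc, dot in scores.items():
        denom = math.sqrt(norms[doc]) * math.sqrt(query_norm)
        if denom > 0:
            ranked.append((dot / denom, doc))
    # Highest similarity first; ties broken by doc ID for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in ranked[:top_k]]
```

In a real indexer the document norms would be computed once at indexing time and stored in the document list rather than recomputed for every query. The returned IDs can then be mapped back to URLs through the document list.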