Question


Course: Information Retrieval

Programming Assignment (Indexer Part 1)

The following example demonstrates what our text calls single-pass in-memory indexing. The indexer has been developed using the Python scripting language. Your assignment is to use this code to gain an understanding of how to generate an inverted index.

This simple Python code reads through a directory of documents, tokenizes each document, and adds the terms extracted from the files to an index. The program generates metrics from the corpus and writes two files: a document dictionary file and a terms dictionary file.

The terms dictionary file will contain each unique term discovered in the corpus in sorted order and will have a unique index number assigned to each term. The document dictionary will contain each document discovered in the corpus and will have a unique index number assigned to each document.
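The numbering scheme described above can be sketched in a few lines. This is a minimal illustration rather than the assignment's code: the function name `build_dictionary` and the sample tokens and file names are hypothetical, and numbering starts at 1 only because the example indexer counts that way. It is written to run under Python 2.7 or Python 3.

```python
def build_dictionary(items):
    """Map each unique item to a 1-based index number assigned in sorted order."""
    # set() removes duplicates; sorted() fixes the order before numbering
    return {item: number for number, item in enumerate(sorted(set(items)), start=1)}

# Hypothetical sample data standing in for the corpus
tokens = ["index", "term", "corpus", "term", "index"]
docs = ["doc_b.txt", "doc_a.txt"]

term_dictionary = build_dictionary(tokens)
doc_dictionary = build_dictionary(docs)
```

Because the items are sorted before they are numbered, the same corpus always produces the same index numbers.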

From our reading assignments we should recognize that a third file is required that links the terms to the documents in which they were discovered, using the index numbers. Generating this third file will be a future assignment.
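One possible shape for the data behind that third file is a mapping from each term's index number to the set of document index numbers in which the term occurs. The sketch below is only an assumption about how such a structure might look: the sample documents, the variable names, and the "term,documents" line layout are all hypothetical and are not specified by the assignment.

```python
from collections import defaultdict

# Hypothetical, already-tokenized documents keyed by document index number
tokenized_docs = {
    1: ["inverted", "index", "term"],
    2: ["term", "document"],
}

# Term dictionary: unique terms in sorted order, numbered from 1
all_terms = sorted({t for tokens in tokenized_docs.values() for t in tokens})
term_ids = {term: n for n, term in enumerate(all_terms, start=1)}

# Postings: term index number -> set of document index numbers
postings = defaultdict(set)
for doc_id, tokens in tokenized_docs.items():
    for token in tokens:
        postings[term_ids[token]].add(doc_id)

# One text line per term: the term index, a comma, then the document indexes
lines = ["%d,%s" % (tid, " ".join(str(d) for d in sorted(docs)))
         for tid, docs in sorted(postings.items())]
```

Storing only index numbers keeps each line short, since the term and document names can be looked up in the two dictionary files.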

We will be using a small corpus of files that contain article and author information from articles submitted to the Journal Communications of the ACM.

The corpus is in a zip file in the resources section of this unit as is the example python code. (https://drive.google.com/open?id=1XovU8ZspaSp-3lq3Tp5jRKwzRjVVk04C)

You will need either the current version of Python 2.x installed on your computer, or you can use the University of the People virtual lab, which already has Python installed. You will need to modify the code so that the directory where the files are found matches your environment. Although you can download the Python file, its contents are as follows:

# Example code in python programming language demonstrating some of the features of an inverted index.
#
# In this example, we scan a directory containing the corpus of files. (In this case the documents are reports on articles
# and authors submitted to the Journal "Communications of the Association for Computing Machinery")
#
# In this example we see each file being read, tokenized (each word or term is extracted) combined into a sorted
# list of unique terms.
#
# We also see the creation of a documents dictionary containing each document in sorted form with an index assigned to it.
# Each unique term is written out into a terms dictionary in sorted order with an index number assigned for each term.
# From our readings we know that to complete the inverted index all that we need to do is create a third file that will
# correlate each term with the list of documents that it was extracted from. We will do that in a later assignment.
#
# We can further develop this example by keeping a reference for each term of the documents that it came from and by
# developing a list of the documents thus creating the term and document dictionaries.
#
# As you work with this example, think about how you might enhance it to assign a unique index number to each term and to
# each document and how you might create a data structure that links the term index with the document index.

import sys,os,re
import time

# define global variables used as counters
tokens = 0
documents = 0
terms = 0
termindex = 0
docindex = 0

# initialize list variable
#
alltokens = []
alldocs = []

#
# Capture the start time of the routine so that we can determine the total running
# time required to process the corpus
#
t2 = time.localtime()

# set the name of the directory for the corpus
#
dirname = "c:\users\datai\cacm"

# For each document in the directory read the document into a string
#
all = [f for f in os.listdir(dirname)]
for f in all:
    documents+=1
    with open(dirname+'/'+f, 'r') as myfile:
        alldocs.append(f)
        # Replace newlines with spaces so that split() yields one token per word
        data=myfile.read().replace('\n', ' ')
        for token in data.split():
            alltokens.append(token)
            tokens+=1

# Open for write a file for the document dictionary
#
documentfile = open(dirname+'/'+'documents.dat', 'w')
alldocs.sort()
for f in alldocs:
    docindex += 1
    documentfile.write(f+','+str(docindex)+os.linesep)
documentfile.close()

#
# Sort the tokens in the list
alltokens.sort()

#
# Define a list for the unique terms
g=[]

#
# Identify unique terms in the corpus
for i in alltokens:
    if i not in g:
        g.append(i)
        terms+=1
terms = len(g)

# Output Index to disk file. As part of this process we assign an 'index' number to each unique term.
#
indexfile = open(dirname+'/'+'index.dat', 'w')
for i in g:
    termindex += 1
    indexfile.write(i+','+str(termindex)+os.linesep)
indexfile.close()

# Print metrics on corpus
#
print 'Processing Start Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)
print "Documents %i" % documents
print "Tokens %i" % tokens
print "Terms %i" % terms
t2 = time.localtime()
print 'Processing End Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)

The areas where you must update the code are identified in bold type. You should modify these to work for your environment. If you are working in Linux or in the virtual computer lab, remember that the backslashes in the directory path must be changed to forward slashes.
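One way to avoid editing slashes by hand is to build paths with `os.path.join` from the standard library, which uses the separator of the operating system the script runs on. The sketch below is an optional alternative, not part of the supplied indexer, and the relative directory `data/cacm` is a hypothetical placeholder for wherever you unpacked the corpus.

```python
import os

# Hypothetical corpus location; replace with the directory on your machine
dirname = os.path.join("data", "cacm")

# os.path.join inserts the separator that the current operating system expects
documents_path = os.path.join(dirname, "documents.dat")
index_path = os.path.join(dirname, "index.dat")
```

The two output file names match those written by the example code.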

The requirements of this assignment include:

You must modify and execute the indexer against the CACM corpus. Although this will not build a complete index, it will demonstrate key concepts such as:

-Traversing a directory of documents

-Reading the document and extracting and tokenizing all of the text

-Computing counts of documents and terms

-Building a dictionary of unique terms that exist within the corpus

-Writing out to a disk file, a sorted term dictionary

As we will see in coming units, the ability to count terms and documents and to compile other metrics is vital to information retrieval, and this first assignment demonstrates some of those processes.

Your terms dictionary and documents dictionary files must be stored on disk and uploaded as part of your completed assignment. Your indexer must tokenize the contents of each document and extract terms.

Your indexer must report statistics on its processing and must print these statistics as the final output of the program.

o Number of documents processed

o Total number of terms parsed from all documents

o Total number of unique terms found and added to the index
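The three statistics can be computed in a single pass over the documents. The following is a small sketch under stated assumptions: the function name `corpus_statistics` and the two sample strings are hypothetical, and tokens are split on whitespace only, as in the example indexer.

```python
def corpus_statistics(doc_texts):
    """Return (documents, tokens, unique terms) for a list of document strings."""
    documents = len(doc_texts)
    tokens = 0
    unique_terms = set()
    for text in doc_texts:
        words = text.split()        # whitespace tokenization
        tokens += len(words)        # every occurrence counts as a token
        unique_terms.update(words)  # a set keeps one copy of each term
    return documents, tokens, len(unique_terms)

# Hypothetical two-document corpus
sample = ["information retrieval index", "index terms and index files"]
documents, tokens, terms = corpus_statistics(sample)
print("Documents %i" % documents)
print("Tokens %i" % tokens)
print("Terms %i" % terms)
```

The token count includes repeated words, while the term count includes each distinct word once, which is why the two numbers differ.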

When you have completed coding and testing your indexer program, you must execute your indexer against the corpus of documents in the cacm.zip file which can be downloaded in the resources section of Unit 2.

Capture the statistics output from your program after running it against the corpus. Your statistics must include all of the statistics listed above. You can capture the statistics by copying and pasting the output of your program directly into a document that you upload as part of your assignment, or you can manually record each statistic and include it in your assignment posting.

If you are unable to complete the programming or you have trouble getting your code to work, you can submit the work that you have completed to solicit the feedback of your peers. However, it is suggested that you use the course forum to post any difficulties you are having and seek the help of peers.

As you work with the indexer and corpus make note of your observations and provide a summary of your observations when posting your assignment. Examples of observations might include content of the data, running time, efficiency of the program and other observations.
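For observations about running time, one option is to time individual steps rather than only printing start and end clock times. The sketch below compares the list-based uniqueness check used in the example indexer with a set-based alternative on synthetic data. The function names and the token stream are hypothetical, and the measured times will vary from machine to machine.

```python
import time

def unique_terms_list(tokens):
    """Deduplicate with a list, as the example indexer does (linear scan per token)."""
    seen = []
    for token in tokens:
        if token not in seen:
            seen.append(token)
    return seen

def unique_terms_set(tokens):
    """Deduplicate with a set, then sort the unique terms."""
    return sorted(set(tokens))

# Hypothetical synthetic token stream: 10,000 sorted tokens, 1,000 distinct terms
tokens = sorted("term%04d" % (i % 1000) for i in range(10000))

start = time.time()
from_list = unique_terms_list(tokens)
list_seconds = time.time() - start

start = time.time()
from_set = unique_terms_set(tokens)
set_seconds = time.time() - start

print("list: %.3f s, set: %.3f s" % (list_seconds, set_seconds))
```

Both functions return the same sorted list of unique terms for sorted input, so the comparison isolates the cost of the membership test.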

Peer Assessment Criteria

This assignment will have four elements for peer assessment. Keep in mind that as part of your assessment process you should review and respond to the assessment questions and provide substantive feedback. Your instructor will be monitoring the quality of the feedback that you provide and a portion of your grade will be based upon the feedback that you provide to your peers.

Feedback can take the form of suggestions on how to improve the project, providing assistance to help fellow students complete their assignment, sharing best practices, tips, or resources that you have found useful or explaining concepts to your peers.

The four elements required of the assignment include:

The indexer python code uploaded as part of the submission

The documents.dat and index.dat uploaded as part of the submission

The metrics produced when the indexer was executed

A description of the assignment and observations made while running the indexer against the corpus

Google drive link to needed files: https://drive.google.com/open?id=1XovU8ZspaSp-3lq3Tp5jRKwzRjVVk04C

When submitting your assignment please make sure to include the following items:

1. A description of your assignment and your observations made while completing the assignment

2. Upload your modified python code.

3. Upload the generated index.dat and documents.dat files.

4. Include the metrics produced when your indexer was executed against the CACM corpus.
