Question
The output for the indexer that we started to develop in unit 2, and are continuing to develop in this unit (unit 3), includes statistics such as the number of documents, the total number of terms, and the number of unique terms in the collection that were added to the dictionary of the inverted index. Heaps' law provides a formula that can be used to estimate the number of unique terms in a collection, based upon constants k and b and the number of terms or tokens (T) parsed from all documents:
M = kT^b
In section 5.1.1 of the textbook (page 88), we are given typical values for both k and b: k typically ranges between 10 and 100, and b between 0.4 and 0.6. Using the formula for Heaps' law, calculate the estimated size of the vocabulary (M) from the total number of terms parsed from all documents, as reported when running your indexer program. Given that both k and b are typically found through empirical analysis, assume that k is 40 and b is 0.50.

Compare the estimate with the total number of unique terms found and added to the index, as reported by your indexer program; this statistic represents the actual size of the vocabulary in your collection. Report your findings in a posting in the unit 3 discussion forum. If the size of the vocabulary estimated by Heaps' law is not consistent with the vocabulary discovered by your indexer process, speculate on why this may have occurred. Consider that the discrepancy may be uncovering a flaw in your program, or that the corpus you are using may be limited in vocabulary due to its subject content. Discuss your findings with your peers and provide feedback to at least 3 peers on this submission.
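For illustration, the estimate takes only a few lines of Python. The token count below is a placeholder; substitute the Tokens figure that your own indexer reports:

# Estimate vocabulary size with Heaps' law: M = k * T**b
k = 40.0
b = 0.50
T = 100000  # placeholder value; replace with the token count reported by your indexer
M = k * T ** b
print 'Estimated vocabulary size (M): %d' % int(M)  # prints 12649 for T = 100000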
You must post your initial response before being able to review other students' responses. Once you have made your first response, you will be able to reply to other students' posts. You are expected to make a minimum of 3 responses to your fellow students' posts.
Previous code provided:
The following example presents what our text calls single-pass in-memory indexing (SPIMI). The indexer has been developed using the Python scripting language. Your assignment will be to use this code to gain an understanding of how to generate an inverted index.
This simple Python code will read through a directory of documents, tokenize each document, and add the terms extracted from the files to an index. The program will generate metrics from the corpus and will produce two files: a document dictionary file and a terms dictionary file.
The terms dictionary file will contain each unique term discovered in the corpus, in sorted order, with a unique index number assigned to each term. The document dictionary will contain each document discovered in the corpus, also with a unique index number assigned to each document.
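For illustration, a few entries from a terms dictionary file might look like the following (the terms here are made up; the term,index format of each line follows from the write statements in the code below):

algorithm,1
analysis,2
array,3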
From our reading assignments we should recognize that a third file is required to link the terms to the documents they were discovered in, using the index numbers. Generating this third file will be a future assignment; a sketch of the kind of structure involved appears below.
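As a preview of that structure (a minimal sketch, not the required solution for the future assignment), a term-to-documents mapping in Python could be kept as a dictionary of sets; the names postings and add_occurrence are hypothetical:

# Sketch of a postings structure linking each term to the documents it occurs in.
# 'postings' and 'add_occurrence' are hypothetical names, and the document
# index numbers are assumed to match those written to documents.dat.
postings = {}

def add_occurrence(term, docindex):
    # Record that 'term' was seen in the document numbered 'docindex'.
    postings.setdefault(term, set()).add(docindex)

add_occurrence('algorithm', 1)
add_occurrence('algorithm', 3)
print sorted(postings['algorithm'])  # prints [1, 3]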
We will be using a small corpus of files that contain article and author information from articles submitted to the journal Communications of the ACM.
The corpus is in a zip file in the resources section of this unit, as is the example Python code.
You will either need to have a current version of Python 2.x installed on your computer, or you can use the University of the People virtual lab to complete the assignment, as the lab already has Python installed.
You will need to modify the code to change the directory where the files are found to match your environment. Although you can download the Python file, the contents of the file are as follows:
# Example code in python programming language demonstrating some of the features of an inverted index.
# In this example, we scan a directory containing the corpus of files. (In this case the documents are reports on articles
# and authors submitted to the journal "Communications of the Association for Computing Machinery".)
#
# In this example we see each file being read and tokenized (each word or term is extracted), and the tokens
# combined into a sorted list of unique terms.
#
# We also see the creation of a documents dictionary containing each document in sorted form with an index assigned to it.
# Each unique term is written out into a terms dictionary in sorted order with an index number assigned for each term.
# From our readings we know that to complete the inverted index all that we need to do is create a third file that will
# correlate each term with the list of documents that it was extracted from. We will do that in a later assignment.
##
# We can further develop this example by keeping a reference for each term of the documents that it came from and by
# developing a list of the documents thus creating the term and document dictionaries.
#
# As you work with this example, think about how you might enhance it to assign a unique index number to each term and to
# each document and how you might create a data structure that links the term index with the document index.
import sys,os,re
import time
# define global variables used as counters
tokens = 0
documents = 0
terms = 0
termindex = 0
docindex = 0
# initialize list variable
#
alltokens = []
alldocs = []
#
# Capture the start time of the routine so that we can determine the total running
# time required to process the corpus
#
t2 = time.localtime()
# set the name of the directory for the corpus
#
dirname = "c:\users\datai\cacm"
# For each document in the directory read the document into a string
#
allfiles = [f for f in os.listdir(dirname)]  # renamed from 'all' to avoid shadowing the built-in all()
for f in allfiles:
    documents += 1
    with open(dirname + '/' + f, 'r') as myfile:
        alldocs.append(f)
        # Replace newlines with spaces so terms at line boundaries are not run
        # together, then split on whitespace to extract the tokens.
        data = myfile.read().replace('\n', ' ')
        for token in data.split():
            alltokens.append(token)
            tokens += 1
# Open for write a file for the document dictionary
#
documentfile = open(dirname + '/' + 'documents.dat', 'w')
alldocs.sort()
for f in alldocs:
    docindex += 1
    # Write '\n' rather than os.linesep: a file opened in text mode already
    # translates '\n' to the platform's line ending on write.
    documentfile.write(f + ',' + str(docindex) + '\n')
documentfile.close()
#
# Sort the tokens in the list
alltokens.sort()
#
# Define a list for the unique terms
g = []
#
# Identify unique terms in the corpus (a linear scan of the list; a set would
# be faster, but a list keeps the example simple)
for i in alltokens:
    if i not in g:
        g.append(i)
terms = len(g)
# Output Index to disk file. As part of this process we assign an 'index' number to each unique term.
#
indexfile = open(dirname + '/' + 'index.dat', 'w')
for i in g:
    termindex += 1
    indexfile.write(i + ',' + str(termindex) + '\n')
indexfile.close()
# Print metrics on corpus
#
print 'Processing Start Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)
print "Documents %i" % documents
print "Tokens %i" % tokens
print "Terms %i" % terms
t2 = time.localtime()
print 'Processing End Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)
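Once the program has run, a quick sanity check is to recount the lines of the two generated files and confirm they match the Documents and Terms metrics printed above. This sketch assumes it is appended to the end of the script, so that dirname is still defined:

# Recount entries in the generated dictionary files as a sanity check.
with open(dirname + '/' + 'documents.dat') as f:
    print 'Documents in documents.dat: %i' % len(f.readlines())
with open(dirname + '/' + 'index.dat') as f:
    print 'Terms in index.dat: %i' % len(f.readlines())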