Question
The output of the indexer that we started to develop in unit 2, and are continuing to develop in this unit (unit 3), includes statistics such as the number of documents, the total number of terms, and the number of unique terms in the collection added to the index, that is, in the dictionary of the inverted index.

Heaps' law provides a formula that can be used to estimate the number of unique terms in a collection (M) based upon constants k and b and the number of terms or tokens (T) parsed from all documents: M = kT^b. Section 5.1.1 of the textbook (page 88) provides typical values for both k and b. The value of k typically ranges between 10 and 100, and b between 0.4 and 0.6.

Using the formula for Heaps' law, calculate the estimated size of the vocabulary (M) using the "total number of terms parsed from all documents" statistic reported when running your indexer program. Given that both k and b are typically found through empirical analysis, assume that k will be 40 and b will be 0.50. Compare the estimate with the "total number of unique terms found and added to the index" statistic reported by your indexer program, which represents the actual size of the vocabulary in your collection.

Report your findings in a posting response in the unit 3 discussion forum. If the size of the vocabulary estimated by Heaps' law is not consistent with the vocabulary discovered by your indexer process, speculate on why this may have occurred. Consider that this discrepancy may be uncovering a flaw in your program, or that the corpus you are using may be limited in vocabulary due to its subject content. Discuss your findings with your peers and provide feedback to at least 3 peers on this submission.
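The calculation the question asks for can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `heaps_estimate` and `compare_with_actual` are made up for this example, and the token and vocabulary counts in the usage section are hypothetical placeholders, not output from any real indexer run. Substitute the statistics your own indexer reports.

```python
def heaps_estimate(total_tokens: int, k: float = 40.0, b: float = 0.5) -> float:
    """Estimate vocabulary size with Heaps' law: M = k * T**b.

    total_tokens is T, the total number of terms parsed from all documents.
    The defaults k=40 and b=0.5 are the values the question tells us to assume.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return k * total_tokens ** b


def compare_with_actual(total_tokens: int, actual_unique: int,
                        k: float = 40.0, b: float = 0.5) -> tuple[float, float]:
    """Return (estimate, actual/estimate ratio) for a quick consistency check."""
    estimate = heaps_estimate(total_tokens, k, b)
    ratio = actual_unique / estimate if estimate else float("nan")
    return estimate, ratio


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; use your indexer's statistics.
    T = 1_000_000           # total terms parsed from all documents
    actual_vocab = 25_000   # unique terms added to the index
    estimate, ratio = compare_with_actual(T, actual_vocab)
    print(f"Heaps estimate M = {estimate:.0f}")
    print(f"Actual vocabulary = {actual_vocab} ({ratio:.2f} x estimate)")
```

With these placeholder inputs the estimate is 40 * sqrt(1,000,000) = 40,000 terms. A ratio well below 1 could point to a narrow-subject corpus or aggressive normalization (stemming, stop-word removal), while a ratio well above 1 could point to a tokenizer that fails to normalize case or punctuation.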