Question
In this exercise, you will create a simplified Lucene index. To get partial credit in case of miscalculations, please give detailed solutions.
Given the following documents:
D1: You say "goodbye", I say "hello, hello, hello"
D2: You say stop, I say go.
D3: "Hello, hello, hello," you say "goodbye".
D4: I say yes, you say no
1. (4 points) Build the inverted index for the documents.
a. Dictionary file:
e.g.
Term DocFreq
hello 2
I 3
b. Posting file (terms are implicit) e.g.
Doc # Frequency
1 3
3 3
c. Position file (terms are implicit from the dictionary file; use the absolute position of each term in the document, with 0 meaning the term does not occur in that document) e.g.
D1      D2   D3      D4
6,7,8   0    1,2,3   0
4       4    0       1
d. For the given query
Q: say goodbye
describe the process of searching the inverted index.
2. (2 points)
a. Estimate the total size of the inverted index files in bytes. Each number and each single character counts as 4 bytes. A string counts as its number of characters multiplied by 4 bytes. For example, the size of the string hello is 5*4 = 20 bytes.
b. Compare the result from 2a. to the total size of the documents in bytes.
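The three files and the query lookup can be checked with a short script. The following is a minimal sketch, not a reference solution and not Lucene's actual API. It assumes that terms are lowercased, that punctuation is dropped, and that positions start at 1 (which matches the hello example above). For the size estimate it assumes that only stored numbers are counted, with no zero placeholders for documents that lack a term, and that every character of a document (including spaces and punctuation) counts. All function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: 'You say "goodbye", I say "hello, hello, hello"',
    2: "You say stop, I say go.",
    3: '"Hello, hello, hello," you say "goodbye".',
    4: "I say yes, you say no",
}
UNIT = 4  # bytes per number or character, as stated in part 2a


def tokenize(text):
    # Assumption: case-fold and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    # term -> {doc_id: [1-based positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def dictionary_file(index):
    # One (term, document frequency) row per term.
    return [(term, len(index[term])) for term in sorted(index)]


def posting_file(index):
    # Per term: (doc_id, term frequency) pairs; terms are implicit.
    return [[(d, len(p)) for d, p in sorted(index[t].items())]
            for t in sorted(index)]


def search_and(index, query):
    # Look up each query term, then intersect the posting lists.
    doc_sets = [set(index.get(t, {})) for t in tokenize(query)]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def search_phrase(index, query):
    # Keep only documents where the terms occur at consecutive positions.
    terms = tokenize(query)
    hits = []
    for d in search_and(index, query):
        starts = index[terms[0]][d]
        if any(all(s + i in index[t][d] for i, t in enumerate(terms))
               for s in starts):
            hits.append(d)
    return hits


def index_size_bytes(index):
    # Assumption: no zero placeholders are stored in the position file.
    dictionary = sum(len(t) * UNIT + UNIT for t in index)
    postings = sum(2 * UNIT * len(docs) for docs in index.values())
    positions = sum(UNIT * len(p)
                    for docs in index.values() for p in docs.values())
    return dictionary, postings, positions


def documents_size_bytes(docs):
    # Assumption: every character, including spaces and punctuation, counts.
    return sum(len(text) * UNIT for text in docs.values())


index = build_index(DOCS)
```

Under these assumptions, `search_and(index, "say goodbye")` returns documents 1 and 3, and `search_phrase` confirms that both contain the two terms at adjacent positions. A different tokenization or a different storage layout for the position file would change the size estimate, so the numbers from `index_size_bytes` should be read together with the stated assumptions.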