
Question


2 Vector Space Ranking (6+8+4 pts)

Consider the following document collection/corpus D = {D1, D2} (given as one document per line, in that sequence):

D1: BettyBotter bought some Butter
D2: Then to Better Butter BettyBotter bought more Butter

Assume that the stopword list contains words that begin with a lower case letter, and that stopwords are eliminated during pre-processing. No other change is made to tokens to get terms (e.g., the words are neither stemmed nor case-folded).

For the given example, show (1) the dictionary and (2) the postings lists. Include all the relevant statistics, including the TF-IDF value as '(TF,IDF)' associated with each document id in the postings list, as detailed below.

The dictionary contains terms, their (corpus) cumulative frequency, their document frequency, and a pointer labelled Pi for the ith postings list. The dictionary terms must be in lexicographic order, and so must the document ids in the postings lists. A postings list can start with a label Pi (to denote the target ith postings list), followed by the list of document ids with the associated TF-IDF statistics.

The normalized length of a document is defined as the number of (non-stopword) term occurrences in the document. The term frequency factor (TF) is the number of occurrences of the term in a document divided by the normalized length of the document. (You can just write the two numbers separated by a '/'.) The inverse document frequency factor (IDF) is defined as the reciprocal of the number of documents that contain the term. For example, TF for the term "Buffalo" in the document "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" is 3/6, and the IDF for the term "Buffalo" is 1/1 because it is present in just one document (which is the entire corpus).
For concreteness, format the dictionary entries and postings lists under the headings shown below:

Dictionary:
TERM CUMULATIVE-FREQ DOCUMENT-FREQ Label-Pi (for ptr)

Postings lists:
(Target) Label-Pi DOC-ID: (TF,IDF) ... DOC-ID: (TF,IDF)

(3) Show the "relative" ranking of the documents for the query Butter, justifying it in terms of the relevance scores.
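The definitions in the question can be checked mechanically. The following is a minimal Python sketch, not an official solution: it splits on whitespace, treats any token starting with a lower case letter as a stopword, and uses TF and IDF exactly as defined in the question. The function names (`terms`, `build_index`, `rank`) are hypothetical, and scoring a document as the sum of TF × IDF over the query terms is an assumption, since the question does not fix a relevance formula.

```python
from collections import Counter
from fractions import Fraction

# Corpus exactly as given in the question.
DOCS = {
    1: "BettyBotter bought some Butter",
    2: "Then to Better Butter BettyBotter bought more Butter",
}


def terms(text):
    """Drop stopwords (tokens starting with a lower case letter).

    No stemming or case folding is applied, as the question requires.
    """
    return [tok for tok in text.split() if not tok[0].islower()]


def build_index(docs):
    """Return (dictionary, postings) under the question's definitions.

    dictionary: term -> (cumulative frequency, document frequency)
    postings:   term -> [(doc_id, TF, IDF), ...] sorted by doc_id
    """
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    dictionary, postings = {}, {}
    vocabulary = sorted({t for c in counts.values() for t in c})
    for term in vocabulary:  # lexicographic order
        containing = [d for d in sorted(counts) if term in counts[d]]
        idf = Fraction(1, len(containing))  # reciprocal of document frequency
        postings[term] = [
            # TF = occurrences in the document / normalized document length
            (d, Fraction(counts[d][term], sum(counts[d].values())), idf)
            for d in containing
        ]
        cumulative = sum(counts[d][term] for d in containing)
        dictionary[term] = (cumulative, len(containing))
    return dictionary, postings


def rank(query, postings):
    """Assumed scoring: sum of TF * IDF over the non-stopword query terms."""
    scores = {}
    for term in terms(query):
        for doc_id, tf, idf in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, Fraction(0)) + tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    dictionary, postings = build_index(DOCS)
    for i, term in enumerate(dictionary, start=1):
        cf, df = dictionary[term]
        print(f"{term} {cf} {df} P{i}")
    for i, term in enumerate(postings, start=1):
        entries = "  ".join(f"D{d}: ({tf},{idf})" for d, tf, idf in postings[term])
        print(f"P{i} {entries}")
    print(rank("Butter", postings))
```

Under these assumptions, D1 has normalized length 2 and D2 has normalized length 5, so for the query Butter the sketch scores D1 at (1/2)(1/2) = 1/4 and D2 at (2/5)(1/2) = 1/5, ranking D1 above D2. Note that `Fraction` prints values in lowest terms, so an IDF of 1/1 is displayed as 1.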
