Question 2

a) Write code that opens the file "term_data.txt" and loads data into the following variables, in this order:

termCount = number of times the term appeared in the document

length = total word count for the document

docCount = number of documents the term appears in at least once

totalDocs = total number of documents in the collection

Hint: You will need to include the right header file to complete this question.
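A minimal sketch of the loading step in part a), assuming the course language is C++ and that the header the hint refers to is `<fstream>`. The `TermData` struct and the `loadTermData` helper are hypothetical names introduced here for illustration; the variable names and their order come from the question.

```cpp
#include <fstream>
#include <string>

// Hypothetical container for the four values, in the order the question lists them.
struct TermData {
    int termCount = 0;  // times the term appeared in the document
    int length = 0;     // total word count for the document
    int docCount = 0;   // documents the term appears in at least once
    int totalDocs = 0;  // total documents in the collection
};

// Returns true only if the file opened and all four integers were read.
bool loadTermData(const std::string& path, TermData& data) {
    std::ifstream in(path);
    if (!in) {
        return false;  // file missing or unreadable
    }
    in >> data.termCount >> data.length >> data.docCount >> data.totalDocs;
    return !in.fail();
}
```

A `main` function would call `loadTermData("term_data.txt", data)` and report an error if it returns false, rather than computing with half-read values.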

b) Continue by adding code that calculates tf, idf, and tf-idf, and prints all three to the console.

Question 3

It turns out that when calculating tf-idf, we frequently run into problems with underflow. Also, in large document collections, the idf value is so big that it dominates the calculation. So, we take the logarithm of the idf to dampen its effect. Which logarithm? It doesn't really matter. Base 2, 10, or e are all used.

Unfortunately, there's still a problem. The log of 0 is undefined. If you are getting your document frequencies from your collection, then this is not a problem. Any word in a document in your collection will appear in at least one document in your collection, by definition. But sometimes instead we take the document frequency from somewhere else, and a word that appears in our document is not in that collection at all. So, to compensate for this, we also add a "smoothing" term, which is a fancy way of saying, we add 1. So now, you would calculate idf like this:

idf = log (totalDocs/(docCount + 1))

Duplicate your program from question 2, and then modify it to calculate idf, and therefore tf-idf, using this new formula.

Hint: You may need to include a new header file to make this work!
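A sketch of the modified calculation, assuming C++ and that the new header is `<cmath>`, which provides `std::log` (the natural logarithm; the question allows any base). The idf formula is the one given above; tf = termCount / length is an assumption carried over from question 2, and the function names are hypothetical.

```cpp
#include <cmath>
#include <iostream>

// idf = log(totalDocs / (docCount + 1)), as given in the question.
// The + 1 keeps the denominator non-zero when docCount is 0.
double smoothedIdf(int docCount, int totalDocs) {
    return std::log(static_cast<double>(totalDocs) / (docCount + 1));
}

// tf-idf with tf assumed to be termCount / length.
double smoothedTfIdf(int termCount, int length, int docCount, int totalDocs) {
    const double tf = static_cast<double>(termCount) / length;
    return tf * smoothedIdf(docCount, totalDocs);
}

// Prints the smoothed idf and the resulting tf-idf to the given stream.
void printSmoothedScores(std::ostream& out, int termCount, int length,
                         int docCount, int totalDocs) {
    out << "idf = " << smoothedIdf(docCount, totalDocs) << '\n'
        << "tf-idf = "
        << smoothedTfIdf(termCount, length, docCount, totalDocs) << '\n';
}
```

With the sample values 12, 745, 1459, and 1000000, the natural-log idf is about 6.529 and tf-idf is about 0.105. Using `std::log2` or `std::log10` instead only rescales both numbers by a constant factor.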

Question 4 (Bonus): Flowchart

Make a flowchart for the program from question 1 or 3; your choice which one.

Contents of term_data.txt: 12 745 1459 1000000

