Question


Posted on Sep 11, 2024


Question 2

a) Write code that opens the file "term_data.txt" and loads data into the following variables, in this order:

termCount = number of times the term appeared in the document

length = total word count for the document

docCount = number of documents the term appears in at least once

totalDocs = total number of documents in the collection

Hint: You will need to include the right header file to complete this question.

b) Continue by adding code that calculates tf, idf, and tf-idf, and prints all three to the console.

Question 3

It turns out that when calculating tf-idf, we frequently run into problems with underflow. Also, in large document collections, the idf value is so big that it dominates the calculation. So, we take the logarithm of the idf to dampen its effect. Which logarithm? It doesn't really matter. Base 2, 10, or e are all used.
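The reason the base does not matter is that logarithms in different bases differ only by a constant factor, so the relative ordering of terms is unchanged. A small sketch, assuming C++ and `<cmath>` (the helper name `logInBase` is hypothetical):

```cpp
#include <cmath>

// Change-of-base identity: log_b(x) = ln(x) / ln(b).
// Any base can be obtained from the natural log by dividing by a constant.
double logInBase(double x, double base) {
    return std::log(x) / std::log(base);
}
```

For example, `logInBase(x, 2.0)` agrees with `std::log2(x)` and `logInBase(x, 10.0)` agrees with `std::log10(x)`, up to floating-point rounding.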

Unfortunately, there's still a problem. The log of 0 is undefined. If you are getting your document frequencies from your collection, then this is not a problem. Any word in a document in your collection will appear in at least one document in your collection, by definition. But sometimes we instead take the document frequency from somewhere else, and a word that appears in our document is not in that collection at all. So, to compensate for this, we also add a "smoothing" term, which is a fancy way of saying we add 1. So now, you would calculate idf like this:

idf = log(totalDocs / (docCount + 1))

Duplicate your program from question 2, and then modify it to calculate idf, and therefore tf-idf, using this new formula.

Hint: You may need to include a new header file to make this work!
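The smoothed formula could be sketched as follows, assuming C++ with `std::log` (natural log) from `<cmath>`. The function names are hypothetical, and tf = termCount / length is an assumed definition, since the assignment does not state it. Note that with `docCount` equal to 0 the denominator becomes 1, so the result stays finite.

```cpp
#include <cmath>

// Smoothed, log-scaled idf from Question 3: log(totalDocs / (docCount + 1)).
// Adding 1.0 also forces floating-point division instead of integer division.
double smoothedIdf(int docCount, int totalDocs) {
    return std::log(static_cast<double>(totalDocs) / (docCount + 1.0));
}

// tf-idf using the smoothed idf; tf = termCount / length is an assumption.
double smoothedTfIdf(int termCount, int length, int docCount, int totalDocs) {
    double tf = static_cast<double>(termCount) / length;
    return tf * smoothedIdf(docCount, totalDocs);
}
```

With the sample data (12, 745, 1459, 1000000), the smoothed idf comes out near 6.53 rather than roughly 685 for the unlogged ratio, which illustrates the dampening effect described above.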

Question 4 (Bonus): Flowchart

Make a flowchart for the program from question 1 or 3; your choice which one.

Contents of term_data.txt: 12 745 1459 1000000
