Question 2
a) Write code that opens the file "term_data.txt" and loads data into the following variables, in this order:
termCount = number of times the term appeared in the document
length = total word count for the document
docCount = number of documents the term appears in at least once
totalDocs = total number of documents in the collection
Hint: You will need to include the right header file to complete this question.
b) Continue by adding code that calculates tf, idf, and tf-idf, and prints all three to the console.
Question 3
It turns out that when calculating tf-idf, we frequently run into problems with underflow. Also, in large document collections, the idf value is so big that it dominates the calculation. So, we take the logarithm of the idf to dampen its effect. Which logarithm? It doesn't really matter. Base 2, 10, or e are all used.
Unfortunately, there's still a problem. The log of 0 is undefined. If you are getting your document frequencies from your collection, then this is not a problem. Any word in a document in your collection will appear in at least one document in your collection, by definition. But sometimes we instead take the document frequency from somewhere else, and a word that appears in our document is not in that collection at all. So, to compensate for this, we also add a "smoothing" term, which is a fancy way of saying we add 1. So now, you would calculate idf like this:
idf = log (totalDocs/(docCount + 1))
Duplicate your program from question 2, and then modify it to calculate idf, and therefore tf-idf, using this new formula.
Hint: You may need to include a new header file to make this work!
Question 4 (Bonus): Flowchart
Make a flowchart for the program from question 1 or 3; your choice which one.
Contents of term_data.txt: 12 745 1459 1000000