
Question

There are two files. One contains the text to be read and analyzed, and is named HarryPotter.txt. As the name implies, this file contains the full text of Harry Potter and the Sorcerer's Stone.

For your convenience, all the punctuation has been removed, all the words have been converted to lowercase, and the entire document is written on a single line. The other file, ignoreWords.txt, contains the 50 most common words in the English language, which your program will ignore. Your program must take three command line arguments, in the following order: a number N, the name of the text file to be read, and the name of the text file with the words that should be ignored. It will read in the text (ignoring the words in the ignore file) and store all unique words in a dynamically doubling array. It should then calculate and print the following information:

- The number of array doublings needed to store all the unique words
- The number of unique non-ignore words in the file
- The total word count of the file (excluding the ignore words)
- The N most frequent words along with their probability of occurrence (up to 4 decimal places)

1. Use an array of structs to store the words and their counts

There is an unknown number of words in the file. You will store each unique word and its count (the number of times it occurs in the document). Because of this, you will need to store these words in a dynamically sized array of structs. The struct must be defined as follows:

    struct wordItem {
        string word;
        int count;
    };
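The per-word bookkeeping can be sketched as follows. This is a non-authoritative sketch: the helper names `findWord` and `addWord` are illustrative choices, not functions required by the assignment.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Returns the index of `word` in words[0..length), or -1 if it is absent.
int findWord(const std::string& word, wordItem words[], int length) {
    for (int i = 0; i < length; ++i) {
        if (words[i].word == word) return i;
    }
    return -1;
}

// Records one occurrence of `word`: bumps the count if the word is already
// stored, otherwise appends a new wordItem with count 1.
// Assumes the caller has already ensured there is free space (length < capacity).
void addWord(const std::string& word, wordItem words[], int& length) {
    int idx = findWord(word, words, length);
    if (idx >= 0) {
        words[idx].count++;
    } else {
        words[length].word = word;
        words[length].count = 1;
        ++length;
    }
}
```

The linear scan in `findWord` is one simple option; the assignment does not mandate a particular lookup strategy.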

2. Use the array-doubling algorithm to increase the size of your array

Your array will need to grow to fit the number of words in the file. Start with an array size of 100, and double the size whenever the array runs out of free space. You will need to allocate your array dynamically and copy values from the old array to the new array. Note: don't use the built-in std::vector class; this will result in a loss of points. You're actually writing the code that the built-in vector uses behind the scenes!
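The doubling step described above can be sketched like this. The function name `doubleArray` and the by-reference capacity parameter are my own choices for illustration, not part of the spec.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Doubles the capacity of a dynamic wordItem array: allocates new storage
// of twice the size, copies the old contents over, and frees the old block.
// Updates `capacity` and returns the pointer to the new array.
wordItem* doubleArray(wordItem* oldArr, int& capacity) {
    int newCapacity = capacity * 2;
    wordItem* newArr = new wordItem[newCapacity];
    for (int i = 0; i < capacity; ++i) {
        newArr[i] = oldArr[i];   // struct copy: both word and count
    }
    delete[] oldArr;             // release the old storage
    capacity = newCapacity;
    return newArr;
}
```

A caller would typically do `arr = doubleArray(arr, capacity);` whenever the number of stored words reaches the current capacity, incrementing a doubling counter each time.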

3. Ignore the top 50 most common words that are read in from the second file

To get useful information about word frequency, we will be ignoring the 50 most common words in the English language. These words will be read in from a file whose name is the third command line argument.
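The two stop-word functions required later in this writeup (parts 7b and 7c) could be sketched as below. This is one possible implementation under the spec's guarantee of exactly 50 stop words; the constant name `NUM_STOP_WORDS` is my own.

```cpp
#include <fstream>
#include <iostream>
#include <string>

const int NUM_STOP_WORDS = 50;  // the spec guarantees exactly 50 stop words

// Part (b): read the stop words from ignoreWordFileName into ignoreWords.
// Prints the required error message if the file fails to open.
void getStopWords(const char* ignoreWordFileName, std::string ignoreWords[]) {
    std::ifstream in(ignoreWordFileName);
    if (!in.is_open()) {
        std::cout << "Failed to open " << ignoreWordFileName << std::endl;
        return;
    }
    for (int i = 0; i < NUM_STOP_WORDS && (in >> ignoreWords[i]); ++i) {
        // whitespace-delimited extraction reads one word per iteration
    }
}

// Part (c): linear scan over the 50 stop words.
bool isStopWord(std::string word, std::string ignoreWords[]) {
    for (int i = 0; i < NUM_STOP_WORDS; ++i) {
        if (ignoreWords[i] == word) return true;
    }
    return false;
}
```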

4. Take three command line arguments

Your program must take three command line arguments: a number N, which tells your program how many of the most frequent words to print; the name of the text file to be read and analyzed; and the name of the text file with the words that should be ignored.

5. Output the top N most frequent words

Your program should print out the top N most frequent words in the text (not including the common words), where N is passed in as a command line argument.

6. Format your output this way:

    Array doubled: #
    Unique non-common words: #
    Total non-common words: #
    Probabilities of top <N> most frequent words
    ---------------------------------------
    <1st highest probability> - <most frequent word>
    <2nd highest probability> - <2nd most frequent word>
    ...

7. You must include the following functions (they will be tested by the autograder):

a. In your main function:
   i. If the correct number of command line arguments is not passed, print the statement below and exit the program:
      std::cout << "Usage: Assignment2Solution " << std::endl;
   ii. Get stop-words/common-words from ignoreWords.txt and store them in an array (call your getStopWords function).
   iii. Read words from HarryPotter.txt and store all unique words that are not ignore-words in an array of structs:
      1. Create a dynamic wordItem array of size 100.
      2. Add non-ignore words to the array (double the array size if the array is full).
      3. Keep track of the number of times the wordItem array is doubled and the number of unique non-ignore words.

b. void getStopWords(const char *ignoreWordFileName, string ignoreWords[]);
   This function should read the stop words from the file with the name stored in ignoreWordFileName and store them in the ignoreWords array. You can assume there will be exactly 50 stop words. There is no return value. In case the file fails to open, print an error message using the cout statement below:
      std::cout << "Failed to open " << ignoreWordFileName << std::endl;

c. bool isStopWord(string word, string ignoreWords[]);
   This function should return whether word is in the ignoreWords array.

d. int getTotalNumberNonStopWords(wordItem uniqueWords[], int length);
   This function should compute the total number of words in the entire document by summing up all the counts of the individual unique words. The function should return this sum. For example, in the quote below, there are 9 total non-stop words, since "about" and "the" are stop words and "programmers" has a count of 2.

e. void arraySort(wordItem uniqueWords[], int length);
   This function should sort the uniqueWords array (which contains length initialized elements) by word count such that the most frequent words are sorted to the beginning. The function doesn't return anything.

f. void printTopN(wordItem uniqueWords[], int topN, int totalNumWords);
   This function should print out the first topN words of the sorted array uniqueWords. The exact format of this printing is described in part (6) of this writeup; you only need to print the last part (the most frequent words with their probability of occurrence up to 4 decimal places). It doesn't return anything. The probability of occurrence of a word at position ind in the array is computed using the formula below (don't forget to cast to float!):
      probability-of-occurrence = (float) uniqueWords[ind].count / totalNumWords
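Parts (d), (e), and (f) above could be sketched as follows. This is a sketch, not the autograder's reference solution: the spec does not mandate a sorting algorithm, so a simple selection sort is used here, and the 4-decimal-place rounding is done with `std::fixed` and `std::setprecision`.

```cpp
#include <iostream>
#include <iomanip>
#include <string>
#include <utility>

struct wordItem {
    std::string word;
    int count;
};

// Part (d): total word count = sum of all individual unique-word counts.
int getTotalNumberNonStopWords(wordItem uniqueWords[], int length) {
    int total = 0;
    for (int i = 0; i < length; ++i) {
        total += uniqueWords[i].count;
    }
    return total;
}

// Part (e): sort so the most frequent words come first.
// Selection sort is one simple choice for this assignment's scale.
void arraySort(wordItem uniqueWords[], int length) {
    for (int i = 0; i < length - 1; ++i) {
        int maxIdx = i;
        for (int j = i + 1; j < length; ++j) {
            if (uniqueWords[j].count > uniqueWords[maxIdx].count) maxIdx = j;
        }
        std::swap(uniqueWords[i], uniqueWords[maxIdx]);
    }
}

// Part (f): print "<probability> - <word>" for the first topN entries,
// with the probability fixed to 4 decimal places.
void printTopN(wordItem uniqueWords[], int topN, int totalNumWords) {
    for (int ind = 0; ind < topN; ++ind) {
        float prob = (float)uniqueWords[ind].count / totalNumWords;
        std::cout << std::fixed << std::setprecision(4)
                  << prob << " - " << uniqueWords[ind].word << std::endl;
    }
}
```

Note the cast to float in `printTopN`: without it, the integer division `count / totalNumWords` would truncate to 0 for every word.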

Ignore words:

the be to of and a in that have i it for not on with he as you do at this but his by from they we say her she or an will my one all would there their what so up out if about who get which go me

