Overview In this assignment you'l implement a data structure called a trie, which is used to answer queries regarding the characteristics of a text file

image text in transcribed

Overview In this assignment you'l implement a data structure called a trie, which is used to answer queries regarding the characteristics of a text file (e.g., frequency of a given word). This write-up introduces the concept of a trie, specifies the API you're expected to implement, and outlines submission instructions as well as the grading rubric. Please carefully read the entire write-up before you begin coding your submission. TrieS A trie is an example of a tree data structure that compactly stores the necessary information to answer frequencies of the words within a given text file. As in any compression scheme, a trie takes advantage of repetitions in the file; in the case of a text file, this means leveraging the fact that a) many words will likely occur several times throughout the file and b) many word share common prefixes. In particular, consider a file that only contains the following sentence Fred, Frank, and Fanny are all a fan of Fred. First, we note that for the sake of simplicity, we will first strip the file of punctuation and make every letter lowercase before constructing the trie. Therefore, our file after preforming this preprocessing now becomes: fred frank and fanny are al1 a fan of fred The trie we then construct for this file is shown in Figure 1 (on page 2). Specifically, the details of the construction are as follows The trie has a root node that has no information associated with it; it will just serve as the starting point of all of queries on the trie. root (f,0) (o,0) Figure 1: lustration of a trie for the sentence "Fred, Frank, and Fanny are all a fan of Fred. . Each non-root node stores character-frequency pair, where the frequency represents the number of times the word/prefix formed by observing characters from the root to the node appears in the text file. For example, the left-most node contains the pair (d, 2), since the path from the root to this node forms the word "fred", and fred" appears in the file twice; conversely, its parent node contains the pair (e, 0), as the word "fre" does not appear as a word in the file. Finally, observe that the trie maintains that the same prefir does not appear along multiple paths from the root. For instance, looking at the path which stores both the words "fan" and "fanny" it would be invalid to instead create a separate path from the (f,0) node storing the pairs (a,0) and (n, 1) (as then the prefix "fan" would now appear twice in the trie) Your task in this assignment will be to implement a trie class in Java (Trie.java), which will support methods for constructing a trie, as well methods that query the trie for information about the file after the trie has been constructed In term of the implementation for this project, we will think of a trie a structure we update as we scan through a fil, adding (or updating the frequencies) of the nodes as we observe new words API Details Your Trie.java class should support the following methods . Trie): Construct an "empty" trie, which just contains a root node void addWord (String word): Adds the word word to the trie, or increments the frequency of the word by one if it already exists in the trie. You may assume that word only contains lowercase letters int wordFrequency(String word): Return the number of times the word word ap- pears in the file used to construct the trie. . String kthword(int): Returns the string that, if we were take the list of unique words in the from the file and sort them alphabetically, would be at the kth position of this list (where the first word in the list is at position 1) For example, the sorted sequence of unique words from our earlier example is: a all and are fan fanny frank fred of and so kthWord (5) would return "fan 3 Starting Code and Submission On Canvas, there are three files you'l use to start your project: TestTrie.java, Trie.java, and alice.txt (which is a txt file of the Lewis Carrols Alice's Adventure in Wonderland) To set up your project, create a project in NetBeans called Test Trie. Then use TestTrie.java as the project main file, and then add Trie.java to the testtrie package folder. alice.txt should be placed in the top directory of the project (it's fine to place it somewhere else, but you'll have to change the path of the file in TestTrie.java in order to run the tests) Once you've implemented your trie class in Trie.java, you should be able to run TestTrie.java as is (i.e., you should not be modifying this file to make things work). There are ten tests in the file, which each simply make a call to either wordFrequency() or kthWord) and prints the result to standard out. The correct outputs are given in comments next to the function calls Please submit your Trie.java file on Canvas (we will use Canvas for submission of code files). There's no need to sub TestTrie.java or alice.txt; if you have additional .java files, submit them (along with Trie.java) in a compressed folder; however, ' also note that for this assignment, I'd prefer that the entire implementation is written in Trie.java. Evaluation Your submissionwll be scored a 50 point scale. The point breakdown by category is as follows . Overall Approach and Code Readability 10 points: This category will be scored based on the overall structure and approach of your implementation. For exam- ple, did you make good decisions as to what should be the private data for your class? How readable is your code? Did you include useful comments? Can I understand (with not too much effort) how you decided to approach your implementation? Method Implementation [20 points]: For this category, I will pick two of meth- ods required in the assignment API (in this assignment, two out of either addWord), wordFrequency, or kthWord) and do a more detailed evaluation of the imple tation based on correctness, efficiency, and code readability .Empirical Testing [20 point; 10 points public; 10 points private]: For this category, you will receive a grade based on tests I'll run on your code. Half of the points will come from public tests, which are included TestTrie.java. The other ten points will come from private tests, which I won't released prior to the submission deadline therefore, I highly encourage you to write your own tests in anticipation of what my tests may check for) Comments and Tips Since a trie is comprised of nodes, it would be a good idea to make a separate node class (ideally an inner class) that your trie uses to maintain its structure. Think carefully about what private data should belong to a single node, and what methods it will need to support. Implementing kthWord ) will likely be the most difficult part of the assignment. I'll make the following comments about this method: Implementing this method will be much easier (and likely much more efficient) if you keep your trie organized. For example, other than satisfying the properties of a trie, the trie in figure1 has almost no organization (e.g., what do you think is a better way to arrange the nodes?) -In terms of the actual algorithm, think in terms of the traversals we discussed in class for binary trees (e.g., preorder, inorder). In particular, writing a recursive helper method will likely be the best approach. Your algorithm should only work with the trie itself; therefore, do not store a list a unique list of words from the file, sort it, and then look up the kth word in the list (a trie is a compression scheme, and so doing this defeats the entire purpose of the trie). You will receive little to no credit if this is how you implement this method. Overview In this assignment you'l implement a data structure called a trie, which is used to answer queries regarding the characteristics of a text file (e.g., frequency of a given word). This write-up introduces the concept of a trie, specifies the API you're expected to implement, and outlines submission instructions as well as the grading rubric. Please carefully read the entire write-up before you begin coding your submission. TrieS A trie is an example of a tree data structure that compactly stores the necessary information to answer frequencies of the words within a given text file. As in any compression scheme, a trie takes advantage of repetitions in the file; in the case of a text file, this means leveraging the fact that a) many words will likely occur several times throughout the file and b) many word share common prefixes. In particular, consider a file that only contains the following sentence Fred, Frank, and Fanny are all a fan of Fred. First, we note that for the sake of simplicity, we will first strip the file of punctuation and make every letter lowercase before constructing the trie. Therefore, our file after preforming this preprocessing now becomes: fred frank and fanny are al1 a fan of fred The trie we then construct for this file is shown in Figure 1 (on page 2). Specifically, the details of the construction are as follows The trie has a root node that has no information associated with it; it will just serve as the starting point of all of queries on the trie. root (f,0) (o,0) Figure 1: lustration of a trie for the sentence "Fred, Frank, and Fanny are all a fan of Fred. . Each non-root node stores character-frequency pair, where the frequency represents the number of times the word/prefix formed by observing characters from the root to the node appears in the text file. For example, the left-most node contains the pair (d, 2), since the path from the root to this node forms the word "fred", and fred" appears in the file twice; conversely, its parent node contains the pair (e, 0), as the word "fre" does not appear as a word in the file. Finally, observe that the trie maintains that the same prefir does not appear along multiple paths from the root. For instance, looking at the path which stores both the words "fan" and "fanny" it would be invalid to instead create a separate path from the (f,0) node storing the pairs (a,0) and (n, 1) (as then the prefix "fan" would now appear twice in the trie) Your task in this assignment will be to implement a trie class in Java (Trie.java), which will support methods for constructing a trie, as well methods that query the trie for information about the file after the trie has been constructed In term of the implementation for this project, we will think of a trie a structure we update as we scan through a fil, adding (or updating the frequencies) of the nodes as we observe new words API Details Your Trie.java class should support the following methods . Trie): Construct an "empty" trie, which just contains a root node void addWord (String word): Adds the word word to the trie, or increments the frequency of the word by one if it already exists in the trie. You may assume that word only contains lowercase letters int wordFrequency(String word): Return the number of times the word word ap- pears in the file used to construct the trie. . String kthword(int): Returns the string that, if we were take the list of unique words in the from the file and sort them alphabetically, would be at the kth position of this list (where the first word in the list is at position 1) For example, the sorted sequence of unique words from our earlier example is: a all and are fan fanny frank fred of and so kthWord (5) would return "fan 3 Starting Code and Submission On Canvas, there are three files you'l use to start your project: TestTrie.java, Trie.java, and alice.txt (which is a txt file of the Lewis Carrols Alice's Adventure in Wonderland) To set up your project, create a project in NetBeans called Test Trie. Then use TestTrie.java as the project main file, and then add Trie.java to the testtrie package folder. alice.txt should be placed in the top directory of the project (it's fine to place it somewhere else, but you'll have to change the path of the file in TestTrie.java in order to run the tests) Once you've implemented your trie class in Trie.java, you should be able to run TestTrie.java as is (i.e., you should not be modifying this file to make things work). There are ten tests in the file, which each simply make a call to either wordFrequency() or kthWord) and prints the result to standard out. The correct outputs are given in comments next to the function calls Please submit your Trie.java file on Canvas (we will use Canvas for submission of code files). There's no need to sub TestTrie.java or alice.txt; if you have additional .java files, submit them (along with Trie.java) in a compressed folder; however, ' also note that for this assignment, I'd prefer that the entire implementation is written in Trie.java. Evaluation Your submissionwll be scored a 50 point scale. The point breakdown by category is as follows . Overall Approach and Code Readability 10 points: This category will be scored based on the overall structure and approach of your implementation. For exam- ple, did you make good decisions as to what should be the private data for your class? How readable is your code? Did you include useful comments? Can I understand (with not too much effort) how you decided to approach your implementation? Method Implementation [20 points]: For this category, I will pick two of meth- ods required in the assignment API (in this assignment, two out of either addWord), wordFrequency, or kthWord) and do a more detailed evaluation of the imple tation based on correctness, efficiency, and code readability .Empirical Testing [20 point; 10 points public; 10 points private]: For this category, you will receive a grade based on tests I'll run on your code. Half of the points will come from public tests, which are included TestTrie.java. The other ten points will come from private tests, which I won't released prior to the submission deadline therefore, I highly encourage you to write your own tests in anticipation of what my tests may check for) Comments and Tips Since a trie is comprised of nodes, it would be a good idea to make a separate node class (ideally an inner class) that your trie uses to maintain its structure. Think carefully about what private data should belong to a single node, and what methods it will need to support. Implementing kthWord ) will likely be the most difficult part of the assignment. I'll make the following comments about this method: Implementing this method will be much easier (and likely much more efficient) if you keep your trie organized. For example, other than satisfying the properties of a trie, the trie in figure1 has almost no organization (e.g., what do you think is a better way to arrange the nodes?) -In terms of the actual algorithm, think in terms of the traversals we discussed in class for binary trees (e.g., preorder, inorder). In particular, writing a recursive helper method will likely be the best approach. Your algorithm should only work with the trie itself; therefore, do not store a list a unique list of words from the file, sort it, and then look up the kth word in the list (a trie is a compression scheme, and so doing this defeats the entire purpose of the trie). You will receive little to no credit if this is how you implement this method