
Question

1 Approved Answer


For this project, you will be introduced to the DataCounter interface. Note that stemming rules such as removing 's' from the end of a word can lead to erroneous results (such as "bu" from "bus") and require special logic. Even our simple rules cause problems; for instance, "it's" and "its" are now the same word. Implementing a better stemming algorithm (like Porter Stemming) is above-and-beyond work.

Signature Generation

A fundamental part of your work lies in computing the "signature" of a document. The professor has provided you with a sample WordCount program that reads in a document and counts the number of times that a stemmed word appears, assuming that the document's words are already stemmed. The output of this program looks like this:

970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...

where the number in column 1 is the frequency with which the corresponding string in column 2 occurs in the text. Note that the WordCount program sorts its output primarily in decreasing order by frequency count and secondarily in alphabetical order. The ordering would be identical if it had sorted by frequency fraction first (i.e., frequency_count/num_total_words).

Document Correlation

Document correlation is a reasonably large area of study. Perhaps its most visible application is in search engines, which rank the correlation of web pages to a set of keywords that you provide. One model often used for correlating documents is Latent Semantic Indexing (LSI), where a collection of documents is considered together (rather than independently) and a word's usefulness is determined by how frequently it appears in the documents (for instance, "the" isn't very useful because it appears in most documents). We will not be doing LSI. We will do a simpler document comparison:

1. Calculate word counts for the two documents and normalize the frequencies so that they can be meaningfully compared between different documents (hint: use frequency percentages or fractions).
2. As in LSI, remove words whose relative frequencies are too high or too low to be useful to your study. A good starting point is to remove words with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%), but feel free to play around with these numbers. You should, however, report results with these frequencies so we can compare outputs with a standard set of numbers.
3. For every word that occurs in both documents, take the difference between the normalized frequencies, square that difference, and add the result to a running sum. The final value of this running sum is your difference metric. This metric corresponds to the square of the Euclidean distance between the two documents' frequency vectors in the space of their shared words. Note that this metric assumes that words not appearing in both documents do not affect the correlation.

Partners

You are encouraged (although not required) to work with a partner of your own choosing for this project. No more than two students total can be on a team together.

Requirements

There are five steps in this project:

1. Write two DataCounter dictionary implementations (AVL, Hash).
2. Modify WordCount to be able to use your DataCounter implementations and to select the implementation at runtime.
3. Write a document correlator that will print a difference score between two documents.
4. Benchmark your data structures and correlator.
5. Analyze and write up the results of your performance tests.

The analysis and writeup will be significantly longer in this project. Be sure to allocate time for it.
It is worth 1/3 of your grade, and you will not be able to do it in an hour or two.

DataCounter Implementations

For this assignment, you must implement two data structures (you may reuse previous implementations):

AVL tree: Your implementation MUST extend the BinarySearchTree included in the zip files, regardless of whether you choose to modify the existing code for BinarySearchTree.

Hash table: You must implement your own hash code for the hash table. Do not use String.hashCode().

Both of these data structures must implement the DataCounter interface, which is a specialized version of DictionaryADT. You do not need to implement remove in either of these structures. You can implement any of the hash tables discussed in class; the only restriction is that it should not restrict the size of the input domain or the number of inputs (i.e., your hash table must grow).

WordCount

The WordCount program will read in a text file and tally up all the words that appear in it. The WordCount program given to you currently uses an unbalanced binary search tree as its backing DataCounter implementation and contains an insertion sort. You may base your WordCount program on it, or write your own. You need to add additional DataCounter implementations. The command-line form for WordCount will be as follows:

java WordCount [ -b | -a | -h ] [ -frequency | -num_unique ]

-b           Use an Unbalanced BST to implement the DataCounter
-a           Use an AVL Tree
-h           Use a Hashtable
-frequency   Print all the word/frequency pairs, ordered by frequency, and then by the words in lexicographic order
-num_unique  Print the number of unique words in the document. This is the total number of distinct (different) words in the document. Words that appear more than once are only counted as a single word for this statistic.

It is fine to require that one of -b, -a, or -h must be specified for your program to run. Your program should not crash, however, if given an invalid command line. Note that for the -frequency option, you need to produce words ordered primarily by frequency and secondarily by lexicographic (i.e., alphabetical) order. For example:

43 these
42 upon
42 your
41 after
41 into
40 said
39 many
39 more
38 an

The sample WordCount program does this sorting using insertion sort... More on sorting projects next time!

Document Correlator

The Document Correlator should take in two documents and return the correlation (difference metric calculation) between them. You may want to use the WordCount class given to you to implement the backend of the Correlator, though doing so is not necessary. For the basic requirements, you must design an algorithm that does the comparison specified in the Document Correlation section above. This program should also take command-line flags to specify which backing data structure to use. The command-line structure should be:

Usage: java Correlator [ -b | -a | -h ]

-b  Use an Unbalanced BST in the backend
-a  Use an AVL Tree in the backend
-h  Use a Hashtable in the backend

Benchmarks

Since we are implementing two DataCounter dictionaries in this project, it is natural to ask which is faster. Benchmarking and profiling are really the only reliable ways to judge this, since there are many hidden assumptions in the way you write your code that will add unexpected constants to your program.
Hopefully you will do some exploration in this assignment and prove to yourself that you really can't predict what will affect program runtime very much (go through and try to optimize away little things like how many assignments you do, how many if statements you execute, etc., and see how much or how little this affects your program). When generating (or reading) benchmarks, you must ask yourself the following questions:

- What am I measuring? "Speed" is too vague. Does it mean full program runtime? What if my program waits for user input? Does it matter?
- Why am I measuring this, and why should anyone be interested in it? Full program runtime of an interactive user app where the users fall asleep while running the code isn't really interesting data.
- What methodology will I use to measure my program? Does it actually measure what I want?
- What are the sources of error? Is the error big enough to matter? Are my results still reliable?

This set of questions actually applies to any analysis. You are required to design benchmarks that measure the attributes listed below. You may also include any other data that you feel is necessary to draw conclusions from the benchmarks in your analysis.

- How fast is java WordCount -frequency compared to count.sh and count.pl?
- How much difference in speed do your different DataCounters make in the correlator and/or WordCount?

There are a few tools available to you for benchmarking. The simplest ones are:

- The Unix time command.
- System.nanoTime() or System.currentTimeMillis() (record the time at two different places in your program and subtract to get running time).

Both methods have their strengths and weaknesses (for instance, time must measure your process creation time and JVM startup time). These strengths and weaknesses will exhibit themselves differently depending on where and how these tools are used. In your analysis, you will need to cite the known sources of error in your benchmarks and justify why they don't matter for your measurements, or somehow create a correction for your measurement. Essentially, you must convince us that your benchmark is measuring something that makes sense and that your analysis can be based on the collected data.

For example, to time count.sh or count.pl, you can do the following:

time ./count.sh your-file

The syntax is the same for count.pl. Use the "user" time value that time returns.
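As a concrete illustration of the second tool, here is a minimal timing sketch using System.nanoTime(). The class name TimingSketch and the placeholder runWordCount method are assumptions for illustration only; they are not part of the provided code.

/** Illustrative sketch: timing a block of code with System.nanoTime(). */
public class TimingSketch {
    public static void main(String[] args) {
        long start = System.nanoTime();      // record time before the work
        runWordCount("your-file");           // the code being measured
        long elapsedNs = System.nanoTime() - start;
        System.out.println("elapsed ms: " + elapsedNs / 1e6);
    }

    // Placeholder for whatever you want to measure (e.g., building the
    // DataCounter and sorting its counts).
    private static void runWordCount(String file) {
    }
}

Note that a single in-JVM measurement still includes JIT warm-up; repeating the run several times and reporting only the later iterations is one common correction, and it is exactly the kind of error source your analysis should discuss. The provided starter code follows.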

import java.io.IOException;

/**
 * An executable that counts the words in a file and prints out the counts in
 * descending order. You will need to modify this file.
 */
public class WordCount {

    private static void countWords(String file) {
        DataCounter<String> counter = new BinarySearchTree<String>();

        try {
            FileWordReader reader = new FileWordReader(file);
            String word = reader.nextWord();
            while (word != null) {
                counter.incCount(word);
                word = reader.nextWord();
            }
        } catch (IOException e) {
            System.err.println("Error processing " + file + e);
            System.exit(1);
        }

        DataCount<String>[] counts = counter.getCounts();
        sortByDescendingCount(counts);
        for (DataCount<String> c : counts)
            System.out.println(c.count + " \t" + c.data);
    }

    /**
     * TODO Replace this comment with your own.
     *
     * Sort the count array in descending order of count. If two elements have
     * the same count, they should be in alphabetical order (for Strings, that
     * is. In general, use the compareTo method for the DataCount.data field).
     *
     * This code uses insertion sort. You should modify it to use a different
     * sorting algorithm. NOTE: the current code assumes the array starts in
     * alphabetical order! You'll need to make your code deal with unsorted
     * arrays.
     *
     * The generic parameter syntax here is new, but it just defines E as a
     * generic parameter for this method, and constrains E to be Comparable.
     * You shouldn't have to change it.
     *
     * @param counts array to be sorted.
     */
    private static <E extends Comparable<? super E>> void sortByDescendingCount(
            DataCount<E>[] counts) {
        for (int i = 1; i < counts.length; i++) {
            DataCount<E> x = counts[i];
            int j;
            for (j = i - 1; j >= 0; j--) {
                if (counts[j].count >= x.count) {
                    break;
                }
                counts[j + 1] = counts[j];
            }
            counts[j + 1] = x;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: filename of document to analyze");
            System.exit(1);
        }
        countWords(args[0]);
    }
}
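The main above only accepts a filename; the assignment's command line also selects the backing DataCounter and the output mode. Purely as an illustrative sketch (not part of the starter code), one way it could be reworked is shown below. It assumes the filename is the last argument, and AVLTree stands for the AVL subclass you will write, under whatever name you choose.

    // Illustrative sketch only: dispatch on the required flags.
    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Usage: java WordCount [ -b | -a | -h ]"
                    + " [ -frequency | -num_unique ] <filename>");
            System.exit(1);
        }

        DataCounter<String> counter;
        if (args[0].equals("-b")) {
            counter = new BinarySearchTree<String>();
        } else if (args[0].equals("-h")) {
            counter = new HashTable();
        } else if (args[0].equals("-a")) {
            counter = new AVLTree<String>(); // the AVL subclass you will write
        } else {
            System.err.println("Unknown structure flag: " + args[0]);
            System.exit(1);
            return; // unreachable, but keeps counter definitely assigned
        }

        // countWords would then use the chosen counter and, depending on
        // whether args[1] is -frequency or -num_unique, either print the
        // sorted word/frequency pairs or just counter.getSize().
    }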

------------------------------------------------------------------

/**
 * An example of how to test your binary search tree. You should use this as
 * inspiration for your unit tests.
 */
public class TestBinarySearchTree {
    public static void main(String[] args) {
        boolean dumpall = false, notest = false;

        if (args.length > 1
                || (args.length == 1 && args[0].compareTo("dump") != 0 && args[0]
                        .compareTo("notest") != 0)) {
            System.err.println("Arguments: [dump] to dump all output");
            System.err.println("           [notest] to skip tests");
            System.err.println("No arguments just tests output");
            return;
        }

        if (args.length == 1) {
            dumpall = true;
            if (args[0].compareTo("notest") == 0)
                notest = true;
        }

        String[][] tests = {
                { "g", "h", "a", "b", "a", "h", "j", "e", "e", "f" },
                { "5", "3", "1", "2", "7", "6", "0", "8", "9", "4", "3", "5", "0", "9" },
                {} };

String[][] expected = { { "([g,1] . .)", "([g,1] . ([h,1] . .))", "([g,1] ([a,1] . .) ([h,1] . .))", "([g,1] ([a,1] . ([b,1] . .)) ([h,1] . .))", "([g,1] ([a,2] . ([b,1] . .)) ([h,1] . .))", "([g,1] ([a,2] . ([b,1] . .)) ([h,2] . .))", "([g,1] ([a,2] . ([b,1] . .)) ([h,2] . ([j,1] . .)))", "([g,1] ([a,2] . ([b,1] . ([e,1] . .))) ([h,2] . ([j,1] . .)))", "([g,1] ([a,2] . ([b,1] . ([e,2] . .))) ([h,2] . ([j,1] . .)))", "([g,1] ([a,2] . ([b,1] . ([e,2] . ([f,1] . .)))) ([h,2] . ([j,1] . .)))", "a,2 b,1 e,2 f,1 g,1 h,2 j,1 " }, { "([5,1] . .)", "([5,1] ([3,1] . .) .)", "([5,1] ([3,1] ([1,1] . .) .) .)", "([5,1] ([3,1] ([1,1] . ([2,1] . .)) .) .)", "([5,1] ([3,1] ([1,1] . ([2,1] . .)) .) ([7,1] . .))", "([5,1] ([3,1] ([1,1] . ([2,1] . .)) .) ([7,1] ([6,1] . .) .))", "([5,1] ([3,1] ([1,1] ([0,1] . .) ([2,1] . .)) .) ([7,1] ([6,1] . .) .))", "([5,1] ([3,1] ([1,1] ([0,1] . .) ([2,1] . .)) .) ([7,1] ([6,1] . .) ([8,1] . .)))", "([5,1] ([3,1] ([1,1] ([0,1] . .) ([2,1] . .)) .) ([7,1] ([6,1] . .) ([8,1] . ([9,1] . .))))", "([5,1] ([3,1] ([1,1] ([0,1] . .) ([2,1] . .)) ([4,1] . .)) ([7,1] ([6,1] . .) ([8,1] . ([9,1] . .))))", "([5,1] ([3,2] ([1,1] ([0,1] . .) ([2,1] . .)) ([4,1] . .)) ([7,1] ([6,1] . .) ([8,1] . ([9,1] . .))))", "([5,2] ([3,2] ([1,1] ([0,1] . .) ([2,1] . .)) ([4,1] . .)) ([7,1] ([6,1] . .) ([8,1] . ([9,1] . .))))", "([5,2] ([3,2] ([1,1] ([0,2] . .) ([2,1] . .)) ([4,1] . .)) ([7,1] ([6,1] . .) ([8,1] . ([9,1] . .))))", "([5,2] ([3,2] ([1,1] ([0,2] . .) ([2,1] . .)) ([4,1] . .)) ([7,1] ([6,1] . .) ([8,1] . ([9,2] . .))))", "0,2 1,1 2,1 3,2 4,1 5,2 6,1 7,1 8,1 9,2 " }, { "", "No Data" } };

        boolean error = false;
        for (int i = 0; i < tests.length; i++) {
            BinarySearchTree<String> tree = new BinarySearchTree<String>();
            for (int j = 0; j < tests[i].length; j++) {
                tree.incCount(tests[i][j]);
                String out = tree.dump();
                if (notest || out.compareTo(expected[i][j]) != 0)
                    error = true;
                if (dumpall)
                    System.out.println(out);
            }
            if (tests[i].length < 1) {
                String out = tree.dump();
                if (notest || out.compareTo(expected[i][0]) != 0)
                    error = true;
                if (dumpall)
                    System.out.println(out);
            }

            DataCount<String>[] cnt = tree.getCounts();
            String out = "";
            if (cnt != null && cnt.length > 0)
                for (int j = 0; j < cnt.length; j++)
                    out += cnt[j].data + "," + cnt[j].count + " ";
            else
                out = "No Data";
            if (notest || out.compareTo(expected[i][expected[i].length - 1]) != 0)
                error = true;
            if (dumpall)
                System.out.println(out);
        }

        if (!notest) {
            if (error)
                System.out.println("Test failed!");
            else
                System.out.println("Test passed.");
        }
    }
}
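The dump-based checks above depend on the exact shape of the BST, which will not carry over to a hash table whose getCounts() order is unspecified. Purely as an illustrative sketch (not part of the starter code, and assuming the HashTable stub shown below has been completed), the same word list can be checked order-insensitively by sorting the "data,count" pairs before comparing:

import java.util.Arrays;

/** Illustrative sketch: an order-insensitive check for a hash-backed DataCounter. */
public class TestHashTableSketch {
    public static void main(String[] args) {
        String[] words = { "g", "h", "a", "b", "a", "h", "j", "e", "e", "f" };
        // Same multiset as the first BST test above, written in sorted order.
        String expected = "a,2 b,1 e,2 f,1 g,1 h,2 j,1 ";

        DataCounter<String> counter = new HashTable();
        for (String w : words)
            counter.incCount(w);

        DataCount<String>[] counts = counter.getCounts();
        String[] pairs = new String[counts.length];
        for (int i = 0; i < counts.length; i++)
            pairs[i] = counts[i].data + "," + counts[i].count;
        Arrays.sort(pairs); // normalize order before comparing

        StringBuilder out = new StringBuilder();
        for (String p : pairs)
            out.append(p).append(" ");

        System.out.println(out.toString().equals(expected)
                ? "Test passed." : "Test failed!");
    }
}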

------------------------------------------------------------

/**
 * TODO Replace this comment with your own.
 *
 * Stub code for an implementation of a DataCounter that uses a hash table as
 * its backing data structure. We included this stub so that it's very clear
 * that HashTable works only with Strings, whereas the DataCounter interface
 * is generic. You need the String contents to write your hash code.
 */
public class HashTable implements DataCounter<String> {

    /** {@inheritDoc} */
    public DataCount<String>[] getCounts() {
        // TODO Auto-generated method stub
        return null;
    }

    /** {@inheritDoc} */
    public int getSize() {
        // TODO Auto-generated method stub
        return 0;
    }

    /** {@inheritDoc} */
    public void incCount(String data) {
        // TODO Auto-generated method stub
    }

}
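The stub leaves the hashing strategy entirely to you, and String.hashCode() is off-limits. Purely as an illustration of one possibility (the multiplier 37, the method name, and the tableSize parameter are assumptions for this sketch, not requirements), a hand-rolled polynomial hash over the characters that you might add to HashTable could look like this:

    // Illustrative sketch of a hand-rolled hash function (String.hashCode()
    // is not used). Multiplier 37 and the tableSize parameter are arbitrary
    // choices for this example.
    private int hash(String data, int tableSize) {
        int h = 0;
        for (int i = 0; i < data.length(); i++) {
            h = 37 * h + data.charAt(i);     // polynomial accumulation over characters
        }
        return (h & 0x7fffffff) % tableSize; // clear the sign bit, then reduce to a table index
    }

incCount(String) would then chain or probe at that index and, because the spec says the table must grow, a typical approach is to enlarge the array and rehash every entry once the load factor passes some threshold; the exact policy is up to you.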

------------------------------------------------------

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;

/**
 * FileWordReader reads words from a file one-by-one, converting to lowercase
 * and eliminating punctuation. You can read a file by calling nextWord() in a
 * loop until it returns null; see WordCount.countWords for an example.
 */
class FileWordReader {
    StreamTokenizer tok;

    public FileWordReader(String file) throws FileNotFoundException,
            IOException {
        tok = new StreamTokenizer(new BufferedReader(new InputStreamReader(
                new FileInputStream(file))));
        tok.eolIsSignificant(false);
        tok.lowerCaseMode(true);
        tok.wordChars('a', 'z');
        tok.wordChars('A', 'Z');
        String ws = " \t.,!;:-[].,;!?*+-=\\|/(){}\"[]><'";
        for (int i = 0; i < ws.length(); i++) {
            tok.whitespaceChars(ws.charAt(i), ws.charAt(i));
        }
    }

    public String nextWord() throws IOException {
        int ttype = tok.nextToken();
        while (ttype != StreamTokenizer.TT_WORD
                && ttype != StreamTokenizer.TT_EOF)
            ttype = tok.nextToken();
        if (ttype == StreamTokenizer.TT_EOF)
            return null;
        return tok.sval;
    }
}

-----------------------------------------------

/**
 * Simple class to hold a piece of data and its count. The class has package
 * access so that the various implementations of DataCounter can access its
 * contents, but not client code.
 *
 * @param <E> type of data whose count we are recording.
 */
class DataCount<E> {

    /**
     * The data element whose count we are recording.
     */
    E data;

    /**
     * The count for the data element.
     */
    int count;

    /**
     * Create a new data count.
     *
     * @param data the data element whose count we are recording.
     * @param count the count for the data element.
     */
    DataCount(E data, int count) {
        this.data = data;
        this.count = count;
    }
}

/**
 * Interface for a data structure that allows you to count the number of times
 * you see each piece of data.
 *
 * Although you will be using this interface only with Strings, we have tried
 * to "genericize" the code as much as possible. DataCounter counts elements of
 * an unconstrained generic type E, and BinarySearchTree restricts E to
 * Comparable types. HashTable is String-only, because you'll be implementing
 * your own hash code and will need access to the actual String contents.
 *
 * @param <E> The type of data to be counted.
 */
public interface DataCounter<E> {

    /**
     * Increment the count for a particular data element.
     *
     * @param data data element whose count to increment.
     */
    public void incCount(E data);

    /**
     * The number of unique data elements in the structure.
     *
     * @return the number of unique data elements in the structure.
     */
    public int getSize();

    /**
     * Get an array of all of the data counts in the DataCounter structure.
     * The array should contain exactly one DataCount instance for each unique
     * element inserted into the structure. The elements do not need to be in
     * any particular order.
     *
     * @return an array of the data counts.
     */
    public DataCount<E>[] getCounts();

}
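To tie this interface back to the Document Correlation section, the sketch below shows one way the difference metric could be computed from two getCounts() results. It is illustrative only: the class name, method name, and the O(n*m) nested scan are not part of the starter code, and your real Correlator should pair words up through your own backing data structure. The totalWordsA and totalWordsB parameters are each document's total word count, used to normalize frequencies.

/** Illustrative sketch: the difference metric over two word-count arrays. */
public class CorrelationSketch {

    // Thresholds from the assignment: drop words whose normalized frequency
    // is above 1% or below 0.01%.
    private static final double HIGH = 0.01;
    private static final double LOW = 0.0001;

    public static double difference(DataCount<String>[] a, int totalWordsA,
                                    DataCount<String>[] b, int totalWordsB) {
        double sum = 0.0;
        for (DataCount<String> wa : a) {
            double fa = (double) wa.count / totalWordsA; // normalized frequency in document A
            if (fa > HIGH || fa < LOW)
                continue;                                // too common or too rare, skip it
            for (DataCount<String> wb : b) {
                if (wa.data.equals(wb.data)) {
                    double fb = (double) wb.count / totalWordsB;
                    if (fb <= HIGH && fb >= LOW) {       // keep only if it also survives the filter in B
                        double diff = fa - fb;
                        sum += diff * diff;              // squared difference for a shared word
                    }
                    break;                               // each distinct word appears once per array
                }
            }
        }
        return sum; // squared Euclidean distance over the shared words
    }
}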

-----------------------------------------------

/**
 * BinarySearchTree implements the DataCounter interface using a binary search
 * tree to store the data items and counts.
 *
 * @param <E> The type of the data elements. Note that we have strengthened
 *            the constraints on E such that E is now a Comparable.
 */
public class BinarySearchTree<E extends Comparable<? super E>> implements
        DataCounter<E> {

    /**
     * The root of the binary search tree. root is null if and only if the
     * tree is empty.
     */
    protected BSTNode overallRoot;

    /**
     * Number of nodes in the binary search tree.
     */
    protected int size;

    /**
     * Inner (non-static) class to represent a node in the tree. Each node
     * includes a data element and an integer count. The class is protected so
     * that it may be accessed by subclasses of BinarySearchTree.
     */
    protected class BSTNode {

        /**
         * The left child of this node.
         */
        public BSTNode left;

        /**
         * The right child of this node.
         */
        public BSTNode right;

        /**
         * The data element stored at this node.
         */
        public E data;

        /**
         * The count for this data element.
         */
        public int count;

        /**
         * Create a new data node. Also takes care of incrementing the tree
         * size.
         *
         * @param data data element to be stored at this node.
         */
        public BSTNode(E data) {
            this.data = data;
            count = 1;
            left = right = null;
            size++;
        }
    }

    /**
     * Create an empty binary search tree.
     */
    public BinarySearchTree() {
        overallRoot = null;
        size = 0;
    }

    /** {@inheritDoc} */
    public void incCount(E data) {
        if (overallRoot == null) {
            overallRoot = new BSTNode(data);
        } else {
            // traverse the tree
            BSTNode currentNode = overallRoot;
            while (true) {
                // compare the data to be inserted with the data at the
                // current node
                int cmp = data.compareTo(currentNode.data);
                if (cmp == 0) {
                    // current node is a match
                    currentNode.count++;
                    return;
                } else if (cmp < 0) {
                    // new data goes to the left of the current node
                    if (currentNode.left == null) {
                        currentNode.left = new BSTNode(data);
                        return;
                    }
                    currentNode = currentNode.left;
                } else {
                    // new data goes to the right of the current node
                    if (currentNode.right == null) {
                        currentNode.right = new BSTNode(data);
                        return;
                    }
                    currentNode = currentNode.right;
                }
            }
        }
    }

    /** {@inheritDoc} */
    public int getSize() {
        return size;
    }

    /** {@inheritDoc} */
    public DataCount<E>[] getCounts() {
        @SuppressWarnings("unchecked")
        DataCount<E>[] counts = new DataCount[size];
        if (overallRoot != null)
            traverse(overallRoot, counts, 0);
        return counts;
    }

    /**
     * Do an inorder traversal of the tree, filling in an array of DataCount
     * objects with the count of each element. Doing an inorder traversal
     * guarantees that the result will be sorted by element. We fill in some
     * contiguous block of array elements, starting at index, and return the
     * next available index in the array.
     *
     * @param root the root of the subtree to traverse.
     * @param counts The array to populate.
     * @param idx the index at which to start filling in the array.
     * @return the next available index in the array.
     */
    protected int traverse(BSTNode root, DataCount<E>[] counts, int idx) {
        if (root != null) {
            idx = traverse(root.left, counts, idx);
            counts[idx] = new DataCount<E>(root.data, root.count);
            idx = traverse(root.right, counts, idx + 1);
        }
        return idx;
    }

    /**
     * Dump the contents of the tree to a String (provided for debugging and
     * unit testing purposes).
     *
     * @return a textual representation of the tree.
     */
    protected String dump() {
        if (overallRoot != null)
            return dump(overallRoot);
        return "";
    }

    /**
     * Dump the contents of the subtree rooted at this node to a String
     * (provided for debugging purposes).
     *
     * @return a textual representation of the subtree rooted at this node.
     */
    protected String dump(BSTNode root) {
        if (root == null)
            return ".";
        String out = "([" + root.data + "," + root.count + "] ";
        out += dump(root.left);
        out += " ";
        out += dump(root.right);
        out += ")";
        return out;
    }
}
