Question: Background and Introduction: Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography. In this programming assignment, you will

Background and Introduction:

Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography. In this programming assignment, you will read a text file and discover what words appear in that file and how many times each word appears.

In this assignment, you will create a dictionary (Map), inserting each word (String) that appears in the input text file into the dictionary. The dictionary will be implemented as a Map using classes from the JCF (TreeMap and HashMap). You will then determine which words appeared in the file most frequently. You will run the code on 2 input files of different sizes and different content to see which data structure performs better in practice.

Learning Objectives:

  • Increase your familiarity with the Map ADT.
  • Conduct experiments to compare the performance of data structures and their implementations.
  • Understand how different inputs can affect performance.

Tasks:

Write a program named WordCount.java to perform the tasks described below. (You may create additional classes to help perform this task).

1) The program should read a text file and build a List of words in the file. (This first part of your program is not timed.)

(The next steps in your program should be timed. Repeat the next steps 10 times and capture the best

time. Review the notes about how to capture the run time of an algorithm in the List lecture from week 3.) 2) Place the words from the word List into a TreeMap of frequency counts for the words. 3) Your program should then extract the top 'N' most frequent words and the frequency counts for those

words.

Next, repeat the above process using a HashMap instead of a TreeMap.

4) Your program must display the total number of words in the file, the top 'N' words and the count for each of those words, and the elapsed times for the TreeMap and HashMap implementations.

NOTE: You will experiment with 2 different text files to see how the performance of TreeMap and HashMap differ. As an example of a large piece of text, The Adventures of Sherlock Holmes is provided in the project archive (ASH.txt). This file was obtained from Project Gutenburg, whose website is http://www.gutenberg.org/. You must also run your test on the largest source code file in your project.

Make the program flexible.

  1. It should be easy for a user to select any text file to process
  2. It should be easy for a user to change the N for the size of the top N list

Your program might prompt for the user to enter these choices OR Your program might accept these choices as command line parameters OR

What to turn in:

Your summary must include the top 10 most frequent words in the ASH.txt file and the count for each of those words. Also include a description of how you extracted the top 10. (What was your algorithm? Which data structure(s) did you use?). The algorithm may use any classes from the JCF (not just HashMap and/or TreeMap).

Your summary must indicate the elapsed time that it took for TreeMap and for HashMap. Do NOT include the time for file IO in your empirical tests. (File IO is MUCH slower than any operations in memory.) Repeat the tests for both of your input files. Indicate the elapsed time when using TreeMap and when using HashMap. Finally, indicate any conclusions that you have reached about the relative performance of these two implementations of the Map ADT.

Sample display of empirical timing tests:

Using ASH.txt and using HashMap Best time for 10 runs:

Using ASH.txt and using TreeMap Best time for 10 runs:

XXXms XXXms

(Now repeat the same information for your tests on a java source code file.) __________________________

NOTE: You must design your word count program carefully so that it does not undercount the frequency of words. For instance the following Strings should all be counted as the same word: (ignore case and punctuation)

hello Hello "hello" hello, hello?

Can you think of other situations which might produce an undercount? If you try to handle EVERY possible condition you will have a difficult time. Perhaps as part of your development process you could display the contents of your Map and scan it for obvious issues such as those shown above. I do NOT expect you to handle every case, but I do expect you to handle the most common cases.

I suggest that you build several small test input files to check the correct functioning of your program.

If you would like to teach yourself something very useful and powerful to help with "String matching", then explore the use of regular expressions. (Use of regular expressions is NOT required in this project.) The textbook we are using includes an appendix about regular expressions.

Additional links to information about regular expressions in Java:

http://docs.oracle.com/javase/tutorial/essential/regex/ http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!