Background and Introduction Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography In this programming assignment, you will read a text file and discover what words appear in that file and how many times each word appears In this assignment, you will create a dictionary (Map), inserting each word (String) that appears in the input text file into the dictionary The dictionary will be implemented as a Map using classes from the JCF (TreeMap and HashMap) You will then determine which words appeared in the file most frequently You will run the code on 2 input files of different sizes and different content to see which data structure performs better in practice Learning Objectives Increase your familiarity with the Map ADT Conduct experiments to compare the performance of data structures and their implementations Understand how different inputs can affect performance Tasks Write a program named WordCount java to perform the tasks described below (You may create additional classes to help perform this task) 1) The program should read a text file and build a List of words in the file (This first part of your program is not timed ) (The next steps in your program should be timed Repeat the next steps 10 times and capture the best time Review the notes about how to capture the run time of an algorithm in the List lecture from week 3 ) 2) Place the words from the word List into a TreeMap of frequency counts for the words 3) Your program should then extract the top 'N' most frequent words and the frequency counts for those words Next, repeat the above process using a HashMap instead of a TreeMap 4) Your program must display the total number of words in the file, the top 'N' words and the count for each of those words, and the elapsed times for the TreeMap and HashMap implementations NOTE You will experiment with 2 different text files to see how the performance of TreeMap and HashMap differ As an example of a large piece of text, The Adventures of Sherlock Holmes is provided in the project archive (ASH txt) This file was obtained from Project Gutenburg, whose website is http www gutenberg org You must also run your test on the largest source code file in your project Make the program flexible It should be easy for a user to select any text file to process It should be easy for a user to change the N for the size of the top N list Your program might prompt for the user to enter these choices OR Your program might accept these choices as command line parameters OR What to turn in Your summary must include the top 10 most frequent words in the ASH txt file and the count for each of those words Also include a description of how you extracted the top 10 (What was your algorithm Which data structure(s) did you use ) The algorithm may use any classes from the JCF (not just HashMap and or TreeMap) Your summary must indicate the elapsed time that it took for TreeMap and for HashMap Do NOT include the time for file IO in your empirical tests (File IO is MUCH slower than any operations in memory ) Repeat the tests for both of your input files Indicate the elapsed time when using TreeMap and when using HashMap Finally, indicate any conclusions that you have reached about the relative performance of these two implementations of the Map ADT Sample display of empirical timing tests Using ASH txt and using HashMap Best time for 10 runs Using ASH txt and using TreeMap Best time for 10 runs XXXms XXXms ( Now repeat the same information for your tests on a java source code file ) NOTE You must design your word count program carefully so that it does not undercount the frequency of words For instance the following Strings should all be counted as the same word (ignore case and punctuation) hello Hello hello hello, hello Can you think of other situations which might produce an undercount If you try to handle EVERY possible condition you will have a difficult time Perhaps as part of your development process you could display the contents of your Map and scan it for obvious issues such as those shown above I do NOT expect you to handle every case, but I do expect you to handle the most common cases I suggest that you build several small test input files to check the correct functioning of your program If you would like to teach yourself something very useful and powerful to help with String matching , then explore the use of regular expressions (Use of regular expressions is NOT required in this project ) The textbook we are using includes an appendix about regular expressions Additional links to information about regular expressions in Java http docs oracle com javase tutorial essential regex http www vogella com tutorials JavaRegularExpressions article html

The Answer is in the image, click to view ...

Question: Background and Introduction: Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography. In this programming assignment, you will

Background and Introduction:

Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography. In this programming assignment, you will read a text file and discover what words appear in that file and how many times each word appears.

In this assignment, you will create a dictionary (Map), inserting each word (String) that appears in the input text file into the dictionary. The dictionary will be implemented as a Map using classes from the JCF (TreeMap and HashMap). You will then determine which words appeared in the file most frequently. You will run the code on 2 input files of different sizes and different content to see which data structure performs better in practice.

Learning Objectives:

Increase your familiarity with the Map ADT.
Conduct experiments to compare the performance of data structures and their implementations.
Understand how different inputs can affect performance.

Tasks:

Write a program named WordCount.java to perform the tasks described below. (You may create additional classes to help perform this task).

1) The program should read a text file and build a List of words in the file. (This first part of your program is not timed.)

(The next steps in your program should be timed. Repeat the next steps 10 times and capture the best

time. Review the notes about how to capture the run time of an algorithm in the List lecture from week 3.) 2) Place the words from the word List into a TreeMap of frequency counts for the words. 3) Your program should then extract the top 'N' most frequent words and the frequency counts for those

words.

Next, repeat the above process using a HashMap instead of a TreeMap.

4) Your program must display the total number of words in the file, the top 'N' words and the count for each of those words, and the elapsed times for the TreeMap and HashMap implementations.

NOTE: You will experiment with 2 different text files to see how the performance of TreeMap and HashMap differ. As an example of a large piece of text, The Adventures of Sherlock Holmes is provided in the project archive (ASH.txt). This file was obtained from Project Gutenburg, whose website is http://www.gutenberg.org/. You must also run your test on the largest source code file in your project.

Make the program flexible.

It should be easy for a user to select any text file to process
It should be easy for a user to change the N for the size of the top N list

Your program might prompt for the user to enter these choices OR Your program might accept these choices as command line parameters OR

What to turn in:

Your summary must include the top 10 most frequent words in the ASH.txt file and the count for each of those words. Also include a description of how you extracted the top 10. (What was your algorithm? Which data structure(s) did you use?). The algorithm may use any classes from the JCF (not just HashMap and/or TreeMap).

Your summary must indicate the elapsed time that it took for TreeMap and for HashMap. Do NOT include the time for file IO in your empirical tests. (File IO is MUCH slower than any operations in memory.) Repeat the tests for both of your input files. Indicate the elapsed time when using TreeMap and when using HashMap. Finally, indicate any conclusions that you have reached about the relative performance of these two implementations of the Map ADT.

Sample display of empirical timing tests:

Using ASH.txt and using HashMap Best time for 10 runs:

Using ASH.txt and using TreeMap Best time for 10 runs:

XXXms XXXms

(Now repeat the same information for your tests on a java source code file.) __________________________

NOTE: You must design your word count program carefully so that it does not undercount the frequency of words. For instance the following Strings should all be counted as the same word: (ignore case and punctuation)

hello Hello "hello" hello, hello?

Can you think of other situations which might produce an undercount? If you try to handle EVERY possible condition you will have a difficult time. Perhaps as part of your development process you could display the contents of your Map and scan it for obvious issues such as those shown above. I do NOT expect you to handle every case, but I do expect you to handle the most common cases.

I suggest that you build several small test input files to check the correct functioning of your program.

If you would like to teach yourself something very useful and powerful to help with "String matching", then explore the use of regular expressions. (Use of regular expressions is NOT required in this project.) The textbook we are using includes an appendix about regular expressions.

Additional links to information about regular expressions in Java:

http://docs.oracle.com/javase/tutorial/essential/regex/ http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

I have to create a program in C and I can't figure it out. The program has to read a source file. Please help. /******************************************************************** PROJECT: Glossary...

Here is the text: Modern and forwarding-thinking Ontario Tech University advances the discovery and application of knowledge to accelerate economic growth regional development and social innovation...

Python and most Python libraries are free to download or use, though many users use Python through a paid service. Paid services help IT organizations manage the risks associated with the use of...

Center for Learning and Technology COURSE SYLLABUS PRINCIPLES OF STATISTICS STA-201-GS Course Syllabus PRINCIPLES OF STATISTICS STA-201-GS Thomas Edison State College May 2015 Course Essentials...

Chapter 2 User-Centered Systems Design: A Brief History Abstract The intention of this book is to help you think about design from a user-centered perspective. Our aim is to help you understand what...

Please help with this. Text below BCS 101 - Programming Concepts and Problem Solving This course will provide an introduction to programming logic and problem solving techniques using different...

i want complete solution for my assignment and it should be without plagiarism COIT20274: Information Systems for Business Professionals, Term One 2016 Assignments 1 & 2 Requirements Assignment 1 -...

Content analysis is a widely used qualitative research technique. Rather than being a single method, current applications of content analysis show three distinct approaches: conventional, directed,...

plz answer all questions I will rate fore sure. ty Question 2. (18 points) Escherichia coli is a mesophilic facultative anaerobic bacterium that is normally found in the human gut. E. coli can also...

Let A and P be square matrices, with P invertible. Show that det (PAP -1 ) = detA.

Which of the following statements is most correct? If you add enough randomly selected stocks to a portfolio, you can completely eliminate all the market risk from the portfolio. If you formed a...

2. 2016 income statement: Income Statement Sales $138,000 Costs 71,500 Other Expenses 4,100 Depreciation 10,100 EBIT $ ? Interest 7,900 Taxable income $? Taxes 17,760 Net income $? The firm issued...

points to allocate between the following aspects of a job: Physical working conditions Hours of work Pay Relationships with your boss Relationships with colleagues Opportunity for promotion Benefi ts...

You have

Look up cases involving red- circling, which relates to job evaluation and equal pay claims. Examine the legal implications of the red- circling that took place in the Meteor case study.