Question

Java Project: Textual Analysis of books

You are given a large file of text (e.g., books from Project Gutenberg). You need to read in the file and perform an analysis of the text. Program details:

1. Letter Frequency: Consider all letters a-z. For the given book, count the number of occurrences of each letter. Print out the top-10 most frequent letters along with the frequency in the book sorted by letter. Note 'a' and 'A' do not count as the same letters.

2. Word Frequency: Consider all words. For the given book, count the number of occurrences of each word. Print out the top-10 most frequent words along with the frequency in the book sorted by word.

3. Word Frequency with Stop List: A stop list is provided (stop-list.txt). This contains the 573 most common words in the English language. This includes words like "the" and "and", which usually don't add any semantic meaning to the text. Consider all words in the book that are not included in the stop list. For the given book, count the number of occurrences of each word. Print out the top-10 most frequent words along with the frequency in the book excluding words in the stop list. (Random Aside: Word frequency with a stop list can give you a surprisingly good summary of the book.)
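The three tasks above can be sketched with a HashMap as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: the class name `FrequencySketch` and its method names are hypothetical, a "word" is assumed to be a lower-cased run of letters and apostrophes, and "top-10 sorted by letter/word" is read as "pick the 10 most frequent, then order those 10 alphabetically".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; names are illustrative only.
class FrequencySketch {

    // Task 1: count each letter. Upper and lower case are kept as separate
    // keys, following the note in the assignment as written.
    static Map<Character, Integer> countLetters(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Tasks 2 and 3: count words, skipping any word found in stopWords.
    // Pass an empty set to count every word.
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a word is a run of letters and apostrophes, lower-cased.
        for (String raw : text.toLowerCase().split("[^a-z']+")) {
            // Drop quote marks that cling to the start or end of a token.
            String token = raw.replaceAll("^'+|'+$", "");
            if (!token.isEmpty() && !stopWords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pick the n most frequent entries (ties broken by key), then order
    // those n entries by key.
    static <K extends Comparable<K>> List<Map.Entry<K, Integer>> topN(
            Map<K, Integer> counts, int n) {
        List<Map.Entry<K, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<K, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<K, Integer>comparingByKey()));
        List<Map.Entry<K, Integer>> top =
                new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
        top.sort(Map.Entry.<K, Integer>comparingByKey());
        return top;
    }
}
```

Calling `FrequencySketch.topN(FrequencySketch.countWords(bookText, stopWords), 10)` would cover task 3, where `bookText` and `stopWords` are placeholders for the loaded book and stop list; passing an empty set gives task 2.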

You will do 2 different implementations of this program and compare the performance of each.

Implementation 1 stores the words or letters and their frequencies in a HashMap.

Implementation 2 stores the words or letters and their frequencies in an object defined by you (think of the FriendCount class from recitation). The complete set of objects is stored in an ArrayList.

After implementing both, you will compare their performance.
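Implementation 2 might look roughly like the sketch below. The FriendCount class from recitation is not shown here, so `WordCount` and `ListCounter` are hypothetical stand-ins, and the linear search in `add` is one simple way to find an existing entry, not necessarily the intended one. Letters can be stored the same way as one-character strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the FriendCount-style class mentioned above:
// one object per distinct word, holding the word and its running count.
class WordCount {
    final String word;
    int count;

    WordCount(String word) {
        this.word = word;
        this.count = 1; // created on first sighting
    }
}

// Hypothetical wrapper that keeps the complete set of objects in an ArrayList.
class ListCounter {
    private final List<WordCount> entries = new ArrayList<>();

    // Linear search: bump the count if the word is already stored,
    // otherwise append a new entry.
    void add(String word) {
        for (WordCount entry : entries) {
            if (entry.word.equals(word)) {
                entry.count++;
                return;
            }
        }
        entries.add(new WordCount(word));
    }

    // Returns the stored count, or 0 if the word was never added.
    int countOf(String word) {
        for (WordCount entry : entries) {
            if (entry.word.equals(word)) {
                return entry.count;
            }
        }
        return 0;
    }

    // Number of distinct words stored.
    int size() {
        return entries.size();
    }
}
```

Each `add` scans the list, so its cost grows with the number of distinct words already stored, whereas a HashMap lookup takes roughly constant time on average. That gap is what the timing comparison is expected to expose.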

9 books from Project Gutenberg are provided (alice-in-wonderland.txt, huck-finn.txt, les-mis.txt, metamorphosis.txt, my-man-jeeves.txt, pride-prejudice.txt, tale-of-two-cities.txt, tom-sawyer.txt). Run your text analysis on these books. Feel free to download other books if you like.

Using the les-mis text, report in the table below how long in milliseconds it takes to add all of the letters or words to the 2 different collections. Use System.currentTimeMillis() to get the time in milliseconds.

Report your results using the table below:

STOP List:

a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently d definitely described despite did didn't different do does doesn't doing don't done down downwards during e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except f far few fifth first five followed following follows for former formerly forth four from further furthermore g get gets getting given gives go goes going gone got gotten greetings h had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself j just k keep keeps kept know knows known l last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere o obviously of off often oh 
ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own p particular particularly per perhaps placed please plus possible presumably probably provides q que quite qv r rather rd re really reasonably regarding regardless regards relatively respectively right s said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two u un under unfortunately unless unlikely until unto up upon us use used useful uses using usually uucp v value various very via viz vs w want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would would wouldn't x y yes yet you you'd you'll you're you've your yours yourself yourselves z zero gutenbergtm project gutenberg
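A stop list like the one above can be loaded into a HashSet so that each membership check is fast. The sketch below assumes stop-list.txt holds the words separated by whitespace (spaces or newlines); the actual file layout is not shown here, and the class name `StopList` is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for reading the stop list.
class StopList {

    // Split the raw contents on any whitespace and collect the words.
    // A Set drops repeated entries automatically.
    static Set<String> parse(String contents) {
        Set<String> words = new HashSet<>();
        for (String w : contents.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase());
            }
        }
        return words;
    }

    // Read the whole file (assumed UTF-8) and parse it.
    static Set<String> load(String path) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return parse(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

`StopList.load("stop-list.txt")` would then give the set to pass to the word counter, assuming the file sits in the working directory.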

|                                  | Arrays (Version A): time in milliseconds | HashMap (Version B): time in milliseconds |
|----------------------------------|------------------------------------------|-------------------------------------------|
| Adding all letters to collection |                                          |                                           |
| Adding all words to collection   |                                          |                                           |
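The measurements for the table can be taken by reading System.currentTimeMillis() before and after the loop that fills a collection. The sketch below is illustrative only: `TimingSketch`, `timeMillis`, and `timeHashMapInsert` are hypothetical names, and the words are assumed to have been read and split into an array already, so that file reading is not part of the measured time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical timing helper.
class TimingSketch {

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    // Times how long it takes to add every word to a HashMap of counts.
    static long timeHashMapInsert(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        return timeMillis(() -> {
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }
        });
    }
}
```

The same `timeMillis` wrapper can be placed around the loop that fills the ArrayList-based collection, which gives the second column of the table. A single run can be noisy, so repeating the measurement a few times is a reasonable precaution.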
