Question

Implement a program in Java that receives as arguments an input directory and an output directory and that counts the number of unique words found in each file and writes them to the output directory.
The output files must follow the same folder structure as the input files. For example, if the program counts the words found in the input file stored at CleanedDataset1/folder6/document265.txt, it must store the counted words in the file at CountedDataset1/folder6/document265.txt, where CleanedDataset1 was the input directory and CountedDataset1 was the output directory.
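The folder-structure requirement can be illustrated with a minimal sketch that maps an input file to its output location using `java.nio.file.Path`. The class name `OutputPathMapper` and the method `toOutputPath` are hypothetical names chosen for illustration; they are not part of the assignment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: the class and method names are illustrative only.
class OutputPathMapper {

    // Returns the output file that mirrors inputFile's position under inputRoot.
    static Path toOutputPath(Path inputRoot, Path outputRoot, Path inputFile) {
        // Location of the file relative to the input root,
        // e.g. folder6/document265.txt
        Path relative = inputRoot.relativize(inputFile);
        // Re-anchor the same relative location under the output root.
        return outputRoot.resolve(relative);
    }

    public static void main(String[] args) {
        Path inputRoot = Paths.get("CleanedDataset1");
        Path outputRoot = Paths.get("CountedDataset1");
        Path inputFile = Paths.get("CleanedDataset1", "folder6", "document265.txt");
        System.out.println(toOutputPath(inputRoot, outputRoot, inputFile));
    }
}
```

Before writing to the returned path, the parent folders would typically have to be created first, for example with `Files.createDirectories(outputFile.getParent())`.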
In this program, words are sequences of alphanumeric characters (0-9a-zA-Z) separated by a delimiter (space, \t, \n, \r). This program will use the output of the previous program as input.
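One possible way to split text into words is sketched below. It assumes the delimiter set is space, tab, newline and carriage return, and that the input has already been cleaned so that every remaining token is alphanumeric. The class name `WordTokenizer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assumes delimiters are space, \t, \n and \r.
class WordTokenizer {

    // Splits text on runs of delimiters and returns the non-empty tokens in order.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        // "+" treats consecutive delimiters (such as \r\n) as one separator.
        for (String token : text.split("[ \\t\\n\\r]+")) {
            // split() yields an empty first token when the text starts
            // with a delimiter (or is empty), so skip empty tokens.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }
}
```

For large files, reading line by line with a `BufferedReader` and tokenizing each line would avoid holding the whole file in memory.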
When the program finishes counting the words in an input file, it must write each word and its number of occurrences to the corresponding output file, one word per line, with the word and the count separated by a space.
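The counting and output format can be sketched as follows. A `LinkedHashMap` is an assumption here: it keeps words in order of first occurrence, which is one reasonable output order, but the assignment does not state a required order. Counting is case-sensitive in this sketch, and the class name `WordCounter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts words and formats "word count" lines.
class WordCounter {

    // Counts occurrences of each word; insertion order is preserved (assumption).
    static Map<String, Integer> count(Iterable<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            // Start at 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Produces one line per word: the word, a space, then its count.
    static List<String> toLines(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            lines.add(entry.getKey() + " " + entry.getValue());
        }
        return lines;
    }
}
```

The resulting lines could then be written with `Files.write(outputFile, lines)`, which writes each element on its own line.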
For example, for the following input file:
EBooks posted since November 2003 with etext numbers OVER 10000 are
filed in a different way The year of a release date is no longer part
of the directory path The path is based on the etext number which is
identical to the filename The path to the file is made up of single
digits corresponding to all but the last digit in the filename For
example an eBook of filename 10234 would be found at
The program needs to create the corresponding output file that contains (this example includes only a subset of the output file):
...
filed 1
in 2
a 2
different 1
way 1
The 3
year 1
of 4
release 1
date 1
is 4
longer 1
part 1
the 6
directory 1
path 3
based 1
number 1
which 1
identical 1
to 3
filename 3
...
CSC435 Distributed Systems I - Winter 2024
Jarvis College of Computing and Digital Media
DePaul University
Evaluate your program on the 5 datasets and measure (inside the program) the amount of data read from the input and the amount of (wall) time it took to count the words of all files. Make sure to clear the OS file system cache before you run an evaluation. Plot a diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by the total time taken to count the words of the dataset).
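The throughput calculation described above can be sketched as follows, assuming the program accumulates the number of bytes read and measures elapsed wall time with `System.nanoTime()`. The class name `ThroughputMeter` and the figures in `main` are hypothetical and serve only to show the arithmetic.

```java
// Hypothetical helper: converts bytes read and elapsed time to MiB/second.
class ThroughputMeter {

    static final double BYTES_PER_MIB = 1024.0 * 1024.0;
    static final double NANOS_PER_SECOND = 1_000_000_000.0;

    // Throughput = dataset size in MiB divided by elapsed wall time in seconds.
    static double mibPerSecond(long bytesRead, long elapsedNanos) {
        double mib = bytesRead / BYTES_PER_MIB;
        double seconds = elapsedNanos / NANOS_PER_SECOND;
        return mib / seconds;
    }

    public static void main(String[] args) {
        // In a real run: record System.nanoTime() before processing starts,
        // add each file's size to bytesRead while counting, and take the
        // difference of System.nanoTime() after the last file is written.
        long bytesRead = 10L * 1024 * 1024;  // made-up figure: 10 MiB
        long elapsedNanos = 2_000_000_000L;  // made-up figure: 2 seconds
        System.out.printf("%.2f MiB/s%n", mibPerSecond(bytesRead, elapsedNanos));
    }
}
```

`System.nanoTime()` is used rather than `System.currentTimeMillis()` because it is intended for measuring elapsed time and is not affected by system clock adjustments.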
Answer the following questions:
1. What data structure(s) did you use to implement the program and why?
2. What is the difference between compute-intensive, memory-intensive and IO-intensive applications?
3. Is your program compute-intensive, memory-intensive or IO-intensive and why?
4. Why would the dataset size influence the performance of your program on the virtual machine?
