Question

Clean Dataset (20%)
For this part and the remaining parts of the assignment you will need to download 5 datasets. The datasets are ebooks stored as TXT files in folders. In order to download the datasets you have to log in with your DePaul username and password.
Implement a program in Java that receives as arguments an input directory and an output directory and that cleans the files from the input directory and writes the cleaned files to the output directory.
The cleaned files must follow the same folder structure as the input files. For example, if the program cleans the file stored at Dataset1/folder6/document265.txt, it must store the cleaned file in CleanedDataset1/folder6/document265.txt, where Dataset1 was the input directory and CleanedDataset1 was the output directory.
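The path mapping described above could be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the class name `MirrorSketch` and the methods `mirror`, `listInputFiles` and `prepareParent` are hypothetical names introduced here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: maps input files to the same relative location
// under the output directory.
final class MirrorSketch {

    // Dataset1/folder6/document265.txt -> CleanedDataset1/folder6/document265.txt
    static Path mirror(Path inputRoot, Path outputRoot, Path inputFile) {
        Path relative = inputRoot.relativize(inputFile);
        return outputRoot.resolve(relative);
    }

    // Collects every regular file below the input directory.
    static List<Path> listInputFiles(Path inputRoot) throws IOException {
        try (Stream<Path> paths = Files.walk(inputRoot)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    // Creates the output folder for a mirrored file if it does not exist yet.
    static void prepareParent(Path outputFile) throws IOException {
        Path parent = outputFile.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
    }

    public static void main(String[] args) throws IOException {
        Path inputRoot = Paths.get(args[0]);
        Path outputRoot = Paths.get(args[1]);
        for (Path in : listInputFiles(inputRoot)) {
            Path out = mirror(inputRoot, outputRoot, in);
            prepareParent(out);
            System.out.println(in + " -> " + out); // cleaning would happen here
        }
    }
}
```

Computing the relative path once and resolving it against the output root keeps the folder structure identical without any string manipulation of path separators.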
The input files are TXT files that contain words separated by separators and by delimiters. In this program, words are defined as any sequence of alphanumerical characters (0-9a-zA-Z). Delimiters are defined as the space, tab and new line characters (' ', \t, \n, \r\n, \r) and any other character is considered a separator.
The cleaning process that your program needs to implement has to abide by the following rules:
any \r character has to be eliminated;
any repeating sequence of delimiters must be replaced with the last delimiter in the sequence. For example, if your program encounters \r\n\r\n, it must replace it with \n, because \n was the last character in the delimiter sequence;
any separator must be eliminated. For example, if your program encounters document-01.txt, it must replace it with document01txt, because - and . are separator characters (they are not word characters or delimiter characters);
When the program has finished cleaning a file, the output file should contain words composed of alphanumerical characters separated by exactly one delimiter, and should contain neither separator characters nor repeating delimiters.
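The three rules above can be sketched as a single pass over the text. This is one possible reading, not a definitive implementation, and it rests on assumptions the assignment does not spell out: \r is dropped first and therefore never counts as the "last delimiter"; separators are treated as if they were not there, so delimiters on either side of a separator merge into one run; and a leading or trailing run of delimiters is reduced to one delimiter rather than trimmed. The names `CleanSketch`, `isWordChar`, `isDelimiter` and `clean` are hypothetical.

```java
// Hypothetical single-pass cleaner following the rules listed above.
final class CleanSketch {

    // Word characters are limited to ASCII 0-9, a-z, A-Z.
    static boolean isWordChar(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // '\r' is intentionally excluded because it is always eliminated.
    static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\n';
    }

    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        char pending = 0; // last delimiter seen in the current run, 0 if none
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isWordChar(c)) {
                if (pending != 0) {
                    out.append(pending); // emit only the last delimiter of the run
                    pending = 0;
                }
                out.append(c);
            } else if (isDelimiter(c)) {
                pending = c; // overwrite: only the last delimiter survives
            }
            // anything else ('\r' or a separator) is dropped
        }
        if (pending != 0) {
            out.append(pending); // keep one trailing delimiter, e.g. the final newline
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("OVER #10000, are\r\nfiled in a different way."));
    }
}
```

For large files, the same loop could be applied to a buffered reader's characters instead of a whole string, carrying `pending` across buffer boundaries.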
For example, if an input file has the following content:
EBooks posted since November 2003, with etext numbers OVER #10000, are
filed in a different way. The year of a release date is no longer part
of the directory path. The path is based on the etext number (which is
identical to the filename). The path to the file is made up of single
digits corresponding to all but the last digit in the filename. For
example an eBook of filename 10234 would be found at:
The output file of the corresponding example should be:
EBooks posted since November 2003 with etext numbers OVER 10000 are
filed in a different way The year of a release date is no longer part
of the directory path The path is based on the etext number which is
identical to the filename The path to the file is made up of single
digits corresponding to all but the last digit in the filename For
example an eBook of filename 10234 would be found at
Evaluate your program on the 5 datasets and measure (inside the program) the amount of data read from the input and the amount of (wall) time it took to clean all of the files. Make sure to clean the OS file system cache before you run an evaluation. Plot a diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by the total amount of time to clean the dataset).
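The throughput figure described above is a simple ratio, which could be computed inside the program along these lines. This is a sketch only: it assumes wall time is taken with `System.nanoTime()` (one common choice, not necessarily the one the assignment expects), and the class name `ThroughputSketch` and its methods are hypothetical.

```java
// Hypothetical helper for the MiB/second measurement.
final class ThroughputSketch {

    static final double BYTES_PER_MIB = 1024.0 * 1024.0;

    static double toMiB(long bytes) {
        return bytes / BYTES_PER_MIB;
    }

    // Throughput = dataset size in MiB divided by elapsed wall time in seconds.
    static double throughputMiBPerSecond(long bytesRead, long elapsedNanos) {
        double seconds = elapsedNanos / 1_000_000_000.0;
        return toMiB(bytesRead) / seconds;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long bytesRead = 0;
        // Placeholder: clean every file here and add each file's size to bytesRead.
        long elapsedNanos = System.nanoTime() - start;
        if (bytesRead > 0 && elapsedNanos > 0) {
            System.out.printf("%.2f MiB in %.3f s = %.2f MiB/s%n",
                    toMiB(bytesRead), elapsedNanos / 1e9,
                    throughputMiBPerSecond(bytesRead, elapsedNanos));
        }
    }
}
```

Printing one such line per dataset gives the (size, throughput) pairs needed for the diagram.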
Answer the following questions:
1. What is the difference between the wall time and the CPU time? What method or function did you use to measure the wall time?
2. How big are Dataset 1 and Dataset 6, measured in MB and MiB (Megabytes and Mebibytes, respectively)?
3. How fast is your disk, measured in MiB/second? How does your program's throughput fare in comparison to the speed of your disk, and why?
4. Why would the dataset size influence the performance of your program on the virtual machine? What command did you use to clean the OS file system cache?
5. Assume you have a copy of the Java file and the datasets on your local machine. How do you run the program on a Linux Ubuntu server using the scp command? Give the detailed steps you used to run it.
6. Which command did you use to clean the OS file system cache?
7. How did you plot the diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by the total amount of time to clean the dataset)?
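Two of the questions above involve distinctions that a few lines of Java can illustrate: MB versus MiB, and wall time versus CPU time. The sketch below is illustrative only and is not a submitted answer. It assumes `System.nanoTime()` for wall time and the standard `ThreadMXBean` for the current thread's CPU time, which some JVMs may not support; the class name `TimeAndUnitsSketch` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical demo of unit conversion and of wall time vs. CPU time.
final class TimeAndUnitsSketch {

    // 1 MB = 10^6 bytes (decimal), 1 MiB = 2^20 bytes (binary).
    static double toMB(long bytes) {
        return bytes / 1_000_000.0;
    }

    static double toMiB(long bytes) {
        return bytes / 1_048_576.0;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;

        // Sleeping advances the wall clock while the thread uses almost no CPU,
        // similar to a program that mostly waits on disk I/O.
        Thread.sleep(200);

        long wallNanos = System.nanoTime() - wallStart;
        System.out.printf("wall time: %.1f ms%n", wallNanos / 1e6);
        if (cpuSupported) {
            long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
            System.out.printf("CPU time:  %.1f ms%n", cpuNanos / 1e6);
        }

        long bytes = 500_000_000L; // arbitrary example size, not a real dataset
        System.out.printf("%d bytes = %.2f MB = %.2f MiB%n", bytes, toMB(bytes), toMiB(bytes));
    }
}
```

The same byte count always yields a smaller number in MiB than in MB, because a mebibyte is about 4.9% larger than a megabyte.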
