Question

Posted on Sep 09, 2024

Clean Dataset (20%)

For this part and the remaining parts of the assignment you will need to download 5 datasets. The datasets are ebooks stored as TXT files in folders. In order to download the datasets you have to log in with your DePaul username and password.

Implement a program in Java that receives as arguments an input directory and an output directory and that cleans the files from the input directory and writes the cleaned files to the output directory.

The cleaned files must follow the same folder structure as the input files. For example, if the program cleans the file stored at Dataset1/folder6/document265.txt, it must store the cleaned file in CleanedDataset1/folder6/document265.txt, where Dataset1 was the input directory and CleanedDataset1 was the output directory.
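The folder-mirroring requirement can be sketched with `java.nio.file.Path`. This is a minimal illustration, not the assignment's reference solution; the class name `PathMirror` and method name `mirror` are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class PathMirror {
    // Maps an input file to the same relative location under the output directory.
    static Path mirror(Path inputDir, Path outputDir, Path inputFile) {
        // Part of the path below the input directory, e.g. folder6/document265.txt
        Path relative = inputDir.relativize(inputFile);
        return outputDir.resolve(relative);
    }

    public static void main(String[] args) {
        Path out = mirror(
                Paths.get("Dataset1"),
                Paths.get("CleanedDataset1"),
                Paths.get("Dataset1", "folder6", "document265.txt"));
        System.out.println(out);
    }
}
```

Before writing the cleaned file, the parent folders would still need to exist; `Files.createDirectories(out.getParent())` is one way to create them.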

The input files are TXT files that contain words separated by separators and by delimiters. In this program, words are defined as any sequence of alphanumerical characters (0-9a-zA-Z). Delimiters are defined as the space, tab and new line characters (' ', \t, \n, \r\n, \r) and any other character is considered a separator.
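The three character classes can be expressed as a small classifier. The sketch below is one possible reading of the definitions; the names `CharKind`, `Kind`, and `classify` are hypothetical. It deliberately tests ASCII ranges instead of calling `Character.isLetterOrDigit`, which would also accept non-ASCII letters that the assignment treats as separators.

```java
class CharKind {
    enum Kind { WORD, DELIMITER, SEPARATOR }

    // Classifies one character according to the assignment's definitions.
    static Kind classify(char c) {
        boolean digit = c >= '0' && c <= '9';
        boolean lower = c >= 'a' && c <= 'z';
        boolean upper = c >= 'A' && c <= 'Z';
        if (digit || lower || upper) {
            return Kind.WORD;
        }
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            return Kind.DELIMITER;
        }
        // Everything else (punctuation, symbols, non-ASCII letters) is a separator.
        return Kind.SEPARATOR;
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));
        System.out.println(classify('\t'));
        System.out.println(classify('-'));
    }
}
```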

The cleaning process that your program needs to implement has to abide by the following rules:

any \r character has to be eliminated;

any repeating sequence of delimiters must be replaced with the last delimiter in the sequence. For example, if your program encounters \r\n\r\n, it must replace it with \n, because \n was the last character in the delimiter sequence;

any separator must be eliminated. For example, if your program encounters document-01.txt, it must replace it with document01txt, because - and . are separator characters (they are not word characters or delimiter characters);

When the program has finished cleaning a file, the output file should contain only words composed of alphanumerical characters, with consecutive words separated by exactly one delimiter, and should contain no separator characters and no repeating delimiters.
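The cleaning rules can be combined into a single pass over the characters. The sketch below is one interpretation, not a confirmed reference solution; `Cleaner` and `clean` are hypothetical names. It assumes that a separator sitting between two delimiters does not break a delimiter run (so `a - b` becomes `a b`), and that a leading or trailing delimiter is kept, since the rules do not mention trimming.

```java
class Cleaner {
    // Applies the three rules in one pass: drop '\r', drop separators,
    // and collapse each run of delimiters to the last delimiter in the run.
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        char pending = 0; // last delimiter seen since the previous word character; 0 means none
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean word = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z');
            if (word) {
                if (pending != 0) {
                    out.append(pending); // emit exactly one delimiter before the word
                    pending = 0;
                }
                out.append(c);
            } else if (c == ' ' || c == '\t' || c == '\n') {
                pending = c; // a later delimiter in the same run overwrites this one
            }
            // '\r' and every separator character fall through and are dropped
        }
        if (pending != 0) {
            out.append(pending); // assumption: a trailing delimiter is preserved
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("document-01.txt")); // prints "document01txt"
    }
}
```

For large ebooks, the same loop could read from a `BufferedReader` and write to a `BufferedWriter` character by character instead of holding the whole file in a `String`.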

For example, if an input file has the following content:

EBooks posted since November 2003, with etext numbers OVER #10000, are

filed in a different way. The year of a release date is no longer part

of the directory path. The path is based on the etext number (which is

identical to the filename). The path to the file is made up of single

digits corresponding to all but the last digit in the filename. For

example an eBook of filename 10234 would be found at:

The output file of the corresponding example should be:

EBooks posted since November 2003 with etext numbers OVER 10000 are

filed in a different way The year of a release date is no longer part

of the directory path The path is based on the etext number which is

identical to the filename The path to the file is made up of single

digits corresponding to all but the last digit in the filename For

example an eBook of filename 10234 would be found at

Evaluate your program on the 5 datasets and measure (inside the program) the amount of data read from the input and the amount of (wall) time it took to clean all of the files. Make sure to clean the OS file system cache before you run an evaluation. Plot a diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by total amount of time to clean the dataset).
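The measurement described above can be sketched as follows. This is an illustrative outline under stated assumptions: `Throughput` and `mibPerSecond` are hypothetical names, and `Files.readAllBytes` stands in for the real read-clean-write step. Wall time is taken with `System.nanoTime()`, which is a monotonic clock suited to measuring elapsed intervals.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class Throughput {
    static final double BYTES_PER_MIB = 1024.0 * 1024.0;

    // Throughput in MiB/second from a byte count and an elapsed wall time in nanoseconds.
    static double mibPerSecond(long bytesRead, long elapsedNanos) {
        double mib = bytesRead / BYTES_PER_MIB;
        double seconds = elapsedNanos / 1_000_000_000.0;
        return mib / seconds;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java Throughput <inputDir>");
            return;
        }
        long start = System.nanoTime();
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Paths.get(args[0]))) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long bytesRead = 0;
        for (Path file : files) {
            // Stand-in for the real work: read the file, clean it, write the result.
            bytesRead += Files.readAllBytes(file).length;
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%.2f MiB in %.3f s = %.2f MiB/s%n",
                bytesRead / BYTES_PER_MIB,
                elapsed / 1_000_000_000.0,
                mibPerSecond(bytesRead, elapsed));
    }
}
```

Running this once per dataset gives one (size in MiB, MiB/second) pair per dataset, which are the points the requested diagram plots.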

Answer the following questions:

1. What is the difference between the wall time and the CPU time? What method or function did you use to measure the wall time?

2. How big are Dataset 1 and Dataset 6, measured in MB and MiB (Megabytes and Mebibytes, respectively)?

3. How fast is your disk, measured in MiB/second? How does your program's throughput fare in comparison to the speed of your disk and why?

4. Why would the dataset size influence the performance of your program on the virtual machine? What command did you use to clean the OS file system cache?

5. Assume you have a copy of the Java file and the datasets on the local machine. How do you run the program on a Linux Ubuntu server using the scp command? Give the detailed steps used to run it.

6. Which command did you use to clean the OS file system cache?

7. How did you plot a diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by total amount of time to clean the dataset)?
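Two of the questions above turn on definitions that a short sketch can make concrete: wall time versus CPU time, and MB (10^6 bytes) versus MiB (2^20 bytes). The class `TimeAndUnits` and its helper names are hypothetical. The example sleeps for 200 ms, during which wall time advances while the thread consumes almost no CPU time; CPU time is read through the standard `ThreadMXBean`, which may be unsupported on some JVMs, so the code checks first.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class TimeAndUnits {
    // 1 MB = 1,000,000 bytes; 1 MiB = 1,048,576 bytes.
    static double toMB(long bytes) { return bytes / 1_000_000.0; }
    static double toMiB(long bytes) { return bytes / (1024.0 * 1024.0); }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = bean.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? bean.getCurrentThreadCpuTime() : -1;

        Thread.sleep(200); // the thread waits: wall time passes, CPU time barely moves

        long wallNanos = System.nanoTime() - wallStart;
        System.out.printf("wall time: %.1f ms%n", wallNanos / 1e6);
        if (cpuStart >= 0) {
            long cpuNanos = bean.getCurrentThreadCpuTime() - cpuStart;
            System.out.printf("CPU time:  %.1f ms%n", cpuNanos / 1e6);
        }

        long bytes = 500L * 1024 * 1024; // an arbitrary example size, not a real dataset
        System.out.printf("%d bytes = %.3f MB = %.3f MiB%n", bytes, toMB(bytes), toMiB(bytes));
    }
}
```

A program that spends most of its time waiting on disk reads shows the same pattern as the sleep here: wall time well above CPU time.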
