Question
Clean Dataset (20%)
For this part and the remaining parts of the assignment you will need to download 5 datasets. The datasets are ebooks stored as TXT files in folders. In order to download the datasets you have to log in with your DePaul username and password.
Implement a program in Java that receives as arguments an input directory and an output directory and that cleans the files from the input directory and writes the cleaned files to the output directory.
The cleaned files must follow the same folder structure as the input files. For example, if the program cleans the file stored at Dataset/folder/document.txt, it must store the cleaned file in CleanedDataset/folder/document.txt, where Dataset is the input directory and CleanedDataset is the output directory.
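The folder-structure requirement can be sketched with java.nio.file.Path. This is a minimal illustration, not a required design: the class name PathMapper and the method mapToOutput are hypothetical, and the directory and file names are taken from the example in the assignment text, read as a slash-separated path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathMapper {
    // Maps an input file to its location under the output directory,
    // keeping the sub-folder structure relative to the input directory.
    static Path mapToOutput(Path inputDir, Path outputDir, Path inputFile) {
        Path relative = inputDir.relativize(inputFile); // e.g. folder/document.txt
        return outputDir.resolve(relative);
    }

    public static void main(String[] args) {
        Path in = Paths.get("Dataset");
        Path out = Paths.get("CleanedDataset");
        Path file = Paths.get("Dataset", "folder", "document.txt");
        // Prints the mapped path using the platform's separator.
        System.out.println(mapToOutput(in, out, file));
    }
}
```

Before writing a cleaned file, the program would also need to create the parent folders, for example with Files.createDirectories on the parent of the mapped path.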
The input files are TXT files that contain words separated by separators and by delimiters. In this program, words are defined as any sequence of alphanumerical characters [a-zA-Z0-9]. Delimiters are defined as the space, tab and new line characters (' ', '\t', '\n', '\r'), and any other character is considered a separator.
The cleaning process that your program needs to implement has to abide by the following rules:
any '\r' character has to be eliminated;
any repeating sequence of delimiters must be replaced with the last delimiter in the sequence. For example, if your program encounters "\r\n\r\n" it must replace it with "\n" because "\n" was the last character in the delimiter sequence;
any separator must be eliminated. For example, if your program encounters the text document and txt with separator characters mixed in, it must replace it with documenttxt, because those characters are separators: they are neither word characters nor delimiter characters;
When the program has finished cleaning a file, the output file should contain words composed of alphanumerical characters, separated by only one delimiter, and should contain neither separator characters nor repeating delimiters.
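One possible reading of these rules is a single pass that remembers only the most recent delimiter and writes it when the next word character arrives. The sketch below makes several assumptions that the assignment does not spell out: word characters are ASCII letters and digits, a '\r' is dropped and never becomes the remembered delimiter, separators do not interrupt a delimiter sequence, and a delimiter at the very start or end of the input is kept once. The names Cleaner, isWordChar, isDelimiter and clean are hypothetical.

```java
public class Cleaner {
    // Assumption: words are made of ASCII letters and digits only.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Delimiters: space, tab, and the two line-ending characters.
    static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        char pending = 0; // last delimiter seen since the previous word character; 0 = none
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isWordChar(c)) {
                if (pending != 0) {   // emit exactly one delimiter per sequence
                    out.append(pending);
                    pending = 0;
                }
                out.append(c);
            } else if (isDelimiter(c)) {
                if (c != '\r') {      // '\r' is eliminated, so it is never remembered
                    pending = c;
                }
            }
            // Any other character is a separator and is simply skipped.
        }
        if (pending != 0) {
            out.append(pending);      // assumption: keep one trailing delimiter
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("filed in a  different way.\r\n\r\nThe year"));
    }
}
```

For real datasets the same state machine would normally run over a buffered stream rather than a String, so that large files are not held in memory at once.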
For example, if an input file has the following content:
EBooks posted since November with etext numbers OVER # are
filed in a different way. The year of a release date is no longer part
of the directory path. The path is based on the etext number which is
identical to the filename The path to the file is made up of single
digits corresponding to all but the last digit in the filename. For
example an eBook of filename would be found at:
The output file of the corresponding example should be:
EBooks posted since November with etext numbers OVER are
filed in a different way The year of a release date is no longer part
of the directory path The path is based on the etext number which is
identical to the filename The path to the file is made up of single
digits corresponding to all but the last digit in the filename For
example an eBook of filename would be found at
Evaluate your program on the datasets and measure, inside the program, the amount of data read from the input and the amount of wall time it took to clean all of the files. Make sure to clean the OS file system cache before you run an evaluation. Plot a diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by the total amount of time to clean the dataset).
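The measurement described above could be sketched as follows. This example only reads the files and counts bytes, so the cleaning step itself is left out; the class name ThroughputMeter and its methods are hypothetical. It assumes 1 MiB = 1024 × 1024 bytes and 1 MB = 1,000,000 bytes, and uses System.nanoTime() as one common way to take wall-clock intervals in Java.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ThroughputMeter {
    static final double BYTES_PER_MIB = 1024.0 * 1024.0; // mebibyte
    static final double BYTES_PER_MB = 1_000_000.0;      // megabyte

    static double toMiB(long bytes) { return bytes / BYTES_PER_MIB; }
    static double toMB(long bytes) { return bytes / BYTES_PER_MB; }

    // Throughput = data size in MiB divided by elapsed wall time in seconds.
    static double throughputMiBPerSecond(long bytesRead, long elapsedNanos) {
        return toMiB(bytesRead) / (elapsedNanos / 1_000_000_000.0);
    }

    // Reads every regular file under dir and returns the number of bytes read.
    static long readAll(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        for (Path f : files) {
            try (InputStream in = Files.newInputStream(f)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n; // a real cleaner would process the buffer here
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ThroughputMeter <input directory>");
            return;
        }
        long start = System.nanoTime();
        long bytes = readAll(Paths.get(args[0]));
        long elapsed = System.nanoTime() - start;
        System.out.printf("read %.2f MB (%.2f MiB) in %.3f s, %.2f MiB/s%n",
                toMB(bytes), toMiB(bytes), elapsed / 1e9,
                throughputMiBPerSecond(bytes, elapsed));
    }
}
```

Running it once per dataset gives one (size in MiB, throughput in MiB/second) pair per dataset, which are the points the diagram needs.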
Answer the following questions:
What is the difference between the wall time and the CPU time? What method or function did you use to measure the wall time?
How big are Dataset and Dataset, measured in MB and MiB (megabytes and mebibytes, respectively)?
How fast is your disk, measured in MiB/second? How does your program throughput fare in comparison to the speed of your disk and why?
Why would the dataset size influence the performance of your program on the virtual machine? What command did you use to clean the OS file system cache?
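As an illustration for the first question in the list, the sketch below measures both wall time and CPU time around a sleep. Wall time is elapsed real time, while CPU time counts only the time the thread actually spent executing on a processor, so a sleeping (or I/O-blocked) thread accumulates wall time but very little CPU time. The class name WallVsCpu and the method measureSleep are hypothetical; System.nanoTime() and ThreadMXBean are standard Java APIs, and per-thread CPU time is assumed to be available on the JVM in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class WallVsCpu {
    // Returns {wallNanos, cpuNanos} for a sleep of the given length.
    // cpuNanos is -1 when the JVM cannot report per-thread CPU time.
    static long[] measureSleep(long millis) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = bean.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? bean.getCurrentThreadCpuTime() : -1;

        Thread.sleep(millis); // the thread waits, so almost no CPU time is used

        long wall = System.nanoTime() - wallStart;
        long cpu = cpuSupported ? bean.getCurrentThreadCpuTime() - cpuStart : -1;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] t = measureSleep(200);
        System.out.printf("wall time: %.1f ms, CPU time: %.1f ms%n",
                t[0] / 1e6, t[1] / 1e6);
    }
}
```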
Have a copy of the Java file and the datasets on the local machine. How do you run the program on a Linux Ubuntu server using the scp command? Give the detailed steps used to run it.
Which command did you use to clean the OS file system cache?
How did you plot the diagram showing how the size of the datasets, measured in MiB, influences the throughput of your program, measured in MiB/second (dataset size divided by the total amount of time to clean the dataset)?