Question

1 Approved Answer

Posted on Sep 22, 2024

This has to be done in commands. Using only standard Linux commands, generate a histogram of all three-word sequences in the EEG Report database provided.

1. Using only standard Linux commands, generate a histogram of all three-word sequences in the EEG Report database provided (see /data/courses/ece_3822/current/eeg_reports). We refer to these sequences as trigrams. Your output should list these sequences in decreasing order of occurrence. Compute the number of occurrences (essentially a histogram), the percentage of time a trigram occurs (the number of occurrences / the total number of trigrams), and a cumulative distribution (which is a useful representation because it shows how many trigrams are needed to cover 80% of the data).

Note that your trigram counter should be case-insensitive and ignore punctuation. For example, suppose you have two text files, file1.txt and file2.txt. These files contain the following text:

file1.txt: See Jane run. See John run
file2.txt: See jane run. See Jane run

The trigrams present in this data are:

see jane run
jane run see
run see john
see john run
see jane run
jane run see
run see jane
see jane run

The output of your command line should be:

Trigram         No.   Percentage   Cumulative
see jane run    3     37.5000%     37.5000%
jane run see    2     25.0000%     62.5000%
run see john    1     12.5000%     75.0000%
see john run    1     12.5000%     87.5000%
run see jane    1     12.5000%     100.0000%

Trigrams should be counted even when text is split across lines. However, you do not need to deal with beginning or end of file boundaries (edge effects).

Since the list of trigrams you compute for the entire database will be very long, abbreviate your list to show: (1) the 10 most frequently occurring trigrams, (2) the trigrams that occur at the 25%, 50% and 75% percentiles, and (3) the 10 least frequently occurring trigrams. The output of your code MUST contain the columns above but does not need to contain the vertical or horizontal lines. In your document, you can insert the data into an MS Word table.

2. Demonstrate that you can run the command you construct for task no. 1 from within a shell script. Create a shell script called compute_trigrams.sh, insert your command into the file, set the permissions and other properties correctly, run it, and demonstrate that it gives the proper output. This shell script must take a root directory as input, search all files below that directory, and produce the same output as in task no. 1. It should be run as follows:

compute_trigrams.sh /data/courses/ece_3822/current/eeg_reports

This shell script should run on any popular version of Linux and run on a machine other than the AWS server. To test this, we will copy your script to our local Linux cluster and run it there. This is your first exposure to the issue of portability.
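
One possible way to approach both tasks is sketched below. It is a minimal sketch, not a reference solution, and it rests on several assumptions that the question does not settle: the function name compute_trigrams is hypothetical; "ignore punctuation" is read as deleting every character that is not an ASCII letter, digit, space, or tab; the three-word window is restarted at the start of each file (which matches the 8 trigrams in the two-file example); ties in the count are broken alphabetically; and the trigram "at" the 25%, 50%, or 75% percentile is taken to be the first row whose cumulative share reaches that value. Only find, awk, sort, and uniq are used.

```shell
#!/bin/sh
# compute_trigrams.sh: histogram of three-word sequences under a root directory.
# Sketch under stated assumptions: per-file windows, alphabetical tie order,
# and "percentile row" = first row whose cumulative share reaches 25/50/75%.

compute_trigrams() {
    root=$1

    # Stage 1: emit one lowercase, punctuation-free trigram per line.
    # The window (w1, w2) carries over line breaks but resets on each new file.
    find "$root" -type f -exec awk '
        FNR == 1 { w1 = ""; w2 = "" }
        {
            line = tolower($0)
            gsub(/[^a-z0-9 \t]/, "", line)   # assumption: delete everything else
            n = split(line, words, " ")
            for (i = 1; i <= n; i++) {
                if (w1 != "" && w2 != "") print w1, w2, words[i]
                w1 = w2
                w2 = words[i]
            }
        }' {} + |
    # Stage 2: count identical trigrams, most frequent first.
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2 |
    # Stage 3: add percentage and cumulative columns, then abbreviate the list.
    awk '
        { count[NR] = $1; tri[NR] = $2 " " $3 " " $4; total += $1 }
        END {
            printf "%-40s %8s %11s %11s\n", "Trigram", "No.", "Percentage", "Cumulative"
            for (i = 1; i <= NR; i++) {
                prev = cum
                cum += count[i]
                show = (i <= 10 || i > NR - 10)          # top 10 and bottom 10
                for (p = 25; p <= 75; p += 25)           # rows crossing 25/50/75%
                    if (100 * prev < p * total && 100 * cum >= p * total) show = 1
                if (show)
                    printf "%-40s %8d %10.4f%% %10.4f%%\n", tri[i], count[i], 100 * count[i] / total, 100 * cum / total
            }
        }'
}

# Run only when a root directory is supplied on the command line.
if [ "$#" -ge 1 ]; then
    compute_trigrams "$1"
fi
```

To use it as the question describes, save the block as compute_trigrams.sh, make it executable with chmod +x compute_trigrams.sh, and run ./compute_trigrams.sh /data/courses/ece_3822/current/eeg_reports. On the two-file example the counts and percentages agree with the expected table, although the three trigrams that occur once appear in alphabetical order rather than in order of first occurrence.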