Question

Below is the assignment at hand with all the details. I was looking for a .sh file with code that does the following. Please attach some screenshots of the commands that ran the file and created the files wanted in the problem.

Implement a bash script that analyzes the email addresses in a set of text documents. The script finds 1) the email addresses that appear the most times in the documents, along with their numbers of occurrences, and 2) a complete list of email addresses with the same domains.

1. Objectives

- To further improve bash scripting skills
- To learn how to use regular expressions in scripts
- To learn how to use the Linux facilities grep, sort, cut, and uniq

2. Background

All the documents are saved in files and are organized in a directory with several levels of subdirectories. Your bash script therefore needs to search all the files in the directory (including the subdirectories at all levels) for email addresses, rank the email addresses by number of occurrences, and print out the top N email addresses, where N is a number between 1 and 100 (inclusive) specified as an argument on the command line when the script runs. Your script also extracts the domains of these top N email addresses and searches the documents for a list of email addresses with the same domains.

An email address has the format localpart@domain. It consists of three parts: a local-part, an @ symbol, and a domain name. For example, in my email address xiaoning.ding@njit.edu, xiaoning.ding is the local-part and njit.edu is the domain name.

Different email systems may have different requirements for valid email addresses. In this assignment, however, the requirements for valid email addresses (i.e., the rules that your script uses to locate the email addresses in the source code) are as follows:

Case-insensitive

The local-part of the email address may use any of these ASCII characters:
- uppercase and lowercase letters A to Z and a to z;
- digits 0 to 9;
- dot ., provided that it is not the first or last character and that it does not appear consecutively (e.g., John.Doe@example.com and john.m.doe@example.com are allowed, but John..Doe@example.com is not allowed).
The domain name part is a list of dot-separated labels. Each label consists of:
- uppercase and lowercase letters A to Z and a to z;
- digits 0 to 9, provided that the label is NOT the top-level domain name (e.g., com, edu, gov; the top-level domain name must consist of letters only);
- hyphen -, provided that it is not the first or last character in the label.

3. Detailed Requirements and Instructions

Name the program in the pattern SECTION#_NJITID#_1.sh. SECTION# is the three-digit section number of the CS288 section you registered for (e.g., 001, 003; don't drop the leading 0s). NJITID# is the eight-digit NJIT ID (not your UCID; Rutgers students also have NJIT IDs). The 1 means this is the first problem. So your file name is something like 001_00123456_1.sh (DO NOT COPY THIS AS YOUR FILE NAME!). The grader may use a script to find and test your script, and it will not find your script if it has a different name. Submissions that are not named correctly will lose 10% of the points for this problem (10% of 100 points).

The pathname of the directory containing the documents and the number of top email addresses are specified on the command line. The format of the command for running the script is as follows (Important! If you don't follow it strictly, the script testing your script may not be able to run your script correctly, and you may get a lower grade):

    ./your_script N pathname_of_a_directory

In the command line, N is the number of email addresses with the most occurrences.

Your script needs to generate two files in the current working directory. One file, named topemails.txt, contains the email addresses that appear the most times in the documents and their numbers of occurrences, one line per email address, with the number of occurrences followed by a space and the email address. The list should be sorted so that the email addresses appearing more times in the documents are listed at the top.
Email addresses appearing the same number of times should be sorted alphabetically. A sample output when searching for the top N email addresses in the Linux kernel source code is as follows:

    540 dhowells@redhat.com
    516 davem@davemloft.net
    382 wlanfae@realtek.com
    356 tim.gardner@canonical.com
    349 Larry.Finger@lwfinger.net
    349 michael.chan@broadcom.com
    305 perex@perex.cz
    270 apw@canonical.com
    264 devel@lists.sourceforge.net
    248 leann.ogasawara@canonical.com
    235 ben@simtec.co.uk
    209 behlendorf1@llnl.gov
    205 dagb@cs.uit.no
    201 tiwai@suse.de

The other file, named emails_top_domains.txt, contains all the email addresses with the same domains as those top N email addresses, one email address per line. Note that a subdomain (e.g., cs.njit.edu) and a domain (e.g., njit.edu) are considered to be different domains. The file should contain unique email addresses sorted alphabetically using sort.

Do not try to use a single regular expression to describe all the rules that valid email addresses follow. To reduce the complexity, when you search for the strings that follow rule #1, rule #2, and rule #3, you can first find the strings that follow rule #1 and rule #2, and then search within those results for the strings that follow rule #3. When you use grep, you may be particularly interested in the -I, -h, -o, and -r options. Your script does not need to check the binary (i.e., non-text) files in the directory. You don't need to write code specifically to traverse the directory tree if you use grep -r. Reading the manual pages of sort, uniq, and cut will help you figure out how to find the top N email addresses and their domains.

You may use the Linux kernel source code and the source code of the PostgreSQL database server to test your script. The Linux kernel source code has more than 50 thousand files under 3000+ directories and subdirectories. Its location is /usr/src/linux-source-4.4.0. PostgreSQL has about 8 thousand files in about 500 directories and subdirectories.
Its location is /usr/src/postgresql-11rc1. You may compare the results of your script with those generated by search_emails, but your script is not allowed to use search_emails.

Grading:

1. Your script can run without any error message with the command below, where ./new_dir is a directory without any files in it. ---- 10 points

    ./your_script 1 ./new_dir

2. Your script can finish within 1 minute without an error when it is run with the following command: ---- 10 points

    ./your_script 100 /usr/src/linux-source-4.4.0

3. When your script is run to search the Linux source code, the topemails.txt file generated by your script should contain the emails that appear the most times. ---- 30 points

Your script will be run with the following command, where N is a random integer between 50 and 100 chosen by our grader:

    ./your_script N /usr/src/linux-source-4.4.0

Then search_emails will be run with the same arguments. The email addresses in the two topemails.txt files will be cut out and then compared using diff -y --suppress-common-lines. The number of differing lines should be less than N/10.

4. When your script is run to search the Linux source code, the emails_top_domains.txt file generated by your script should include all the emails with the same domains as those appearing the most times in the source code. ---- 30 points

The emails_top_domains.txt file that your script generates in the previous step will be examined. First, the domains of the email addresses in emails_top_domains.txt will be cut out and sorted, and the unique domains collected into a list. Then another list of unique domains will be extracted and sorted in the same way from the email addresses in topemails.txt. The two lists of unique domains from the two files should be exactly the same (14 points). This allows you to collect some points even if your script fails to include the correct email addresses in topemails.txt.
But if your script cannot generate topemails.txt, or topemails.txt is in the wrong format, no points can be collected from this part. The emails_top_domains.txt file that your script generates will also be compared with the emails_top_domains.txt file generated by search_emails using diff -y --suppress-common-lines. The number of differing lines should be less than M/10, where M is the number of lines in the emails_top_domains.txt file generated by search_emails.

5. When your script is run to search the PostgreSQL source code, the topemails.txt file generated by your script is similar to that generated by search_emails. ---- 10 points

6. When your script is run to search the PostgreSQL source code, the emails_top_domains.txt file generated by your script should include all the emails with the same domains as those appearing the most times in the source code. ---- 10 points (4 points if all the domain names in your emails_top_domains.txt are the same as the domain names in your topemails.txt, and 6 additional points if your emails_top_domains.txt is identical to that generated by search_emails).
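
One way the requirements above could be approached is sketched below. It is not the instructor's solution and has not been checked against search_emails. The function names (find_emails, top_emails, same_domain_emails, main) and both regular expressions are assumptions made for illustration. The sketch follows the two-stage hint: a loose grep -o pass collects candidates, then a whole-string grep -x pass enforces the local-part and domain rules. It also assumes that a domain needs at least one dot, that letter case is preserved rather than folded when counting, and that a candidate breaking the local-part rules is dropped rather than trimmed to a valid suffix.

```shell
#!/usr/bin/env bash
# Sketch only: names, regexes, and the candidate/filter split are assumptions.

export LC_ALL=C   # byte-wise collation, so sort and uniq agree everywhere

# Stage 1 (loose): chars@chars ending in a letter, which leaves out a
# sentence-ending period such as the one in "write to foo@bar.com."
CANDIDATE='[A-Za-z0-9.]+@[A-Za-z0-9.-]*[A-Za-z]'
# Stage 2 (strict): the candidate as a whole must satisfy the stated rules.
LOCAL='[A-Za-z0-9]+(\.[A-Za-z0-9]+)*'            # no leading, trailing, or double dot
LABEL='[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'   # hyphen only inside a label
STRICT="${LOCAL}@(${LABEL}\.)+[A-Za-z]+"         # top-level label: letters only

# Print every valid address under directory $1, one per line, duplicates kept.
# -r recurse, -I skip binary files, -h no file names, -o matched text only.
find_emails() {
    grep -rIhoE -e "$CANDIDATE" -- "$1" 2>/dev/null | grep -xE -e "$STRICT"
}

# stdin: addresses. Print the $1 most frequent as "count address",
# highest count first, ties in alphabetical (byte) order.
top_emails() {
    sort | uniq -c | sed 's/^ *//' | sort -t ' ' -k1,1nr -k2,2 | head -n "$1"
}

# $1: file of "count address" lines; stdin: all valid addresses.
# Print the unique addresses whose domain equals one of the domains in $1.
same_domain_emails() {
    local domains
    domains=$(mktemp) || return 1
    cut -d ' ' -f 2 "$1" | cut -d @ -f 2 | sort -u > "$domains"
    # Exact match on the text after @, so cs.njit.edu and njit.edu stay distinct.
    sort -u | awk -F @ 'NR == FNR { want[$0]; next } $2 in want' "$domains" -
    rm -f "$domains"
}

main() {
    local n dir all
    n=$1
    dir=$2
    case $n in
        '' | *[!0-9]* | 0* | ????*) n=0 ;;   # not a plain integer in range
    esac
    if [ "$n" -lt 1 ] || [ "$n" -gt 100 ] || [ ! -d "$dir" ]; then
        echo "usage: $0 N directory   (1 <= N <= 100)" >&2
        return 1
    fi
    all=$(mktemp) || return 1
    find_emails "$dir" > "$all"
    top_emails "$n" < "$all" > topemails.txt
    same_domain_emails topemails.txt < "$all" > emails_top_domains.txt
    rm -f "$all"
}
```

Adding main "$@" as the last line would make the file runnable as ./your_script N pathname_of_a_directory. Scanning the tree once into a temporary file and reusing it for both outputs avoids a second pass over a large source tree; whether that keeps the run inside the one-minute limit on the kernel sources has not been measured here.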
