
Question


Question 5: Spark (40%)

Write a function proc_headers(lst) that takes a list lst of Email headers, and returns a list of tuples (E1, E2) for every Email transmission from E1 to E2.

Each header in lst is described by a tuple (FROM, TO, CC, BCC), where FROM is the Email address of the sender, and each of TO, CC, and BCC is a string holding a list of comma-separated Email addresses matching the csv_regex pattern from Question 2. Your code should be written as the following series of Spark transformations:

1. Use sc.parallelize() to create a base RDD from lst.
2. Apply a filter() transformation to the base RDD created in step 1 to exclude all tuples where FROM is not a valid Email address. Use the valid_email lambda from Question 4.1.
3. Apply a map() transformation to the RDD produced at step 2 to convert each (FROM, TO, CC, BCC) tuple to a (FROM, RECIPIENTS) tuple, where RECIPIENTS is the concatenation of TO, CC, and BCC obtained by a lambda that composes two concat_csv_strings lambdas from Question 4.2.
4. Apply a map() transformation to the RDD produced at step 3 to convert each (FROM, RECIPIENTS) tuple to a (FROM, EMAIL_SEQ) tuple, where EMAIL_SEQ is the sequence of Email addresses in RECIPIENTS extracted using the generator function gen_seq_from_csv_string() from Question 3 above.
5. Apply a flatMap() transformation to the RDD produced at step 4 to convert each (FROM, EMAIL_SEQ) tuple to a sequence of tuples (FROM, E) for every Email E in EMAIL_SEQ. Use the val_by_vec lambda from Question 4.3.
6. Apply a filter() transformation to the result of step 5 to exclude all tuples with an invalid recipient address. Use the valid_email lambda from Question 4.1.
7. Apply another filter() transformation to the outcome of step 6 to exclude all tuples having the same sender and recipient Emails. Use the not_self_loop lambda from Question 4.4.
8. Apply a collect() action to the RDD produced at step 7, and return the result.

In [7]:

    def proc_headers(lst):
        """
        lst: a list of tuples (FROM, TO, CC, BCC) representing Email headers.
        Returns a list of tuples (E1, E2) for every Email transmission from E1
        to E2, using a series of Spark operations as described in the question.
        Replace pass with your code. Use sc to reference the Spark context.
        """
        # sc.parallelize( ... )
        pass

You can use the following code to test your implementation of proc_headers(). Note that several of the addresses are deliberately malformed (the sender of header2, jeffrey.shankman@enron..com with its doubled dot, and cathy phillips@enron.com with its embedded space), so the filter() steps have something to remove:

    header1 = ('bill.cordes@enron.com',
               'mike.mcconnell@enron.com,cathy.phillips@enron.com,john.haggerty@enron.com',
               'george.mcclellan@enron.com,tom.kearney@enron.com',
               'tom.kearney@enron.com,cathy.phillips@enron.com')
    header2 = ('mike.mcconnell@enron. .com',
               'bill.cordes@enron.com,tom.kearney@enron.com,cathy.phillips@enron.com,john.haggerty@enron.com',
               'george.mcclellan@enron.com',
               'mike.mcconnell@enron.com')
    header3 = ('stuart.staley@enron.com',
               'mike.mcconnell@enron.com,jeffrey.shankman@enron..com',
               'bill.cordes@enron.com,tom.kearney@enron.com,cathy phillips@enron.com',
               'george.mcclellan@enron.com,stuart.staley@enron.com')
    print('\n'.join(str(t) for t in proc_headers([header1, header2, header3])))

The output produced by the line above when executed with the model implementation of proc_headers() was as follows (header2 contributes nothing because its sender address is invalid, and stuart.staley's self-addressed Email is dropped as a self-loop):

    ('bill.cordes@enron.com', 'mike.mcconnell@enron.com')
    ('bill.cordes@enron.com', 'cathy.phillips@enron.com')
    ('bill.cordes@enron.com', 'john.haggerty@enron.com')
    ('bill.cordes@enron.com', 'george.mcclellan@enron.com')
    ('bill.cordes@enron.com', 'tom.kearney@enron.com')
    ('bill.cordes@enron.com', 'tom.kearney@enron.com')
    ('bill.cordes@enron.com', 'cathy.phillips@enron.com')
    ('stuart.staley@enron.com', 'mike.mcconnell@enron.com')
    ('stuart.staley@enron.com', 'bill.cordes@enron.com')
    ('stuart.staley@enron.com', 'tom.kearney@enron.com')
    ('stuart.staley@enron.com', 'george.mcclellan@enron.com')
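The pipeline is fully specified by the eight steps above, so it can be sketched even without the helper code from Questions 2-4. The sketch below mirrors each Spark transformation with a plain-Python equivalent (a list stands in for the RDD, list comprehensions for filter()/map(), itertools.chain for flatMap()), so it runs without a SparkContext. The helper lambdas valid_email, concat_csv_strings, gen_seq_from_csv_string, val_by_vec, and not_self_loop are assumptions: their real definitions are in Questions 2-4, which this page does not show.

```python
import re
from itertools import chain

# ASSUMED helpers: stand-ins for the lambdas defined in Questions 2-4,
# whose actual definitions are not shown on this page.
EMAIL_RE = re.compile(r'^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$')        # address pattern (assumed)
valid_email = lambda s: EMAIL_RE.match(s) is not None            # Question 4.1 (assumed)
concat_csv_strings = lambda a, b: ','.join(x for x in (a, b) if x)  # Question 4.2 (assumed)

def gen_seq_from_csv_string(s):                                  # Question 3 (assumed)
    """Yield each address in a comma-separated string, stripped of spaces."""
    for addr in s.split(','):
        yield addr.strip()

val_by_vec = lambda t: [(t[0], e) for e in t[1]]                 # Question 4.3 (assumed)
not_self_loop = lambda t: t[0] != t[1]                           # Question 4.4 (assumed)

def proc_headers(lst):
    """Plain-Python mirror of the eight Spark steps in the question."""
    rdd = lst                                                    # 1. sc.parallelize(lst)
    rdd = [h for h in rdd if valid_email(h[0])]                  # 2. filter(): valid sender only
    rdd = [(f, concat_csv_strings(concat_csv_strings(to, cc), bcc))
           for f, to, cc, bcc in rdd]                            # 3. map(): join TO, CC, BCC
    rdd = [(f, gen_seq_from_csv_string(r)) for f, r in rdd]      # 4. map(): split into a sequence
    rdd = list(chain.from_iterable(val_by_vec(t) for t in rdd))  # 5. flatMap(): (FROM, E) pairs
    rdd = [t for t in rdd if valid_email(t[1])]                  # 6. filter(): valid recipients
    rdd = [t for t in rdd if not_self_loop(t)]                   # 7. filter(): drop self-loops
    return rdd                                                   # 8. collect()
```

With a live SparkContext sc, only the container changes, not the lambdas: the same logic chains as sc.parallelize(lst).filter(...).map(...).map(...).flatMap(val_by_vec).filter(...).filter(not_self_loop).collect().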
