Question

1 Approved Answer

Posted on Sep 10, 2024

Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function

image text in transcribed

Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function from Question 5 to identify N Email addresses, which are most popular in terms of the number of Emails that where sent to them. The output must be a list of N tuples ( n, E) where n is the number of Email transmissions having E as their recipient address. The list must be sorted in the descending lexicographical order, that is, (nl, E1) > (n2, E2) if and only if either nl > n2 or nl n2 and El > E2 Hint: Use a map/reduceByKey pattern as in the word count example to pair Email addresses with their popularity counts, the sortBy transformation to sort them in the descending lexicographical order, and the take( action to extract the top-N records Note that calling proc_headers) first is only needed to prevent any errors in the implementation of Question 5 from propagating to the solution of this question as this way, we will be able to use the model implementation of proc_headers() for testing. A more efficient solution would avoid materializing the results of proc_headers() in the driver, and instead directly extend the processing steps of Question 5 with further operations. Make sure you understand why it is important! (6): def get-top-emails (1st, N): lst: a list of tuples (FROM, To, CC, BCC) representing EMail headers N: a positive integer Returns a list of N tuples (n, E) sorted in the descending lexicographical order representing the top N most popular EMail destinations as described in the question rddsc.parallelize(proc_headers (lst)) # Insert your code after this line You can use the following code to test your impementation of get_top_emails): print( '.join (str (t) for t in get top_emails ([headerl, header2, header3], 3))) The output produced by the 1line above when executed with the model implementation of get top emails() was as follows: (3, tom.kearney@enron.com) (2, 'mike.mcconnel1@enron.com) (2, george.mcclellaneenron.com) Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function from Question 5 to identify N Email addresses, which are most popular in terms of the number of Emails that where sent to them. The output must be a list of N tuples ( n, E) where n is the number of Email transmissions having E as their recipient address. The list must be sorted in the descending lexicographical order, that is, (nl, E1) > (n2, E2) if and only if either nl > n2 or nl n2 and El > E2 Hint: Use a map/reduceByKey pattern as in the word count example to pair Email addresses with their popularity counts, the sortBy transformation to sort them in the descending lexicographical order, and the take( action to extract the top-N records Note that calling proc_headers) first is only needed to prevent any errors in the implementation of Question 5 from propagating to the solution of this question as this way, we will be able to use the model implementation of proc_headers() for testing. A more efficient solution would avoid materializing the results of proc_headers() in the driver, and instead directly extend the processing steps of Question 5 with further operations. Make sure you understand why it is important! (6): def get-top-emails (1st, N): lst: a list of tuples (FROM, To, CC, BCC) representing EMail headers N: a positive integer Returns a list of N tuples (n, E) sorted in the descending lexicographical order representing the top N most popular EMail destinations as described in the question rddsc.parallelize(proc_headers (lst)) # Insert your code after this line You can use the following code to test your impementation of get_top_emails): print( '.join (str (t) for t in get top_emails ([headerl, header2, header3], 3))) The output produced by the 1line above when executed with the model implementation of get top emails() was as follows: (3, tom.kearney@enron.com) (2, 'mike.mcconnel1@enron.com) (2, george.mcclellaneenron.com)