Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function

image text in transcribed

Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function from Question 5 to identify N Email addresses, which are most popular in terms of the number of Emails that where sent to them. The output must be a list of N tuples ( n, E) where n is the number of Email transmissions having E as their recipient address. The list must be sorted in the descending lexicographical order, that is, (nl, E1) > (n2, E2) if and only if either nl > n2 or nl n2 and El > E2 Hint: Use a map/reduceByKey pattern as in the word count example to pair Email addresses with their popularity counts, the sortBy transformation to sort them in the descending lexicographical order, and the take( action to extract the top-N records Note that calling proc_headers) first is only needed to prevent any errors in the implementation of Question 5 from propagating to the solution of this question as this way, we will be able to use the model implementation of proc_headers() for testing. A more efficient solution would avoid materializing the results of proc_headers() in the driver, and instead directly extend the processing steps of Question 5 with further operations. Make sure you understand why it is important! (6): def get-top-emails (1st, N): lst: a list of tuples (FROM, To, CC, BCC) representing EMail headers N: a positive integer Returns a list of N tuples (n, E) sorted in the descending lexicographical order representing the top N most popular EMail destinations as described in the question rddsc.parallelize(proc_headers (lst)) # Insert your code after this line You can use the following code to test your impementation of get_top_emails): print( '.join (str (t) for t in get top_emails ([headerl, header2, header3], 3))) The output produced by the 1line above when executed with the model implementation of get top emails() was as follows: (3, tom.kearney@enron.com) (2, 'mike.mcconnel1@enron.com) (2, george.mcclellaneenron.com) Question 6: Spark (10%) Write a function proc headers (1st, N) that applies further transformations to the dataset returned by the proc_ headers (lst) function from Question 5 to identify N Email addresses, which are most popular in terms of the number of Emails that where sent to them. The output must be a list of N tuples ( n, E) where n is the number of Email transmissions having E as their recipient address. The list must be sorted in the descending lexicographical order, that is, (nl, E1) > (n2, E2) if and only if either nl > n2 or nl n2 and El > E2 Hint: Use a map/reduceByKey pattern as in the word count example to pair Email addresses with their popularity counts, the sortBy transformation to sort them in the descending lexicographical order, and the take( action to extract the top-N records Note that calling proc_headers) first is only needed to prevent any errors in the implementation of Question 5 from propagating to the solution of this question as this way, we will be able to use the model implementation of proc_headers() for testing. A more efficient solution would avoid materializing the results of proc_headers() in the driver, and instead directly extend the processing steps of Question 5 with further operations. Make sure you understand why it is important! (6): def get-top-emails (1st, N): lst: a list of tuples (FROM, To, CC, BCC) representing EMail headers N: a positive integer Returns a list of N tuples (n, E) sorted in the descending lexicographical order representing the top N most popular EMail destinations as described in the question rddsc.parallelize(proc_headers (lst)) # Insert your code after this line You can use the following code to test your impementation of get_top_emails): print( '.join (str (t) for t in get top_emails ([headerl, header2, header3], 3))) The output produced by the 1line above when executed with the model implementation of get top emails() was as follows: (3, tom.kearney@enron.com) (2, 'mike.mcconnel1@enron.com) (2, george.mcclellaneenron.com)

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Guide To Client Server Databases

Authors: Joe Salemi

2nd Edition

1562763105, 978-1562763107

More Books

Students also viewed these Databases questions

Question

2. Which symptoms of ASPD did Bill have?

Answered: 1 week ago

Question

2. What is the impact of information systems on organizations?

Answered: 1 week ago

Question

Evaluate the impact of technology on HR employee services.

Answered: 1 week ago