Question
Question 1 -- Part-of-Speech Count

Task: Create an RDD pipeline to show the count of each part-of-speech tag, sorted in descending order.
Complete the implementation of the pos_counts() function below so that it uses an RDD pipeline (i.e., a sequence of transformations) to:

1. filter out blank lines
2. filter out lines starting with "URL"
3. create a single list (using flatMap) that applies the pos_tag_counter() function (defined for you below) to each line
4. map each resulting element to show the part of speech (which is the second element returned from pos_tag_counter)
5. convert each resulting element to a PairRDD with POS tags as keys and values of 1
6. reduce the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
7. sort the resulting list by the counts, in descending order

[ ]: # This is the function you will use with flatMap in your pipeline.
     import re
     import nltk

     TOKEN_RE = re.compile(r"\b[\w']+\b")

     def pos_tag_counter(line):
         toks = nltk.regexp_tokenize(line, TOKEN_RE)
         postoks = nltk.tag.pos_tag(toks)
         return postoks

[ ]: def pos_counts(rdd):
         # YOUR CODE HERE
         raise NotImplementedError()
         return pos_total_sorted
         # This should be the final stage of your pipeline, an RDD with the
         # count of each part-of-speech tag sorted in descending order.

Let's start by trying your code on a small data set. The text_to_be_analyzed from the cells above will do nicely. We can use the parallelize() method to turn it into an RDD, pass that to your function, and then take() the first ten entries:

[ ]: small_text = sc.parallelize(text_to_be_analyzed.split("\n"))
     small_pos_counts = pos_counts(small_text)

[ ]: small_pos_counts_take_10 = small_pos_counts.take(10)
     small_pos_counts_take_10

[ ]: # Autograder cell. This cell is worth 2 points (out of 20). This cell does not contain hidden tests.
     # This cell deliberately includes answers to provide guidance on how this question is graded.
     correct = AutograderHelper.parse_spark_take([
         ('NN', 30),
         ('NNP', 24),
         ('IN', 20),
         ('DT', 16),
         ('VBD', 11),
         ...

Question 2 -- Noun Phrase Length

Task: Create an RDD pipeline to show the distribution of the length of noun phrases.

Complete the implementation of the noun_phrase_length_distribution() function below so that it uses an RDD pipeline to return a PairRDD which contains the distribution of the length of noun phrases. The function get_noun_phrases() is defined for you below. You can use flatMap() to apply get_noun_phrases() to each entry in the input RDD. Sorting the resulting list by the counts in descending order will make the results easier to interpret.

Note that the filters from the previous question (Part-of-Speech Count) are not needed here. That is: you don't need to remove blank lines or lines that start with "URL".

[ ]: # This cell defines the get_noun_phrases() function you will use with flatMap().
     grammar = r"""
         NBAR:
             {<NN.*>*<NN.*>}     # sequences of nouns

         NP:
             {<NBAR>}
             {<NBAR><IN><NBAR>}  # NBARs connected with in/of/etc.
     """

     def get_noun_phrases(line):
         """
         This function returns a list of lists of tuples. Each entry (list of
         tuples) is a breakdown of a noun phrase, and each tuple contains the
         word and a code for the noun phrase part.

         For example, get_noun_phrases("The quick brown fox, jumps over the lazy dog.")
         returns:

             [
                 [('brown', 'NN'), ('fox', 'NN')],
                 [('dog', 'NN')]
             ]
         """
         TOKEN_RE = re.compile(r"\b[\w']+\b")
         chunker = nltk.RegexpParser(grammar)
         toks = nltk.regexp_tokenize(line, TOKEN_RE)
         postoks = nltk.tag.pos_tag(toks)
         if len(postoks) == 0:
             return []
         tree = chunker.parse(postoks)
         return [term for term in leaves(tree)]

     def leaves(tree):
         for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
             yield subtree.leaves()
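To see how noun-phrase lengths fall out of this helper before wiring it into Spark, here is a minimal plain-Python sketch; the sample sentence and the commented results assume get_noun_phrases() behaves as its docstring example shows:

    phrases = get_noun_phrases("The quick brown fox, jumps over the lazy dog.")
    # phrases should look like [[('brown', 'NN'), ('fox', 'NN')], [('dog', 'NN')]]

    lengths = [len(phrase) for phrase in phrases]
    # lengths -> [2, 1]: one 2-word noun phrase and one 1-word noun phrase,
    # which is exactly the (length, count) information the pipeline aggregates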
[ ]: def noun_phrase_length_distribution(rdd):
         # YOUR CODE HERE
         raise NotImplementedError()
         return distribution
         # This should be the final stage of your pipeline, a PairRDD with the
         # distribution of the length of noun phrases.

[ ]: small_counts = noun_phrase_length_distribution(small_text)

[ ]: small_counts_take_10 = small_counts.take(10)
     small_counts_take_10

The cell above should produce this output:

    [(1, 29), (2, 10), (3, 3), (4, 2)]

This means there are 29 1-word noun phrases, 10 2-word noun phrases, 3 3-word noun phrases, and 2 4-word noun phrases in the small_text data set.

[ ]: # Autograder cell. This cell is worth 2 points (out of 20). This cell does not contain hidden tests.
     # This cell deliberately includes answers to provide guidance on how this question is graded.
     correct = AutograderHelper.parse_spark_take(
         [(1, 29), (2, 10), (3, 3), (4, 2)]
     )
     AutograderHelper.assert_same_shape(
         correct=correct,
         submitted=AutograderHelper.parse_spark_take(small_counts_take_10),
     )

Now let's run it against the larger data set. The complete analysis could take about 10 minutes to run.

[ ]: text = sc.textFile('../../assets/data/nytimes/nytimes_news_articles.txt')
     counts = noun_phrase_length_distribution(text)

[ ]: counts_take_10 = counts.take(10)
     counts_take_10

[ ]: assert counts_take_10[0] == (1, 1205976), \
         "The first item in the result is not correct."

[ ]: # Autograder cell. This cell is worth 8 points (out of 20). This cell contains hidden tests.
Step by Step Solution
There are 3 steps involved.
Step: 1
To complete the implementation of the pos_counts() function and the noun_phrase_length_distribution() function using RDD pipelines in PySpark, you can follow the numbered transformations listed in each question, chaining them in order and returning the final RDD.
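Below is a minimal sketch of pos_counts() that follows the seven numbered steps from Question 1. It assumes the pos_tag_counter() helper defined in the question is in scope; the lambda parameter names are illustrative:

    def pos_counts(rdd):
        pos_total_sorted = (
            rdd
            .filter(lambda line: len(line.strip()) > 0)       # 1. drop blank lines
            .filter(lambda line: not line.startswith("URL"))  # 2. drop lines starting with "URL"
            .flatMap(pos_tag_counter)                         # 3. one (word, tag) tuple per token
            .map(lambda word_tag: word_tag[1])                # 4. keep only the POS tag
            .map(lambda tag: (tag, 1))                        # 5. PairRDD of (tag, 1)
            .reduceByKey(lambda a, b: a + b)                  # 6. sum the 1s for each tag
            .sortBy(lambda pair: pair[1], ascending=False)    # 7. sort by count, descending
        )
        return pos_total_sorted

Because every transformation here is lazy, nothing is actually computed until an action such as take() is called on the returned RDD.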
Step: 2
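For noun_phrase_length_distribution(), the same pattern applies, minus the blank-line and "URL" filters that Question 2 says are not needed. A sketch, assuming the get_noun_phrases() helper defined in the question:

    def noun_phrase_length_distribution(rdd):
        distribution = (
            rdd
            .flatMap(get_noun_phrases)                      # one element per noun phrase
            .map(lambda phrase: (len(phrase), 1))           # key each phrase by its word count
            .reduceByKey(lambda a, b: a + b)                # count phrases of each length
            .sortBy(lambda pair: pair[1], ascending=False)  # most common lengths first
        )
        return distribution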
Step: 3
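Finally, both functions can be checked against the small data set the same way the notebook's test cells do. A sketch, assuming a live SparkContext sc and the multi-line text_to_be_analyzed string from earlier cells of the assignment:

    small_text = sc.parallelize(text_to_be_analyzed.split("\n"))

    small_pos_counts = pos_counts(small_text)
    print(small_pos_counts.take(10))
    # expected to start with [('NN', 30), ('NNP', 24), ('IN', 20), ('DT', 16), ...]

    small_counts = noun_phrase_length_distribution(small_text)
    print(small_counts.take(10))
    # expected: [(1, 29), (2, 10), (3, 3), (4, 2)]

If these match the answers published in the autograder cells, the same pipelines can then be pointed at the full nytimes_news_articles.txt file via sc.textFile().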