Question

1 Approved Answer

Posted on Sep 02, 2023

Determine how many users have received more than 5000 "cool" compliments

Determine how many users have received more than 5000 "cool" compliments. Create a variable user_count (an integer) which contains the number of users with more than 5000 "cool" compliments (using the compliment_ field.)

In [ ]:
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:
    assert type(user_count) == int, "The user_count variable should be an integer."

In [ ]:
    # Autograder cell. This cell is worth 2 points (out of 20). This cell contains hidden tests.

Question 2 -- Useful Positive Reviews

Determine the top 5 most useful positive reviews. Create a variable top_5_useful_positive. This should be a PySpark DataFrame.

- For this question a "positive review" is one with 4 or 5 stars.
- The DataFrame should be ordered by useful and contain 5 rows.
- The DataFrame should have these columns (in this order):
    - review_id
    - useful
    - stars

In [ ]:
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:
    import pyspark

    assert type(top_5_useful_positive) == pyspark.sql.dataframe.DataFrame, \
        "The top_useful_positive variable should be a Spark DataFrame."
    assert top_5_useful_positive.columns == ['review_id', 'useful', 'stars'], \
        "The columns are not in the correct order."
    submitted = AutograderHelper.parse_spark_dataframe(top_5_useful_positive)

In [ ]:
    # Autograder cell. This cell is worth 1 point (out of 20). This cell does not contain hidden tests.
    # This cell deliberately includes answers to provide guidance on how this question is graded.
    assert len(submitted) == 5, \
        "The result must have 5 rows."
    top_useful_review_id = "11GX1yq4MALOMx17vpBcOQ"
    assert submitted["review_id"][0] == top_useful_review_id, \
        f'The first row should have review_id "{top_useful_review_id}" (this review has the most "useful" ratings)'

In [ ]:
    # Autograder cell. This cell is worth 4 points (out of 20). This cell contains hidden tests.

Question 3 -- Checkins

Determine what hours of the day most checkins occur. Create a variable hours_by_checkin_count. This should be a PySpark DataFrame.

- The DataFrame should be ordered by count and contain 24 rows.
- The DataFrame should have these columns (in this order):
    - hour (the hour of the day as an integer, the hour after midnight being 0)
    - count (the number of checkins that occurred in that hour)

Note that the date column in the checkin data is a string with multiple date times in it. You'll need to split that string before parsing.

In [ ]:
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:
    assert type(hours_by_checkin_count) == pyspark.sql.dataframe.DataFrame, \
        "The hours_by_checkin_count variable should be a Spark DataFrame."
    assert hours_by_checkin_count.columns == ["hour", "count"], \
        "The columns are not in the correct order."
    submitted = AutograderHelper.parse_spark_dataframe(hours_by_checkin_count)

In [ ]:
    # Autograder cell. This cell is worth 1 point (out of 20). This cell does not contain hidden tests.
    assert len(submitted) == 24, \
        "The hours_by_checkin_count DataFrame must have 24 rows."
    assert submitted["hour"][0] == 1, \
        'The first row should have hour 1'

In [ ]:
    # Autograder cell. This cell is worth 4 points (out of 20). This cell contains hidden tests.

Question 4 -- Common Words in Useful Reviews

Write a function that takes a Spark DataFrame as a parameter and returns a Spark DataFrame of the 50 most common words from useful reviews and their counts.

- A "useful review" has 10 or more "useful" ratings.
- Convert the text to lower case.
- Use the provided splitter() function in a UDF to split the text into individual words.
- Exclude the words in the provided STOP_WORDS set.
- Returned DataFrame should have these columns (in this order):
    - word
    - count
- Returned DataFrame should be sorted by count in descending order.

In [ ]:
    import re

    def splitter(text):
        WORD_RE = re.compile(r"[\w']+")
        return WORD_RE.findall(text)

    STOP_WORDS = {
        "a", "about", "above", "after", "again", "against", "aint", "all", "also", "although",
        "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being",
        "below", "between", "both", "but", "by", "can", "check", "checked", "could", "did", "do",
        "does", "doing", "don", "down", "during", "each", "few", "for", "from", "further", "get",
        "go", "got", "had", "has", "have", "having", "he", "her", "here", "hers", "herself", "him",
        "himself", "his", "how", "however", "i", "i'd", "if", "i'm", "in", "into", "is", "it",
        "its", "it's", "itself", "i've", "just", "me", "more", "most", "my", "myself", "no", "nor",
        "not", "now", "of", "off", "on", "once", "one", "online", "only", "or", "other", "our",
        "ours", "ourselves", "out", "over", "own", "paid", "place", "s", "said", "same", "service",
        "she", "should", "so", "some", "such", "t", "than", "that", "the", "their", "theirs",
        "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to",
        "too", "under", "until", "up", "us", "very", "was", "we", "went", "were", "we've", "what",
        "when", "where", "which", "while", "who", "whom", "why", "will", "with", "would", "you",
        "your", "yours", "yourself", "yourselves",
    }

    def common_useful_words(reviews, limit=50):
        # YOUR CODE HERE
        raise NotImplementedError()
        return most_common

Now we'll run it on the review DataFrame.

In [ ]:
    common_useful_words_counts = common_useful_words(review)

In [ ]:
    assert type(common_useful_words_counts) == pyspark.sql.dataframe.DataFrame, \
        "The common_useful_words_counts variable should be a Spark DataFrame."
    assert common_useful_words_counts.columns == ["word", "count"], \
        "The columns are not in the correct order."
    submitted = AutograderHelper.parse_spark_dataframe(common_useful_words_counts)

In [ ]:
    # Autograder cell. This cell is worth 2 points (out of 20). This cell does not contain hidden tests.
    assert len(submitted) == 50, \
        "The common_useful_words_counts DataFrame must have 50 rows."
    assert submitted["word"][0] == 'like', \
        'The first row should have word "like"'
    assert submitted["count"][0] == 101251, \
        'The first row should have count 101251'

In [ ]:
    # Autograder cell. This cell is worth 6 points (out of 20). This cell contains hidden tests.
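For Question 4, one possible shape of the function is sketched below. It is only a sketch under assumptions: the review DataFrame is assumed to have `text` and `useful` columns, the helper `useful_tokens` is a hypothetical name introduced here, and `STOP_WORDS` is cut down to a handful of entries so the block stays short (the full set from the notebook should be used in practice). The lower-casing, splitting, and stop-word filtering are all done inside one UDF, which is one way of satisfying the "use splitter() in a UDF" requirement, not necessarily the intended one.

```python
import re

WORD_RE = re.compile(r"[\w']+")


def splitter(text):
    # Same behaviour as the splitter() provided in the notebook.
    return WORD_RE.findall(text)


# Shortened stand-in for the notebook's STOP_WORDS set (assumption: replace
# this with the full set given in the question).
STOP_WORDS = {"the", "was", "and", "i'd", "go", "again", "a", "i", "it", "to"}


def useful_tokens(text):
    """Lower-case the text, split it with splitter(), and drop stop words."""
    if text is None:
        return []
    return [word for word in splitter(text.lower()) if word not in STOP_WORDS]


def common_useful_words(reviews, limit=50):
    # pyspark is imported here so that useful_tokens() can be tried on plain
    # strings without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    tokens_udf = F.udf(useful_tokens, ArrayType(StringType()))
    most_common = (
        reviews
        .filter(F.col("useful") >= 10)  # "useful review": 10 or more ratings
        .select(F.explode(tokens_udf(F.col("text"))).alias("word"))
        .groupBy("word")
        .count()  # produces the columns "word" and "count"
        .orderBy(F.col("count").desc())
        .limit(limit)
    )
    return most_common
```

The token helper can be checked on its own, for example `useful_tokens("The food was GREAT, and I'd go again!")` returns `["food", "great"]` with the shortened stop-word set above.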

Step by Step Solution

3.38 Rating (148 Votes)

There are 3 Steps involved in it

Step: 1

Here is the code to determine how many users have received more than 5000 "cool" compliments:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Cool Compliments").getOrCreate()
    # Load ...
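A minimal sketch of the counting step itself, assuming the user data is already loaded into a PySpark DataFrame. The column name `compliment_cool` is an assumption (the question only says "the compliment_ field"), and `count_cool_users` is a hypothetical helper name.

```python
def count_cool_users(user_df, threshold=5000):
    """Count the users whose "cool" compliment total exceeds the threshold."""
    # "compliment_cool" is an assumed column name; adjust it to the real field.
    # DataFrame.count() returns a plain Python int, which is what the
    # user_count type check in the notebook expects.
    return user_df.filter(user_df["compliment_cool"] > threshold).count()
```

With a loaded DataFrame named `user` (also an assumption), the required variable would be `user_count = count_cool_users(user)`.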

Step: 2
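
For Question 2, a sketch under stated assumptions: the review DataFrame is assumed to have `review_id`, `useful`, and `stars` columns, "ordered by useful" is read as most useful first, and `top_useful_positive` is a hypothetical helper name.

```python
def top_useful_positive(reviews, n=5):
    """Return the n positive reviews (4 or 5 stars) with the most "useful" ratings."""
    return (
        reviews
        .filter(reviews["stars"] >= 4)           # keep 4- and 5-star reviews
        .select("review_id", "useful", "stars")  # column order the checks expect
        .orderBy("useful", ascending=False)      # most useful first (assumed direction)
        .limit(n)
    )
```

With the review DataFrame named `review`, as in the notebook's later cells, the required variable would be `top_5_useful_positive = top_useful_positive(review)`.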

Step: 3
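
For Question 3, the note about splitting the date string can be sketched as follows. The assumptions are that each `date` value is a comma-separated list of timestamps in `YYYY-MM-DD HH:MM:SS` form, that "ordered by count" means busiest hour first, and that the DataFrame is called `checkin`. The names `extract_hours` and `count_checkins_by_hour` are hypothetical. A Python UDF is used here for clarity. Spark's built-in string and timestamp functions would be an alternative.

```python
from datetime import datetime


def extract_hours(date_string):
    """Return the hour (0-23) of every timestamp in a comma-separated string."""
    if not date_string:
        return []
    hours = []
    for stamp in date_string.split(","):
        stamp = stamp.strip()
        if stamp:
            # Assumed timestamp layout: "2016-04-26 19:49:16"
            hours.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour)
    return hours


def count_checkins_by_hour(checkin):
    # pyspark is imported here so that extract_hours() can be tried on plain
    # strings without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    hours_udf = F.udf(extract_hours, ArrayType(IntegerType()))
    return (
        checkin
        .select(F.explode(hours_udf(F.col("date"))).alias("hour"))  # one row per checkin
        .groupBy("hour")
        .count()  # produces the columns "hour" and "count"
        .orderBy(F.col("count").desc())
    )
```

The required variable would then be `hours_by_checkin_count = count_checkins_by_hour(checkin)`. The parsing helper can be tried on its own first, for example `extract_hours("2016-04-26 19:49:16, 2016-08-30 01:36:57")` returns `[19, 1]`.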
