Question

You are doing initial exploratory analysis in PySpark and one of the sources you need to include resides in a PostgreSQL database. Using the provided notebook, answer the following questions within Google Colaboratory and submit your answers on SUNLearn and Git.

1. Using a PySpark DataFrame, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection. (3)
2. Use the Spark SQL API to query the customer table, compute the number of unique email addresses in that table and print the result in the notebook. (3)
3. Repeat this calculation using only the DataFrame API and print the result. (1)
4. How many partitions are present in the DataFrame resulting from Question 6.3? Additionally, provide the code necessary to determine that. (1)
5. Compute the min and max of customer.create_date and print the result (once more using the Spark DataFrame API and not the Spark SQL API). (1)
6. Determine which first names occur more than once
   1. using the Spark SQL API (printing the result), and (1)
   2. using the Spark DataFrame API (printing the result once more). (1)
7. Port the PostgreSQL query below to the PySpark DataFrame API and execute the query within Spark (not directly on PostgreSQL): (5)

    SELECT
        staff.first_name
        ,staff.last_name
        ,SUM(payment.amount)
    FROM payment
    INNER JOIN staff ON payment.staff_id = staff.staff_id
    WHERE payment.payment_date BETWEEN '2007-01-01' AND '2020-02-01'
    GROUP BY
        staff.last_name
        ,staff.first_name

Step by Step Solution

Step: 1

Using a PySpark DataFrame, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection. In Python, import the necessary libraries (from pyspark.sql import SparkSession). Cre...
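The remainder of this step is not available in the source, so what follows is only a minimal sketch of how a JDBC read for Question 1 might look, not the original answer. The host, port, user name, password, application name, and the driver version in `spark.jars.packages` are all assumptions to be replaced with the values from the provided notebook. The function names `pagila_jdbc_options` and `print_customer_schema` are hypothetical helpers introduced here for illustration.

```python
def pagila_jdbc_options(table, host="localhost", port=5432,
                        database="pagila", user="postgres",
                        password="postgres"):
    """Build the option map for Spark's JDBC data source.

    Every default here (host, port, user, password) is an assumption;
    use the connection details from your own notebook instead.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def print_customer_schema():
    """Read the customer table over JDBC and print its schema."""
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pagila-exploration")  # assumed application name
        # Assumed driver coordinates; a driver jar already on the
        # classpath would make this setting unnecessary.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )

    # The JDBC source only fetches metadata here; rows are read lazily.
    customer_df = (
        spark.read.format("jdbc")
        .options(**pagila_jdbc_options("customer"))
        .load()
    )
    customer_df.printSchema()
    return customer_df
```

Calling `print_customer_schema()` in a notebook cell should print the column names and types that Spark infers from PostgreSQL. The returned DataFrame can then be reused for the later questions, for example by registering it with `createOrReplaceTempView("customer")` before issuing Spark SQL queries.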