
Question


You are required to answer five questions using Spark. For each question you will need to draft code which uses the appropriate transformations and actions. You are free to use RDDs and/or DataFrames to answer the questions.


Our main input file for this assessment is called CreditCard.csv, which contains customer information collected from within a consumer credit card portfolio. There is a second file provided, called DataDescription.txt, which describes the columns in the main dataset.


To prepare the data, first make a directory for this assessment called test3 within the /home/prac/ directory, as we normally do in practicals.

Then copy the data file into your /home/prac/test3/input folder.

Lastly, make the /home/prac/test3/src directory for your Python code and the /home/prac/test3/output directory for writing the result file. The result file is expected to be generated in the location specified in the template file (/home/prac/test3/output/result.txt). From here you should be able to follow the directions in Practicals 4 and 5 to write and run your PySpark programs.
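The directory layout described above could be created from Python as well as from the shell. The following is only a minimal sketch: the helper name make_layout is hypothetical, and copying CreditCard.csv into the input folder is still left to you.

```python
from pathlib import Path


def make_layout(base):
    """Create the input, src and output folders under `base`.

    `make_layout` is a hypothetical helper, not part of the assessment template.
    """
    base_dir = Path(base)
    created = []
    for name in ("input", "src", "output"):
        sub = base_dir / name
        # parents=True also creates `base` itself; exist_ok avoids errors on reruns.
        sub.mkdir(parents=True, exist_ok=True)
        created.append(sub)
    return created
```

Calling make_layout("/home/prac/test3") would produce the three folders named in the brief.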


The program will be run using the following command when marking:

$ spark-submit /home/prac/test3/src/test3_solutions.py

The command should write the output of your program to the /home/prac/test3/output directory as specified in the template file. Your program will be marked by comparing its result to the correct answer. Rounding the result is not required, but you will not lose marks if you do so.
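Writing the result file might look like the sketch below. The template file is not shown here, so the one-line-per-answer layout and the helper name write_result are assumptions; follow the template's actual format if it differs.

```python
from pathlib import Path


def write_result(lines, path):
    """Write each answer on its own line of a plain-text file.

    Assumption: one line per answer; the real template may prescribe another layout.
    """
    out = Path(path)
    # Make sure the output directory exists before writing.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(str(line) for line in lines) + "\n", encoding="utf-8")
    return out
```

With the paths from the brief, the call would be write_result(answers, "/home/prac/test3/output/result.txt"), where answers is whatever list of strings your program has collected.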


Q1. Load the data, convert to DataFrame and apply appropriate column names and variable types.


Q2. Determine what proportion of all customers falls into each education level in the dataset, e.g. Uneducated = x%, High School = y%, etc.

This question uses the Education_Level field.


Q3. Determine which income category has the lowest total credit limit.

This question uses the Income_Category and Credit_Limit fields.


Q4. For each card category, determine the average transaction amount.

(Tip: use total transaction amount / total transaction count to calculate the average.)

This question uses the Card_Category, Total_Trans_Amt and Total_Trans_Ct fields.


Q5. What is the longest months-on-book value among male customers who are 40 years of age or above?

This question uses the Customer_Age, Gender and Months_on_book fields.

Step by Step Solution


Step: 1

Here's a draft of the PySpark code to answer the five questions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum, avg, w...
