Question
You are required to answer five questions using Spark. For each question you will
need to write code that uses the appropriate transformations and actions. You are free to use RDDs
and/or DataFrames to answer the questions.
Our main input file for this assessment is called CreditCard.csv, which contains customer information
collected from within a consumer credit card portfolio. There is a second file provided, called
DataDescription.txt, which describes the columns in the main dataset.
To prepare the data, first make a directory for this assessment called test3 within the /home/prac/
directory, as we normally do in practicals.
Then, copy the data file into your /home/prac/test3/input folder.
Lastly, make the /home/prac/test3/src directory for your Python code and /home/prac/test3/output
directory for writing the result file. The result file is expected to be generated in the directory specified
in the template file (/home/prac/test3/output/result.txt). From here you should be able to follow the
directions in Practical 4&5 to write and run your PySpark programs.
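The directory layout described above can be sketched in Python as follows. This is only an illustration: `prepare_workspace` is a hypothetical helper name, and the same result can be had with `mkdir -p` and `cp` in the shell.

```python
import shutil
from pathlib import Path


def prepare_workspace(base, data_files=()):
    """Create input/, src/ and output/ under base and copy data files into input/.

    `base` would be /home/prac/test3 for this assessment; it is a parameter
    here so the sketch can be tried in any directory.
    """
    base = Path(base)
    for sub in ("input", "src", "output"):
        # exist_ok=True makes the call safe to repeat
        (base / sub).mkdir(parents=True, exist_ok=True)
    for data_file in data_files:
        shutil.copy(data_file, base / "input")
    return base
```

For example, `prepare_workspace("/home/prac/test3", ["CreditCard.csv"])` would create the three folders and place the data file in the input folder, assuming CreditCard.csv is in the current directory.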
The program will be run using the following command when marking:
$ spark-submit /home/prac/test3/src/test3_solutions.py
The command should write the output of your program in the /home/prac/test3/output directory as
specified in the template file. Your program will be marked by comparing the result from your program
to the correct answer. Rounding the result is not required, but you will not lose marks if you do so.
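One way to write the answers to the result file is sketched below. The exact layout of result.txt is set by the template file, which is not shown here, so the "label: value" format and the `write_results` name are assumptions for illustration only.

```python
from pathlib import Path


def write_results(answers, path):
    """Write (label, value) pairs to a text file, one pair per line.

    Assumption: one "label: value" line per answer. Adjust to whatever
    the template file actually specifies.
    """
    out = Path(path)
    # Make sure the output directory exists before opening the file
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as handle:
        for label, value in answers:
            handle.write(f"{label}: {value}\n")
```

In the assessment script this would be called once at the end, with `path` set to /home/prac/test3/output/result.txt.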
Q1. Load the data, convert to DataFrame and apply appropriate column names and variable types.
Q2. Determine what proportion of all customers is attributed to each education level in the dataset, e.g.
Uneducated = x%, High School = y%, etc.
This question uses the Education_Level field.
Q3. Determine which income category has the lowest total credit limit.
This question uses the Income_Category and Credit_Limit fields.
Q4. For each card category, determine the average transaction amount.
(Tip: use total transaction amount / total transaction count to calculate the average.)
This question uses the Card_Category, Total_Trans_Amt and Total_Trans_Ct fields.
Q5. What is the longest months on book among male customers who are 40 years of age or older?
This question uses the Customer_Age, Gender and Months_on_book fields.
Step by Step Solution
Step: 1
Here's a draft of the PySpark code to answer the five questions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum, avg, w...