Question
Using Spark
1. Read the dataset using sqlContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark_df = sqlContext.sql("Select * from Washington_State_HDMA_2016_csv")
2. Compute how many floating and string variables this dataset has
num_float = #
3. Create a new column named denied
Assume that if denial_reason_name_1 column is not null, then the loan application is rejected/denied
Create a new column in the dataset and name it denied
Encode the denied column as 0 if denial_reason_name_1 is null, otherwise encode the denied column as 1
4. Find the percentage of denied loans
Use the new variable named denied in this analysis
What percentage of loans are denied?
Google the average loan application denial rate in the country. Is this number similar to the US average?
5. Compare the income of approved applicants vs rejected applicants
Use applicant_income_000s variable
Calculate the average income for denied = 1 and denied = 0 applicants (you can use groupBy())
What do you think (e.g., do approved applicants make more money)?
If not, this is against our intuition. Why do you think denied applicants make more money?
6. Relationship between sex and application status
Investigate if female applicants have higher rejection rate as compared to male applicants
Find the rejection rate for males and females.
For simplicity, define the rejection rate as the number of denied applicants (denied = 1) / the number of approved applicants (denied = 0)
Use applicant_sex_name to determine the sex of the applicant
Any comments?
7. Relationship between race and application status
Investigate the relationship between the applicant's race and the loan status.
You can use the denied column you have created and applicant_race_name_1 column
For each race, find the ratio of denied loans
Consider the ratio of denied loans as the number of denied applicants (denied = 1) / the number of approved applicants (denied = 0)
What are your comments? Which race has the highest denied ratio?
8. Check loan_to_income_ratio
Let's dig a little deeper
Let's create a new variable by dividing loan_amount_000s by applicant_income_000s
Name this variable loan_to_income_ratio
Let's check if the denied loans are the ones with high loan_to_income_ratio.
What are your thoughts?
Hint: logically, we expect denied loans to have a higher loan_to_income_ratio. Is this the case? Include the race variable in the analysis. What do you think about the relationship among the applicant_race_name_1, loan_to_income_ratio, and denied variables?
9. What is the most common denial reason?
Use the denial_reason_name_1 variable
Google the most common mortgage denial reasons. Did you get similar results?
10. Give at least 3 more insights
Give us more insights. Use your intuition and do some more analysis to give us more insights about the dataset.
Feel free to experiment
You can use python visualization tools