
Question



Task 1:
Import the raw data (CC_Default.csv) into your Jupyter notebook.
1.1 Check if the data is loaded correctly by printing a few observations. Check the total number of observations and variables.
1.2 Provide the descriptive statistics and manipulate data.
a. Check for missing values if any.
b. Plot the univariate distribution.
c. Convert the relevant variables, such as the payment variables (Pay0-Pay6) and the customer-related variables, to categorical variables as appropriate.
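As a rough illustration of Tasks 1.1 and 1.2, the sketch below loads a tiny inline stand-in for CC_Default.csv, prints a few rows and the shape, counts missing values, and converts some columns to categorical. The sample values and the choice of categorical columns are assumptions made for the example; with the real file, the `io.StringIO` stand-in would be replaced by the file path.

```python
import io
import pandas as pd

# Tiny stand-in for CC_Default.csv (hypothetical values, real column names).
csv_text = """ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,default.payment.next.month
1,20000,2,2,1,24,2,2,1
2,120000,2,2,2,26,-1,2,1
3,90000,2,2,2,34,0,0,0
4,50000,1,1,1,37,0,0,0
"""
df = pd.read_csv(io.StringIO(csv_text))  # real data: pd.read_csv("CC_Default.csv")

print(df.head())   # a few observations
print(df.shape)    # (number of observations, number of variables)

missing = df.isnull().sum()  # missing values per column
print(missing)

# Assumed set of categorical columns; on the full data, add PAY_3 to PAY_6.
cat_cols = ["SEX", "EDUCATION", "MARRIAGE", "PAY_0", "PAY_2"]
df = df.astype({col: "category" for col in cat_cols})
print(df.dtypes)
```

Univariate distributions could then be plotted per column, for example with `df["AGE"].plot(kind="hist")`, which requires matplotlib.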
1.3 Find the variables that are correlated and the variables that might help in finding the defaulters next month using a few plots. The plots should provide insights on the following:
a. The independent variable that should help identify those who will default on next month's credit card payment
b. The relation between dependent and independent variables
c. The correlations among the variables, etc.
1.4 Provide your insights into the variables and their relationship based on your analysis in Task 1.3 in a markdown cell in your Jupyter notebook.
Task 2:
Import Train.csv into your Jupyter notebook.
2.1 Check the total number of observations and print a few records. Note that the same variable conversion applied to the raw data in Task 1.2 should be applied here.
Hint: Convert the relevant variables such as payment variables, Pay0-Pay6, and customer related variables (demographic) to categorical variables as appropriate.
2.2 Fit a logistic regression after making the dataset balanced.
Hint: Use class weight parameter.
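One way to read the class weight hint, sketched on synthetic data rather than Train.csv: scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which weights each class by `n_samples / (n_classes * class_count)`. The data, the feature count and `max_iter` below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for Train.csv (roughly 20% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 1.2).astype(int)

# class_weight="balanced" up-weights the minority class so that both
# classes contribute comparably to the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

pred = clf.predict(X)
print("share of positives in y:   ", y.mean())
print("share of positives in pred:", pred.mean())
```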
2.3 Remove the variable(s) that would cause multicollinearity. Explicitly state the variable(s) that you are dropping in a markdown cell in your Jupyter notebook.
Hint: To remove a variable, use the drop function.
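A common heuristic for spotting multicollinearity is to look for pairs of variables with a very high absolute correlation and drop one of each pair. The sketch below does this on synthetic columns; the column values, the 0.9 threshold and the decision to drop the later column of each pair are assumptions, not part of the task.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
# Synthetic stand-in: BILL_AMT2 is built to track BILL_AMT1 almost exactly.
df = pd.DataFrame({
    "BILL_AMT1": base,
    "BILL_AMT2": base + rng.normal(scale=0.05, size=200),
    "AGE": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, not specified by the task
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
print("kept:   ", list(reduced.columns))
```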
Import Test.csv into your Jupyter notebook.
2.4 Test the model on the test dataset. Note that the same variable conversion applied to the raw data in Task 1.2 should be applied here.
Hint: Convert the relevant variables such as payment variables, Pay0-Pay6, and customer related variables (demographic) to categorical variables as appropriate.
2.5 Plot the confusion matrix.
2.6 Provide your insights on accuracy, precision and F1 Score in a markdown cell in your Jupyter notebook.
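For reference, accuracy, precision, recall and F1 can all be derived from the four cells of a binary confusion matrix. The helper below is a minimal sketch; the function name and the counts passed to it are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=20, fn=10, tn=130)
print(metrics)
```

On imbalanced data such as credit card defaults, accuracy alone can look good even when few defaulters are caught, which is why precision and F1 are reported alongside it.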
Task 3:
3.1 Fit a random forest model on Train.csv with a random state of 1,500 epochs, a maximum depth of 3 and a maximum feature of 3.
3.2 Evaluate the confusion matrix, F1 scores and accuracy. Compare the random forest model with the logistic regression from Task 2. State your observations in a markdown cell in your Jupyter notebook.
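A possible reading of the random forest settings, sketched on synthetic data: the wording "random state of 1,500 epochs" is ambiguous, and the code below assumes it means `random_state=1` with 500 trees (`n_estimators=500`), since random forests have no epochs. That interpretation, and the synthetic data, are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Synthetic stand-in for Train.csv / Test.csv.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y = (X[:, 0] - X[:, 1] > 0.8).astype(int)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

rf = RandomForestClassifier(
    n_estimators=500,  # assumption: "500 epochs" read as 500 trees
    random_state=1,    # assumption: "random state of 1"
    max_depth=3,
    max_features=3,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(cm)
print("accuracy:", acc, "F1:", f1)
```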
Task 4:
4.1 Fit a support vector machine (SVM) algorithm on Train.csv with the following parameters: gamma = 0.025; C = 3.
4.2 Provide the confusion matrix, F1 scores and accuracy in a markdown cell in your Jupyter notebook.
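A minimal sketch of the SVM step with scikit-learn's `SVC`, using the stated `gamma` and `C`. The default RBF kernel, the feature scaling step and the synthetic data are assumptions added for the example; the task does not specify them.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for Train.csv / Test.csv.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.7).astype(int)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Scaling is an added assumption: RBF-kernel SVMs are sensitive to feature scale.
svm = make_pipeline(StandardScaler(), SVC(gamma=0.025, C=3))
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

svm_cm = confusion_matrix(y_test, svm_pred)
svm_acc = accuracy_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)
print(svm_cm)
print("accuracy:", svm_acc, "F1:", svm_f1)
```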
Task 5:
5.1 Fit an ANN model (sequential) with 16 input neurons and add two hidden layers with 8 neurons each.
5.2 Use ReLU activation and the Adam optimiser. Use the normal kernel initialiser. Run it for 100 epochs on Train.csv with a batch size of 15.
5.3 Provide the confusion matrix, F1 scores and accuracy on the test dataset in a markdown cell in your Jupyter notebook.
Task 6:
Based on the evaluation metrics on the test dataset, explain which of the models from Task 2 to Task 5 (logistic regression, random forest, SVM and ANN) you will use, and why. Put your answer in a markdown cell in your Jupyter notebook.
About the data
The CC_Default.csv, Train.csv and Test.csv datasets contain a total of 25 variables, which are the following:
ID: A numerical value assigned to each credit card customer
LIMIT_BAL: The remaining credit a customer can use
LIMIT_BAL = (credit limit - used-up amount)
SEX: 1 = male; 2 = female
EDUCATION: A customer's educational attainment:
1 = graduate school
2 = university
3 = high school
4 = others
5 = unknown
6 = unknown
MARRIAGE: A customer's marital status:
0 = unknown
1 = married
2 = single
3 = others
AGE: A customer's age in years
PAY_0: Repayment status in September 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
PAY_2: Repayment status in August 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
PAY_3: Repayment status in July 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
PAY_4: Repayment status in June 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
PAY_5: Repayment status in May 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
PAY_6: Repayment status in April 2005:
0 or less = paid duly
1 or greater = the number of months the payment was delayed
BILL_AMT1
BILL_AMT2
BILL_AMT3
BILL_AMT4
BILL_AMT5
BILL_AMT6
PAY_AMT1
PAY_AMT2
PAY_AMT3
PAY_AMT4
PAY_AMT5
PAY_AMT6
default.payment.next.month: Whether the customer defaulted on their payment in the following month: 1 = yes; 0 = no
