Question
Part II: An application (75 marks)
2.1 Background on Credit Card Dataset
The data, "CreditCard Data.xls", is based on Yeh and Lien (2009). The dataset
contains 30,000 observations and 23 explanatory variables. The response variable, Y, is a
binary variable where "1" refers to default payment and "0" implies non-default payment.
The description of the 23 explanatory variables is as follows:
X1: Amount of the given credit (NT dollar): it includes both the individual
consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (0 = unknown; 1 = graduate school; 2 = university; 3 = high school;
4 = others; 5 = unknown; 6 = unknown).
X4: Marital status (0 = unknown; 1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. Monthly payment records were tracked
from April to September 2005 as follows: X6 = the repayment status in
September 2005; X7 = the repayment status in August 2005; . . .; X11 = the
repayment status in April 2005. The measurement scale for the repayment status
is: -2 = no consumption; -1 = paid duly; 0 = use of revolving credit; 1 = payment
delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay
for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement
in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 =
amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in
September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in
April, 2005.
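The coded variables X2, X3 and X4 are categories rather than quantities, so one option is to store them as factors before modelling. The sketch below uses a small made-up data frame named `toy` (not the real data) and base R only. Folding the education codes 5 and 6 into 0 is an assumption based on all three being labelled "unknown" above; it is a modelling choice, not something the assignment prescribes.

```r
# Toy rows standing in for the real data (values are made up for illustration)
toy <- data.frame(
  X2 = c(1, 2, 2, 1),
  X3 = c(0, 1, 5, 6),
  X4 = c(1, 2, 3, 0)
)

# Assumption: codes 0, 5 and 6 of X3 all mean "unknown", so merge them into 0
toy$X3[toy$X3 %in% c(5, 6)] <- 0

# Attach the labels from the variable description
toy$X2 <- factor(toy$X2, levels = c(1, 2), labels = c("male", "female"))
toy$X3 <- factor(toy$X3, levels = 0:4,
                 labels = c("unknown", "graduate school", "university",
                            "high school", "others"))
toy$X4 <- factor(toy$X4, levels = 0:3,
                 labels = c("unknown", "married", "single", "others"))

str(toy)
```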
2.2 Assessment Tasks
2.2.1 Data
(a) Select a random sample of 70% of the full dataset as the training data, retain the
rest as test data. Provide the code and print out the dimensions of the training
data. (5 marks)
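A minimal sketch of the 70/30 split in base R. To stay self-contained it builds a simulated stand-in called `credit` with the same shape as the dataset described in 2.1 (30,000 rows, X1 to X23 plus Y); the simulated values, the object names and the seed are all assumptions. With the actual file, `credit` would instead be read from the spreadsheet, for example with a package such as readxl.

```r
set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the real data: 23 predictors and a binary response
n <- 30000
credit <- as.data.frame(matrix(rnorm(n * 23), nrow = n))
names(credit) <- paste0("X", 1:23)
credit$Y <- rbinom(n, 1, 0.2)  # made-up default rate, for illustration only

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(seq_len(nrow(credit)), size = round(0.7 * nrow(credit)))

train <- credit[train_idx, ]   # training data
test  <- credit[-train_idx, ]  # remaining 30% kept as test data

print(dim(train))  # 21000 rows, 24 columns (23 predictors + Y)
print(dim(test))   # 9000 rows, 24 columns
```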
2.2.2 Tree Based Algorithms
(a) Use an appropriate tree-based algorithm to classify credible and non-credible clients.
Specify any underlying assumptions. Justify your model choice as well as the
hyperparameters which are required to be specified in R. (10 marks)
(b) Display model summary and discuss the relationship between the response variable
versus selected features. (10 marks)
(c) Evaluate the performance of the algorithm on the training data and comment on
the results. (5 marks)
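One possible shape for tasks (a) to (c), sketched with a single classification tree from the rpart package. This is an illustration on simulated data, not a model answer: the data frame `toy`, the way Y is generated from X6, and the hyperparameter values (`cp`, `minsplit`, `maxdepth`) are all assumptions chosen only to make the example run. A random forest or boosted trees would be equally valid choices of tree-based algorithm.

```r
library(rpart)

set.seed(1)
n <- 2000
# Simulated stand-in using three of the variable names from section 2.1
toy <- data.frame(
  X1 = runif(n, 10000, 500000),            # credit amount
  X5 = sample(21:70, n, replace = TRUE),   # age
  X6 = sample(-2:9, n, replace = TRUE)     # latest repayment status
)
# Assumption for illustration: default is more likely when X6 is larger
toy$Y <- factor(rbinom(n, 1, plogis(-2 + 0.6 * toy$X6)), levels = c(0, 1))

# Classification tree; the control values are illustrative, not tuned
fit <- rpart(Y ~ ., data = toy, method = "class",
             control = rpart.control(cp = 0.01, minsplit = 20, maxdepth = 5))

printcp(fit)                     # complexity table, useful when choosing cp
print(fit$variable.importance)   # which features drive the splits

# Performance on the training data
pred <- predict(fit, newdata = toy, type = "class")
conf <- table(actual = toy$Y, predicted = pred)
print(conf)
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```

Training accuracy is optimistic by construction, since the tree was grown on the same rows it is scored on.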
2.2.3 Support Vector Classifier
(a) Use an appropriate support vector classifier to classify the credible and non-credible
clients. Justify your model choice as well as the hyperparameters which are required
to be specified in R. (10 marks)
(b) Display model summary and discuss the relationship between the response variable
versus selected features. (10 marks)
(c) Evaluate the performance of the algorithm on the training data and comment on
the results. (5 marks)
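A hedged sketch of tasks (a) to (c) with a linear-kernel support vector classifier. It assumes the e1071 package is installed (it is not part of base R) and again uses simulated data; the data frame `toy`, the data-generating rule and `cost = 1` are illustrative assumptions. In practice the cost would usually be chosen by cross-validation rather than fixed by hand.

```r
library(e1071)  # assumption: e1071 is installed

set.seed(1)
n <- 500
# Simulated stand-in using two of the variable names from section 2.1
toy <- data.frame(
  X1 = runif(n, 10000, 500000),          # credit amount
  X6 = sample(-2:9, n, replace = TRUE)   # latest repayment status
)
# Assumption for illustration: default is more likely when X6 is larger
toy$Y <- factor(rbinom(n, 1, plogis(-2 + 0.6 * toy$X6)), levels = c(0, 1))

# Linear kernel = support vector classifier; features are standardised first
fit <- svm(Y ~ ., data = toy, kernel = "linear", cost = 1, scale = TRUE)
print(summary(fit))  # kernel, cost and number of support vectors

# For a linear kernel, the weight on each (scaled) feature can be recovered
# from the support vectors; larger magnitude means more influence
w <- t(fit$coefs) %*% fit$SV
print(w)

# Performance on the training data
pred <- predict(fit, newdata = toy)
conf <- table(actual = toy$Y, predicted = pred)
print(conf)
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```

The sign of each weight depends on which class the fitting routine treats as positive, so the magnitudes are easier to interpret than the signs.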
2.2.4 Prediction
Apply your fitted models in 2.2.2 and 2.2.3 to make predictions on the test data. Evaluate
the performance of the algorithms on the test data. Which model do you prefer? Are
there any suggestions to further improve the performance of the algorithms? Justify your
answers. (20 marks)
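When comparing models on the test data, accuracy alone can mislead if defaults are a minority of cases, so it may help to report sensitivity and specificity alongside it. The helper below, `classification_metrics`, is a hypothetical name introduced for this sketch; it uses base R only, and the two short vectors are made-up labels rather than results from the real data.

```r
# Hypothetical helper: accuracy, sensitivity and specificity for binary labels
classification_metrics <- function(actual, predicted, positive = "1") {
  actual <- as.character(actual)
  predicted <- as.character(predicted)
  tp <- sum(actual == positive & predicted == positive)  # defaults caught
  tn <- sum(actual != positive & predicted != positive)  # non-defaults kept
  fp <- sum(actual != positive & predicted == positive)  # false alarms
  fn <- sum(actual == positive & predicted != positive)  # defaults missed
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Made-up example: 3 defaults among 10 clients, only 1 of them detected
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1)

m <- classification_metrics(actual, predicted)
print(round(m, 3))
# accuracy 0.700, sensitivity 0.333, specificity 0.857
```

Here the accuracy of 0.7 hides the fact that two of the three defaults were missed, which is the kind of trade-off worth discussing when stating a model preference.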