Question

Posted on Nov 06, 2024

a) Let's start by building a logistic regression model.

i. The dataset also has a 0/1 training variable. Using this variable, divide the dataset into a training set (70%) and a holdout set (30%). What is the accuracy on the holdout set of a simple baseline model that predicts that all loans will be paid back in full (NotFullyPaid = 0)? We will try to build a model that adds value over this simple baseline model.
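Outside Radiant, the baseline calculation can be sketched in a few lines of Python. This is only an illustration, not the assignment's method: the column names training and NotFullyPaid follow the question, the assumption that training == 1 marks the training rows is not stated in the question, and the toy records are made up purely so the code runs.

```python
def baseline_accuracy(rows):
    """Accuracy on the holdout rows of always predicting NotFullyPaid = 0."""
    # Assumption: training == 1 marks the training set, 0 the holdout set.
    holdout = [r for r in rows if r["training"] == 0]
    if not holdout:
        raise ValueError("no holdout rows found")
    # The baseline is correct exactly when the loan was paid back in full.
    correct = sum(1 for r in holdout if r["NotFullyPaid"] == 0)
    return correct / len(holdout)


# Made-up records for illustration only; not taken from the loan dataset.
toy_rows = [
    {"training": 1, "NotFullyPaid": 0},
    {"training": 1, "NotFullyPaid": 1},
    {"training": 0, "NotFullyPaid": 0},
    {"training": 0, "NotFullyPaid": 0},
    {"training": 0, "NotFullyPaid": 1},
    {"training": 0, "NotFullyPaid": 0},
]
toy_baseline = baseline_accuracy(toy_rows)
```

The baseline accuracy is simply the share of holdout loans that were fully paid, so any useful model has to beat that share.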

ii. Now, build a logistic regression model that predicts the dependent variable NotFullyPaid using all the other variables as independent variables. Use the training set as the data to build the model. Which of the independent variables are significant? Do not drop the insignificant variables from the model yet.

iii. Consider two loan applications that are identical except that the borrower in Application A has a FICO credit score of 700 while the borrower in Application B has a FICO credit score of 710. Let Logit(A) be the log odds of loan A not being paid back in full, according to the model built in (ii), and define Logit(B) similarly for loan B. What is Logit(A)-Logit(B)? (Hint: To answer this question, find the coefficient in front of FICO score in the model and multiply it by 710-700=10.)
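The hint can be written out as a short derivation. The symbols below are notation introduced here, not taken from the question: \beta_{\text{FICO}} stands for whatever FICO coefficient the fitted model reports, and the sum collects all other explanatory variables, which take the same values for both applications.

```latex
\begin{aligned}
\operatorname{Logit}(A) &= \beta_0 + \beta_{\text{FICO}} \cdot 700 + \sum_{j} \beta_j x_j \\
\operatorname{Logit}(B) &= \beta_0 + \beta_{\text{FICO}} \cdot 710 + \sum_{j} \beta_j x_j \\
\operatorname{Logit}(A) - \operatorname{Logit}(B) &= \beta_{\text{FICO}} \, (700 - 710) = -10 \, \beta_{\text{FICO}}
\end{aligned}
```

All other terms cancel because the applications are otherwise identical. Note that multiplying the coefficient by 710-700=10, as the hint suggests, gives Logit(B)-Logit(A), so the requested difference Logit(A)-Logit(B) has the opposite sign.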

iv. The change in log odds calculated in part (iii) is easy to compute but hard to interpret. Now, predict the probability of default for Application A and Application B using Radiant. Go to Predict > Predict input type = Command and type fico=c(710,700) in the Predict command box. Radiant will use either the mean or the most frequent value in the training dataset for each of the remaining explanatory variables and will display the probability of default for the average person in the dataset whose FICO score changes from 700 to 710. What are these probabilities?
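The arithmetic behind turning log odds into a probability is the inverse logit. A minimal Python sketch follows; the function names and numeric inputs are hypothetical placeholders chosen for illustration, not values from the fitted model.

```python
import math


def logit_to_probability(log_odds):
    """Inverse logit: convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def shifted_probability(p, delta_log_odds):
    """Probability after adding delta_log_odds to the log odds implied by p."""
    log_odds = math.log(p / (1.0 - p))
    return logit_to_probability(log_odds + delta_log_odds)


# Hypothetical inputs: the same log-odds shift moves the probability by
# different amounts depending on the starting probability.
shift = -0.1
from_half = shifted_probability(0.5, shift)
from_tenth = shifted_probability(0.1, shift)
```

This is why the log-odds difference alone is hard to interpret: the resulting change in probability depends on where the borrower starts, which is what Radiant's prediction at the average values makes concrete.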

v. Now predict the probability of the holdout set loans not being paid back in full. Store these predicted probabilities in a variable named PredictedRisk and add it to the holdout set (we will use this variable later on). What is the accuracy of the logistic regression model on the holdout set using a threshold of 0.5? How does this compare to the baseline model? (Hint: Select the holdout data as the dataset, go to Radiant > Evaluate Classification, choose the response variable, select the stored predictions, and use Cost=1 and Margin=2 to set the threshold at 0.5. The Confusion tab should give the accuracy.)
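The hint's Cost=1 and Margin=2 correspond to a break-even threshold of cost / margin = 0.5. As a rough sketch of the accuracy calculation, assuming predictions at or above the threshold are classified as NotFullyPaid = 1 (how Radiant handles exact ties is not specified in the question, and the probabilities and outcomes below are made up):

```python
def accuracy_at_threshold(probabilities, actuals, threshold=0.5):
    """Return (accuracy, confusion counts) for 0/1 labels at a cutoff."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, y in zip(probabilities, actuals):
        # Assumption: a probability equal to the threshold is predicted as 1.
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations to evaluate")
    return (counts["TP"] + counts["TN"]) / total, counts


cost, margin = 1, 2
threshold = cost / margin  # break-even threshold of 0.5

# Made-up PredictedRisk values and outcomes, for illustration only.
predicted_risk = [0.10, 0.62, 0.35, 0.80, 0.20]
not_fully_paid = [0, 1, 1, 0, 0]
accuracy, counts = accuracy_at_threshold(predicted_risk, not_fully_paid, threshold)
```

Accuracy is the sum of true positives and true negatives divided by the number of holdout loans, which can be compared directly with the baseline accuracy from part (i).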

vi. What is the holdout set AUC of the model? Given the accuracy and the AUC of the model on the holdout set, do you think the model could be useful to an investor to make profitable investments?
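For intuition about what the AUC measures: it equals the probability that a randomly chosen loan that was not fully paid receives a higher predicted risk than a randomly chosen loan that was paid in full, with ties counted as one half. A minimal Python sketch of that pairwise definition follows; the function name and toy values are illustrative, and the quadratic-time loop is only suitable for small samples.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores and outcomes, for illustration only.
toy_scores = [0.10, 0.62, 0.35, 0.80, 0.20]
toy_labels = [0, 1, 1, 0, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a model can rank loans by risk usefully even when its accuracy at a 0.5 threshold barely differs from the baseline.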

Link to the dataset:

https://drive.google.com/file/d/16wR9LqWSBwZTUZM5A5rYcFhfj8kqv864/view?usp=share_link