
Question


Posted on Oct 10, 2024


FI 4090

Assignment 3: Logistic Regression

Part 1 (100 points)

Analyze the data in the CreditCard dataset.

The following variables are included in the dataset:

1. card: Was the application for a card accepted? (Binary: 1/0) Response variable

2. reports: Number of major derogatory reports

3. income: Yearly income (in USD 10,000)

4. Age: Age in years plus twelfths of a year

5. Owner: Does the individual own his/her home?

6. dependents: number of dependents

7. months: Months living at current address

8. share: ratio of monthly credit card expenditure to yearly income

9. selfemp: Is the individual self-employed?

10. majorcards: number of major credit cards held

11. active: number of active credit accounts

12. expenditure: average monthly credit card expenditure

Use variables 2 to 8 (as listed above) to determine which of the predictors influence the probability that an application is accepted.

Provide summary statistics of the predictors. (5 points)

There are some values of the variable Age under one year. Consider only data with Age > 18 for your analysis in the rest of the questions. (5 points)

Plot income vs. reports (number of major derogatory reports): mark individuals whose card application was accepted in blue, and those not accepted in red. (5 points)
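
One possible way to color such a scatter plot is sketched below. The data frame `toy` is synthetic and hypothetical (it only mimics the column names `card`, `reports`, and `income`); the same pattern would apply to the real data once it is loaded.

```r
# Synthetic stand-in for the assignment data (assumption: card is coded 1/0).
set.seed(1)
n <- 200
toy <- data.frame(
  card    = rbinom(n, 1, 0.75),
  reports = rpois(n, 0.4),
  income  = round(runif(n, 1, 10), 2)
)

# Blue for accepted applications (card == 1), red otherwise.
cols <- ifelse(toy$card == 1, "blue", "red")

# "income vs. reports": income on the y-axis, reports on the x-axis.
plot(toy$reports, toy$income, col = cols, pch = 19,
     xlab = "reports", ylab = "income (USD 10,000)")
legend("topright", legend = c("accepted", "not accepted"),
       col = c("blue", "red"), pch = 19)
```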

Draw boxplots of income as a function of card acceptance status, and boxplots of reports as a function of card acceptance status (mark card application accepted as blue, and not accepted as red). Display the two boxplots on the same page. (10 points)
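
A minimal sketch of placing two boxplots on one page with par(mfrow = ...), again on a synthetic, hypothetical data frame `toy` rather than the real data:

```r
# Synthetic stand-in data (assumption: card is coded 1/0).
set.seed(1)
n <- 200
toy <- data.frame(
  card    = rbinom(n, 1, 0.75),
  reports = rpois(n, 0.4),
  income  = round(runif(n, 1, 10), 2)
)

# One row, two columns of plots on the same page.
op <- par(mfrow = c(1, 2))

# Groups are ordered 0 then 1, so red maps to "not accepted", blue to "accepted".
b1 <- boxplot(income ~ card, data = toy, col = c("red", "blue"),
              names = c("not accepted", "accepted"),
              xlab = "card", ylab = "income", main = "income by card status")
b2 <- boxplot(reports ~ card, data = toy, col = c("red", "blue"),
              names = c("not accepted", "accepted"),
              xlab = "card", ylab = "reports", main = "reports by card status")

par(op)  # restore the previous plotting layout
```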

Construct histograms for the predictors (2 to 8 in the list above). (5 points)
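
The histograms could be drawn in a loop, as in the sketch below. The data frame `toy` is synthetic and hypothetical. It also assumes that Owner is stored as a yes/no factor, in which case a bar chart of counts stands in for a histogram; if Owner is numeric in your data, hist() applies directly.

```r
# Synthetic stand-in for predictors 2 to 8 (all values are made up).
set.seed(2)
n <- 150
toy <- data.frame(
  reports    = rpois(n, 0.4),
  income     = runif(n, 1, 10),
  Age        = runif(n, 19, 70),
  Owner      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  dependents = rpois(n, 1),
  months     = rpois(n, 50),
  share      = rexp(n, 15)
)

op <- par(mfrow = c(2, 4))  # 7 panels fit in a 2 x 4 grid
for (v in names(toy)) {
  if (is.numeric(toy[[v]])) {
    hist(toy[[v]], main = v, xlab = v)
  } else {
    barplot(table(toy[[v]]), main = v)  # categorical predictor: counts per level
  }
}
par(op)
```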

Note that share is highly right-skewed, so log(share) will be used in the analysis. reports is also extremely right-skewed (most values of reports are 0 or 1, but the maximum value is 14). To reduce the skewness, log(reports+1) will be used for your analysis. Highly skewed predictors have high leverage points and are less likely to be linearly related to the response.

Use variables 2 to 8 to determine which of the predictors influence the probability that an application is accepted. Use the summary function to print the results. (10 points)
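
As a rough sketch of this step, the model can be fitted with glm() and family = binomial, writing the two log transforms directly in the formula. The data frame `toy` and the rule used to generate `card` are synthetic and hypothetical, so the printed coefficients say nothing about the real data.

```r
# Synthetic stand-in data; the acceptance rule below is invented for illustration.
set.seed(3)
n <- 400
toy <- data.frame(
  reports    = rpois(n, 0.4),
  income     = round(runif(n, 1, 10), 2),
  Age        = round(runif(n, 19, 70), 1),
  Owner      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  dependents = rpois(n, 1),
  months     = rpois(n, 50),
  share      = rexp(n, 15) + 0.001   # strictly positive so log(share) is defined
)
eta <- 1 - 1.5 * log(toy$reports + 1) + 0.2 * toy$income
toy$card <- rbinom(n, 1, plogis(eta))

# Logistic regression on predictors 2 to 8, using log(reports + 1) and log(share).
fit <- glm(card ~ log(reports + 1) + income + Age + Owner +
             dependents + months + log(share),
           data = toy, family = binomial)
summary(fit)
```

In the summary output, the Pr(>|z|) column gives the p-value for each coefficient, and the sign of an estimate indicates whether the predictor raises or lowers the log-odds of acceptance.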

Do the predictors appear to be statistically significant? If so, which ones? Explain how each of the significant predictors influences the response variable.

To predict whether the application will be accepted or not, convert the predicted probabilities into class labels using the following condition: probs > 0.5 gives "yes" (otherwise "no"). Compute the confusion matrix and the overall fraction of correct predictions. (30 points)

Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression (false positive, false negative, overall correct predictions).
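
The thresholding and confusion matrix steps might look like the sketch below. The data and the fitted model are synthetic and hypothetical (only two predictors are used to keep it short); the names `glm.pred` and `actual` are illustrative choices.

```r
# Synthetic stand-in data and a small model, for illustration only.
set.seed(4)
n <- 300
toy <- data.frame(reports = rpois(n, 0.5), income = runif(n, 1, 10))
toy$card <- rbinom(n, 1,
                   plogis(0.5 - 1.5 * log(toy$reports + 1) + 0.2 * toy$income))
fit <- glm(card ~ log(reports + 1) + income, data = toy, family = binomial)

# Predicted acceptance probabilities on the same data used for fitting.
probs <- predict(fit, type = "response")

# Start with "no" everywhere, then switch to "yes" where probs > 0.5.
glm.pred <- rep("no", nrow(toy))
glm.pred[probs > 0.5] <- "yes"
actual <- ifelse(toy$card == 1, "yes", "no")

# Fixing the factor levels keeps the table 2 x 2 even if one label never occurs.
lv <- c("no", "yes")
conf <- table(predicted = factor(glm.pred, levels = lv),
              actual    = factor(actual, levels = lv))
conf
accuracy <- mean(glm.pred == actual)  # overall fraction of correct predictions
accuracy
```

In this layout, the cell (predicted "yes", actual "no") counts false positives, the cell (predicted "no", actual "yes") counts false negatives, and the diagonal holds the correct predictions.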

Now fit the logistic regression model using observations 1 to 1000 as training data. Compute the confusion matrix and the overall fraction of correct predictions for the test data (that is, the data for observations 1001 to the end of the data). (30 points)

Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression (false positive, false negative, overall correct predictions).

Create dummy variables based on the original CreditCard data, then cbind() the dummy variables with the original CreditCard data. You can give the result another name if you like. (DO NOT cbind() only the variables; that will create a matrix and not a data frame.)
Generate the summary() using the data created in step 1: Part A.
Create CreditCard_Adults for Age > 18 based on the data created in step 1 (make sure it is a data frame):

Part B. Try: CreditCard_Adults = subset(CreditCard, Age > 18) # There are many other ways you can create CreditCard_Adults.
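
The hints for Parts A and B can be sketched on a tiny made-up data frame. Everything here is hypothetical: `CreditCard_toy`, `CreditCard2`, and the dummy names `card_d` and `Owner_d` are illustrative, and the sketch assumes the dummies recode yes/no columns as 1/0.

```r
# Tiny made-up data frame standing in for the original CreditCard data.
CreditCard_toy <- data.frame(
  card  = c("yes", "no", "yes", "yes", "no"),
  Owner = c("yes", "no", "no", "yes", "no"),
  Age   = c(37.7, 0.5, 33.3, 18.0, 52.1),
  stringsAsFactors = TRUE
)

# Dummy variables (1 = "yes", 0 = "no"), kept in their own data frame.
dummies <- data.frame(
  card_d  = as.integer(CreditCard_toy$card == "yes"),
  Owner_d = as.integer(CreditCard_toy$Owner == "yes")
)

# cbind() with a data frame argument returns a data frame.
CreditCard2 <- cbind(CreditCard_toy, dummies)
class(CreditCard2)

# cbind() on bare vectors returns a matrix, which is what the hint warns about.
m <- cbind(dummies$card_d, dummies$Owner_d)
is.matrix(m)

summary(CreditCard2)

# Keep only rows with Age strictly greater than 18.
CreditCard_Adults <- subset(CreditCard2, Age > 18)
dim(CreditCard_Adults)  # 3 rows, 5 columns for this toy data
```

Note that Age > 18 is strict, so the row with Age equal to 18.0 is dropped along with the row under one year.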

Attach CreditCard_Adults to the R environment.
For all the following steps (C-H), use CreditCard_Adults. (This should be a data frame; check class(CreditCard_Adults) to see the type of object.)
Always use the function length() to check the length of a vector, and use dim() to check the dimensions of a data frame.
Use log(share) and log(reports+1) in all the regressions.

Hints on Part H:

Create the train and test data sets using the hint below:

train=CreditCard_Adults[1:1000,]
test=CreditCard_Adults[1001:nrow(CreditCard_Adults),]

Fit the logistic model on the train data using the glm() function.
Compute model accuracy (similar to what you did in part G), but on the test data.
Use the test data in the predict() function to predict the card acceptance probability for the test observations, based on the model fitted in step 2 of the hint above.
To predict whether a card will be accepted (yes/no), convert the predicted probabilities into class labels "yes" or "no" on the test data.

Remember: glm.pred_test=rep("no",??) ## here ?? should be the number of rows in the test data

Finally, compute the accuracy of the model on the test observations based on the predicted (yes/no) and actual (yes/no) card acceptance.

For table() and mean(): use glm.pred_test and test.card (the response variable for the test data). ## Check the length of both to make sure that they match and equal the number of rows in the test data set.
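
Putting the Part H hints together, one possible end-to-end sketch follows. The data frame `adults` is synthetic and hypothetical (1,300 made-up rows standing in for CreditCard_Adults, with only three predictors), and `glm.fit_train` and `probs_test` are illustrative names; `train`, `test`, `glm.pred_test`, and `test.card` follow the hints.

```r
# Synthetic stand-in for CreditCard_Adults; the acceptance rule is invented.
set.seed(5)
n <- 1300
adults <- data.frame(
  reports = rpois(n, 0.5),
  income  = runif(n, 1, 10),
  share   = rexp(n, 15) + 0.001
)
adults$card <- rbinom(n, 1,
                      plogis(0.5 - 1.5 * log(adults$reports + 1) +
                               0.2 * adults$income))

# Step 1: split by row position.
train <- adults[1:1000, ]
test  <- adults[1001:nrow(adults), ]

# Step 2: fit on the training rows only.
glm.fit_train <- glm(card ~ log(reports + 1) + income + log(share),
                     data = train, family = binomial)

# Step 3: predicted probabilities for the test rows (note newdata = test).
probs_test <- predict(glm.fit_train, newdata = test, type = "response")

# Step 4: class labels, with one entry per test row.
glm.pred_test <- rep("no", nrow(test))
glm.pred_test[probs_test > 0.5] <- "yes"
test.card <- ifelse(test$card == 1, "yes", "no")

# Both vectors must have as many entries as the test set has rows.
stopifnot(length(glm.pred_test) == nrow(test),
          length(test.card) == nrow(test))

# Step 5: confusion matrix and accuracy on the test data.
table(predicted = glm.pred_test, actual = test.card)
mean(glm.pred_test == test.card)
```

Omitting newdata = test in predict() would silently return predictions for the 1,000 training rows, which is why the length checks are worth keeping.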