Question

Posted on Jun 25, 2024

The dataset UniversalBank.csv below contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (PersonalLoan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Partition the dataset into training (60%) and validation (40%) sets, then consider the following customer:

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.

Second part of the problem

Consider the following customer:

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.

Classify the above customer using the best k.

Repartition the data, this time into training, validation, and test sets (50% : 30% : 20%).

Apply the k-NN method with the k chosen above.

Compare the confusion matrix of the test set with that of the training and validation sets.

Comment on the differences and the reasons for them.
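The repartitioning and confusion-matrix comparison described above can be sketched roughly as follows. This is only an illustration on synthetic stand-in data, not on UniversalBank.csv: the predictors `a` and `b`, the helper `conf.mat()`, and the placeholder `k <- 5` are all assumptions made for the example. It relies only on base R and `knn()` from the `class` package.

```r
library(class)  # provides knn()

set.seed(1)
# Synthetic stand-in data (assumption), not UniversalBank.csv
n <- 1000
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + x$b + rnorm(n, sd = 0.5) > 1, 1, 0))

# Shuffle the row numbers once, then cut them 50% : 30% : 20%
shuffled <- sample(n)
n.train <- round(0.5 * n)
n.valid <- round(0.3 * n)
train.rows <- shuffled[seq_len(n.train)]
valid.rows <- shuffled[n.train + seq_len(n.valid)]
test.rows  <- shuffled[-seq_len(n.train + n.valid)]

# Standardize every partition with the training means and standard deviations
mu <- colMeans(x[train.rows, ])
sdev <- apply(x[train.rows, ], 2, sd)
x.std <- as.data.frame(scale(x, center = mu, scale = sdev))

k <- 5  # placeholder (assumption); substitute the k chosen on the validation set

# Confusion matrix for any set of rows, always predicting from the training set
conf.mat <- function(rows) {
  pred <- knn(train = x.std[train.rows, ], test = x.std[rows, ],
              cl = y[train.rows], k = k)
  table(Predicted = pred, Actual = y[rows])
}

cm.train <- conf.mat(train.rows)
cm.valid <- conf.mat(valid.rows)
cm.test  <- conf.mat(test.rows)

# Overall accuracy of each partition
sapply(list(train = cm.train, valid = cm.valid, test = cm.test),
       function(cm) sum(diag(cm)) / sum(cm))
```

In a setup like this the training accuracy tends to look best, because every training record counts itself among its own nearest neighbors, while the test set was used neither for fitting nor for choosing k and so usually gives the least optimistic estimate.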

Dataset and my current code (some of it is not working)

Dataset: https://github.com/MyGitHub2120/Personal-Loan-Acceptance

Here is my code:

library("dplyr")

library("tidyr")

library("ggplot2")

library("rpart")

library("rpart.plot")

library("caret")

library("randomForest")

library("tidyverse")

library("glmnet")

library("Hmisc")

library("dummies")

library('tinytex')

library('GGally')

library('gplots')

library("caTools")

library("reshape")

library("class")  # provides knn(), which is called below

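# read.csv() converts spaces in column names to dots (e.g. "Personal Loan" becomes Personal.Loan),
# which matches the names used below; readr's read_csv() keeps the spaces.
df <- read.csv("C:/Users/andyt/OneDrive/Desktop/UniversalBank.csv")

View(df)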

bank<-df

names(bank)

bank$Education <- as.factor(bank$Education)

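bank_dummy <- dummy.data.frame(select(bank, -c(Zip.Code, ID)))

# Not working: could not categorize the variable 'Zip Code'. I need to resolve this before the next step.
# (The name passed to select() has to match names(bank) exactly, including case.)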

bank_dummy$Personal.Loan = as.factor(bank_dummy$Personal.Loan)

bank_dummy$CCAvg = as.integer(bank_dummy$CCAvg)

set.seed(1)

train.index <- sample(row.names(bank_dummy), 0.6 * dim(bank_dummy)[1])  # need to look at hints

valid.index <- setdiff(row.names(bank_dummy), train.index)

train.df <- bank_dummy[train.index, ]

valid.df <- bank_dummy[valid.index, ]

new.df = data.frame(Age = as.integer(40), Experience = as.integer(10), Income = as.integer(84), Family = as.integer(2), CCAvg = as.integer(2), Education1 = as.integer(0), Education2 = as.integer(1), Education3 = as.integer(0), Mortgage = as.integer(0), Securities.Account = as.integer(0), CD.Account = as.integer(0), Online = as.integer(1), CreditCard = as.integer(1))

norm.values <- preProcess(train.df[, -c(10)], method=c("center", "scale"))

train.df[, -c(10)] <- predict(norm.values, train.df[, -c(10)])

valid.df[, -c(10)] <- predict(norm.values, valid.df[, -c(10)])

new.df <- predict(norm.values, new.df)

knn.1 <- knn(train = train.df[,-c(10)],test = new.df, cl = train.df[,10], k=5, prob=TRUE)

knn.attributes <- attributes(knn.1)
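
One possible way to find the "best k" asked for in the problem is to score a range of k values on the validation set and keep the most accurate one, then classify the new customer with it. The sketch below runs on synthetic stand-in data rather than UniversalBank.csv, so the names `x`, `y`, `acc`, `best.k`, `new.obs` and the range 1 to 15 are illustrative assumptions, not part of my code above.

```r
library(class)  # provides knn()

set.seed(1)
# Synthetic stand-in data (assumption): two numeric predictors, binary outcome
n <- 500
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + x$b + rnorm(n, sd = 0.5) > 1, 1, 0))

# 60% training, 40% validation
train.rows <- sample(n, 0.6 * n)
valid.rows <- setdiff(seq_len(n), train.rows)

# Standardize with the training means and standard deviations only
mu <- colMeans(x[train.rows, ])
sdev <- apply(x[train.rows, ], 2, sd)
x.std <- as.data.frame(scale(x, center = mu, scale = sdev))

# Validation accuracy for k = 1, ..., 15
acc <- sapply(1:15, function(k) {
  pred <- knn(train = x.std[train.rows, ], test = x.std[valid.rows, ],
              cl = y[train.rows], k = k)
  mean(pred == y[valid.rows])
})
best.k <- which.max(acc)  # first k with the highest validation accuracy

# Standardize one new record the same way, then classify it with the chosen k
new.obs <- as.data.frame(scale(data.frame(a = 0.2, b = 0.4),
                               center = mu, scale = sdev))
new.pred <- knn(train = x.std[train.rows, ], test = new.obs,
                cl = y[train.rows], k = best.k, prob = TRUE)

new.pred                # predicted class
attr(new.pred, "prob")  # proportion of neighbor votes for the predicted class
```

With `prob = TRUE`, the value stored in the `"prob"` attribute is the vote share of the winning class, which is presumably what `attributes(knn.1)` is meant to expose in my code.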