Question

Posted on Sep 22, 2024

NEED HELP WITH #4 D & E

##### IS 470 Homework 2-----------------------------------------------------------

### -------------------------------------------------------------------------------

#Banks can generate significant profits from term deposits such as a certificate of deposit (CD).

#These deposits are required to be held for a certain period of time, which gives the bank access to those funds for lending purposes at a higher rate than the rate paid for the deposit. Of course, marketing term deposit products to customers can be expensive, so the bank will want to focus their efforts on those customers most likely to buy these products. In this data set, we have information about 45,211 customers, including demographic information as well as data related to their prior experience with the bank and previous marketing campaigns. Additionally, we have a class variable "y" that indicates whether this customer purchased a term product in the current marketing campaign. Our objective is to predict which customers will purchase a term product if we spend the money to advertise to them. We want to develop a model that will maximize the returns based on the costs of marketing and the benefits of customer purchase.

#This data was from a paper published by Moro et al. (S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014)

'VARIABLE DESCRIPTIONS:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (categorical: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (categorical: "yes","no")

8 - loan: has personal loan? (categorical: "yes","no")

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Target variable:

17 - y: has the client subscribed to a term deposit? (binary: "yes","no")

The target classification (output) column is y. All other columns are potential predictors.'

### ------------------------------------------------------------------------------

### 1. Import and clean data (10 points)

# A. Import data. Load character variable as character strings first (stringsAsFactors = FALSE).

bank <- read.csv(file = "bank.csv", stringsAsFactors = FALSE)

# B. Show the overall structure and summary of the input data.

str(bank)

summary(bank)

# C. Transform categorical variables (10 categorical variables) to factors.

# Show the overall structure and summary of the data again.

bank$job <- factor(bank$job)

bank$marital <- factor(bank$marital)

bank$education <- factor(bank$education)

bank$default <- factor(bank$default)

bank$housing <- factor(bank$housing)

bank$loan <- factor(bank$loan)

bank$contact <- factor(bank$contact)

bank$month <- factor(bank$month)

bank$poutcome <- factor(bank$poutcome)

bank$y <- factor(bank$y)

str(bank)

summary(bank)

# D. Explore categorical variables, and answer the following questions.

# 1) Show the summary of target variable.

summary(bank$y)

# 2) Show the summary of housing variable, how many customers have housing loans?

summary(bank$housing)

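summary(bank$housing == 'yes')

#The number of customers with housing loans is the TRUE count in this summary (the same as the "yes" count in summary(bank$housing)). The figure 7244 is the count for bank$loan == 'yes', that is, customers with personal loans.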

# 3) Show the summary of job variable, how many customers are retired?

summary(bank$job)

summary(bank$job == 'retired')

#According to the summary, 2264 people are retired.
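A single level of a factor can also be counted directly, which avoids reading the wrong row of a summary. The following is a minimal sketch on a small made-up vector; the values are illustrative and are not taken from bank.csv.

```r
# Hypothetical toy data standing in for bank$job (not from bank.csv)
job <- factor(c("retired", "admin.", "retired", "student", "services"))

# table() returns the count of every level
job_counts <- table(job)
print(job_counts)

# Count one level directly: sum() of a logical vector counts the TRUE values
n_retired <- sum(job == "retired")
print(n_retired)
```

On the real data, the same pattern would be `sum(bank$job == "retired")`.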

# E. Explore numeric variables, and answer the following questions.

# 1) Create a histogram of the balance. Is the distribution skewed?

hist(bank$balance, main="Histogram of Balance in the bank data set", xlab="balance")

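#Yes, the distribution is right-skewed: most balances are concentrated at low values, with a long tail of large balances.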
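A histogram can be backed up with a quick numeric check. The sketch below compares the mean with the median and computes a moment-based skewness coefficient; the `skewness` helper and the toy balances are assumptions made for illustration, not part of the assignment or of bank.csv.

```r
# Hypothetical right-skewed toy balances (not from bank.csv)
balance <- c(0, 50, 120, 200, 300, 450, 800, 1500, 6000, 25000)

# In a right-skewed distribution the mean is pulled above the median
mean_balance <- mean(balance)
median_balance <- median(balance)

# Moment-based sample skewness: a positive value indicates a long right tail
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}
balance_skew <- skewness(balance)

print(c(mean = mean_balance, median = median_balance, skewness = balance_skew))
```

Applied to `bank$balance`, a mean well above the median and a positive skewness value would support the reading of the histogram.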

# 2) Create a correlation table for all of the numeric values in the data set.

# Which two variables have the highest correlation?

cor(bank[,c("age", "balance", "day", "duration", "campaign", "pdays", "previous")])

cor(bank[,c(1,6,10,12:15)])

#pdays and previous have the highest correlation
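Rather than scanning the correlation table by eye, the strongest pair can be located programmatically. This is a sketch on a made-up data frame (columns `a`, `b`, `c` are hypothetical); it masks the diagonal and the duplicated lower triangle before searching for the largest absolute correlation.

```r
# Hypothetical toy data frame (not from bank.csv)
toy <- data.frame(
  a = 1:10,
  b = c(2, 4, 5, 8, 10, 13, 14, 15, 18, 21),  # nearly linear in a
  c = c(5, 1, 4, 2, 8, 3, 9, 1, 7, 2)         # roughly unrelated to a
)

cor_matrix <- cor(toy)

# Ignore the diagonal (always 1) and the mirrored lower triangle
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# Find the position of the largest absolute correlation that remains
top <- which(abs(cor_matrix) == max(abs(cor_matrix), na.rm = TRUE), arr.ind = TRUE)
top_pair <- c(rownames(cor_matrix)[top[1, "row"]], colnames(cor_matrix)[top[1, "col"]])
print(top_pair)
```

With the real data, `toy` would be replaced by the numeric columns already selected in the `cor()` call above.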

# 3) Show a boxplot of duration by variable y.

boxplot(duration~y, data = bank)

### 2. Data partitioning and inspection code (5 points)

# A. Partition the data set for simple hold-out evaluation - 70% for training and the other 30% for testing.

library(caret)

set.seed(1)

inTrain <- createDataPartition(bank$y, p=0.7, list=FALSE)

datTrain <- bank[inTrain,]

datTest <- bank[-inTrain,]

# B. Show the overall structure and summary of train and test sets.

str(datTrain)

summary(datTrain)

str(datTest)

summary(datTest)

prop.table(table(bank$y))

prop.table(table(datTrain$y))

prop.table(table(datTest$y))
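The three `prop.table()` calls check that the class balance of `y` is preserved in both partitions. The idea behind a stratified split can be sketched in base R by sampling 70% of the rows within each class separately. This is an illustration on made-up data, not caret's actual implementation.

```r
# Hypothetical toy data: 100 rows, 10% "yes" (not from bank.csv)
toy <- data.frame(id = 1:100,
                  y = factor(rep(c("no", "yes"), times = c(90, 10))))

set.seed(1)

# Sample 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both partitions keep the original 90/10 class balance
print(prop.table(table(toy_train$y)))
print(prop.table(table(toy_test$y)))
```

Because sampling happens inside each class, the rare "yes" class cannot end up over- or under-represented in the test set by chance.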

### 3. Classification model training and testing. (15 points)

# A. Train a decision tree model using rpart package, set cp = 0.0001, maxdepth = 7. (sample code in lab4)

# And generate this model's confusion matrices and classification evaluation metrics on training and testing sets.

library(rpart)

library(rpart.plot)

library(rminer)

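rpart_model <- rpart(y~.,data = datTrain,control = rpart.control(cp = 0.0001, maxdepth = 7))

rpart.plot(rpart_model)

rpart_model

# Evaluate the decision tree on the training and testing sets

tree_prediction_on_train <- predict(rpart_model, datTrain, type = "class")

tree_prediction_on_test <- predict(rpart_model, datTest, type = "class")

mmetric(datTrain$y,tree_prediction_on_train, metric="CONF")

mmetric(datTest$y,tree_prediction_on_test, metric="CONF")

mmetric(datTrain$y,tree_prediction_on_train,metric=c("ACC","PRECISION","TPR","F1"))

mmetric(datTest$y,tree_prediction_on_test,metric=c("ACC","PRECISION","TPR","F1"))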

# B. Train a Naive Bayes model with laplace = 1.

# And generate this model's confusion matrices and classification evaluation metrics on training and testing sets.

library(e1071)

model <- naiveBayes(y~.,data=datTrain, laplace = 1)

model

prediction_on_train <- predict(model, datTrain)

prediction_on_test <- predict(model, datTest)

mmetric(datTrain$y,prediction_on_train, metric="CONF")

mmetric(datTest$y,prediction_on_test, metric="CONF")

mmetric(datTrain$y,prediction_on_train,metric=c("ACC","PRECISION","TPR","F1"))

mmetric(datTest$y,prediction_on_test,metric=c("ACC","PRECISION","TPR","F1"))

# C. Which model (decision tree or naive bayes) has better performance? And why?

#The Naive Bayes model has better performance because it had high accuracy, precision, and F-measure.

### 4. Cost-benefit analysis. (20 points)

# A. Assume the following costs / benefits:

# a. Cost of marketing to a customer: $30 per customer receiving marketing.

# b. Average Bank income for a purchase: $500 per customer that purchases term deposit.

# c. Opportunity cost of person not marketed but who would have been a purchaser: $500

# Based on the above costs / benefits, what is the total net benefit / cost of the two models (decision tree and naive bayes) on testing data?

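#Total net benefit = TP * ($500 - $30) - FP * $30 - FN * $500, where TP, FP, and FN are the true positive, false positive, and false negative counts from each model's confusion matrix on the testing data. True negatives contribute $0.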
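The net benefit has to be computed from confusion-matrix counts rather than from the three dollar amounts alone. Below is a sketch under one reading of the stated assumptions: every customer predicted "yes" is marketed to at $30, every true positive earns $500, and every false negative loses $500. The function name `net_benefit` and the example counts are hypothetical.

```r
# Net benefit under one reading of the assumptions in part A:
#   true positive  (marketed, purchases):              +500 income - 30 marketing
#   false positive (marketed, no purchase):            -30 marketing
#   false negative (not marketed, would have bought):  -500 opportunity cost
#   true negative  (not marketed, no purchase):          0
net_benefit <- function(tp, fp, fn,
                        marketing_cost = 30,
                        income = 500,
                        opportunity_cost = 500) {
  tp * (income - marketing_cost) - fp * marketing_cost - fn * opportunity_cost
}

# Hypothetical counts for illustration only (not results from bank.csv)
example_benefit <- net_benefit(tp = 100, fp = 50, fn = 20)
print(example_benefit)
```

The actual answer for each model comes from substituting the TP, FP, and FN counts printed by `mmetric(..., metric="CONF")` on the testing set.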

# B. Create a cost matrix that sets the cost of classifying a purchaser of a term deposit as a non-purchaser to be 5 times the cost of the opposite (that is classifying a non-purchaser as a purchaser).

matrix_dimensions <- list(c("No", "Yes"),c("predict_No", "Predict_Yes"))

costMatrix<-matrix(c(0,5,1,0),nrow=2,dimnames = matrix_dimensions)

print(costMatrix)
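It is easy to put the 5 in the wrong cell, because `matrix()` fills values column by column. The sketch below rebuilds the same matrix and reads the two off-diagonal cells by name. It assumes, as the rpart documentation describes for `parms = list(loss = ...)`, that rows are the actual classes and columns are the predicted classes.

```r
matrix_dimensions <- list(c("No", "Yes"), c("predict_No", "Predict_Yes"))
costMatrix <- matrix(c(0, 5, 1, 0), nrow = 2, dimnames = matrix_dimensions)

# matrix() fills column by column, so the second value (5) lands in
# row "Yes", column "predict_No": a purchaser classified as a non-purchaser
cost_missed_purchaser <- costMatrix["Yes", "predict_No"]

# The third value (1) lands in row "No", column "Predict_Yes":
# a non-purchaser classified as a purchaser
cost_wasted_contact <- costMatrix["No", "Predict_Yes"]

print(costMatrix)
```

The ratio of the two cells is 5 to 1, which matches the requirement in part B.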

# C. Use this cost matrix to build a decision tree model with rpart package, and set cp = 0.0001, maxdepth = 7.

# And generate this model's confusion matrices and classification evaluation metrics in training and testing sets.

library(rpart)

library(rpart.plot)

library(rminer)

rpart_model2 <- rpart(y~.,data = datTrain,control = rpart.control(cp = 0.0001, maxdepth = 7), parms = list(loss = costMatrix))

rpart.plot(rpart_model2)

rpart_model2

prediction_on_train2 <- predict(rpart_model2, datTrain, type = "class")

prediction_on_test2 <- predict(rpart_model2, datTest, type = "class")

mmetric(datTrain$y,prediction_on_train2, metric="CONF")

mmetric(datTest$y,prediction_on_test2, metric="CONF")

mmetric(datTrain$y,prediction_on_train2,metric=c("ACC","PRECISION","TPR","F1"))

mmetric(datTest$y,prediction_on_test2,metric=c("ACC","PRECISION","TPR","F1"))

# D. Based on the costs / benefits, what is the total net benefit / cost of this new decision tree model on testing data?
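One way to approach part D is to cross-tabulate the actual and predicted classes on the testing set and apply the dollar amounts from part A to each cell. The sketch below uses made-up label vectors, so the resulting number is not an answer for bank.csv. It assumes that everyone predicted "yes" is marketed to, and that true negatives cost nothing.

```r
# Hypothetical actual and predicted labels (not results from bank.csv)
actual <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
                 levels = c("no", "yes"))
predicted <- factor(c("yes", "yes", "no", "yes", "no", "no", "no", "no"),
                    levels = c("no", "yes"))

# Rows are actual classes, columns are predicted classes
conf <- table(actual = actual, predicted = predicted)

tp <- conf["yes", "yes"]  # marketed and purchased
fp <- conf["no", "yes"]   # marketed, did not purchase
fn <- conf["yes", "no"]   # not marketed, would have purchased

# Part A amounts: $30 marketing cost, $500 income, $500 opportunity cost
total_net_benefit <- tp * (500 - 30) - fp * 30 - fn * 500
print(total_net_benefit)
```

With the script's own objects, `actual` would be `datTest$y` and `predicted` would be `prediction_on_test2`. Running the same calculation on the predictions of the tree trained without the cost matrix gives the two figures that part E asks to compare.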

# E. Compare decision tree models with or without cost matrix. If you were the manager, which model would you use, and why?

##### end-------------------------------------------------------------------------