


Posted on May 19, 2024


Instructions for Q2 (16 Points)

For Q2, you will use logistic regression to segment customers into two classes. This question is adapted from a Kaggle contest to evaluate current customers for an auto dealership that is opening a new location. The dealership wishes to reliably segment customers in order to streamline marketing efforts when their new location opens. The data for these questions is contained in the file, auto_customers_two_segments.csv, which includes the following fields:

ID : Unique ID created by dealership (Removed from data_Q2)
Gender : Gender of the customer
Ever_Married : Marital status of the customer
Age : Age of the customer
Graduated : Is the customer a college graduate?
Profession : Profession of the customer
Work_Experience : Work experience in years
Spending_Score : Spending score of the customer.
Family_Size : Number of family members (including customer)
Segmentation : The marketing segment the customer is in.

In[6]:

# Load necessary library and data
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
library(caret)

# load data from file
data_Q2 = read.csv('../resource/asnlib/publicdata/auto_customers_two_segments.csv')

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

A. Follow the steps below to clean the data set and answer the question. (2 Points)

Remove all rows with n/a values.
Remove all rows with data points that are blank (these are "" values that have a string length of 0). You can use dplyr or other methods to complete this step. (HINT: the cleaned sample size is 3400)
What are the customer segment labels (Segmentation column) and how many customers are in each segment?

In[7]:

# SOLUTION BEGINS HERE

# Remove n/a values

# Remove all rows with '' values

# SOLUTION ENDS HERE
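
One possible way to carry out the cleaning in part A is sketched below in base R. The data frame `toy` is a hypothetical stand-in with made-up values, not the real `data_Q2`; on the real file the same steps would be applied to `data_Q2`, and the hint says the result should have 3400 rows.

```r
# Hypothetical stand-in for data_Q2 (made-up values, not the real file)
toy <- data.frame(
  Gender       = c("Male", "Female", "", "Male", "Female"),
  Age          = c(34, NA, 51, 27, 45),
  Segmentation = c("A", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Step 1: drop every row that contains an NA
no_na <- na.omit(toy)

# Step 2: drop every row that contains an empty string in any column
is_blank <- apply(no_na == "", 1, any)
clean <- no_na[!is_blank, ]

nrow(clean)                # rows left after cleaning
table(clean$Segmentation)  # number of customers per segment label
```

A dplyr `filter()` call would be an equally reasonable choice; base R is used here only to keep the sketch dependency-free.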

B. Show a logistic regression model using Segmentation as the response variable and all other variables as the predictors. (2 Points)

Segment A represents the most desirable customers.
Segment B represents customers who have made purchases, but are not as likely to have a high or average spending score.
HINT: You need to specify Segment B as a reference level by setting Segment B to 0 and Segment A to 1.

In[8]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
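
A minimal sketch of part B follows, assuming simulated data in place of the cleaned data set. The data frame `toy`, its two predictors, and the coefficients used to simulate labels are all hypothetical. The point it illustrates is the factor ordering: when the response passed to `glm()` with `family = binomial` is a factor, the first level is treated as failure (0) and the other level as success (1), so listing "B" first makes Segment B the reference.

```r
set.seed(1)

# Simulated stand-in for the cleaned data (hypothetical values)
n <- 200
toy <- data.frame(
  Age    = round(runif(n, 18, 80)),
  Gender = sample(c("Female", "Male"), n, replace = TRUE)
)
p_true <- plogis(-2 + 0.04 * toy$Age)
toy$Segmentation <- ifelse(runif(n) < p_true, "A", "B")

# Put "B" first so it becomes the reference level (0) and "A" the event (1)
toy$Segmentation <- factor(toy$Segmentation, levels = c("B", "A"))

# Segmentation as response, all remaining columns as predictors
model <- glm(Segmentation ~ ., data = toy, family = binomial)
summary(model)
```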

C. Use the cleaned data set for both training and testing data and make predictions of the probability for each customer. (2 Points)

We are using the same data here for testing and training only for simplicity of the problem.
You need to display the first 5 values.
HINT: You may use the predict() function on the model using the data to create a prediction vector of probabilities.

In[9]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
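
Part C could be approached as in the sketch below, which again fits a model on simulated, hypothetical data so that it runs on its own. The detail worth noting is `type = "response"`: without it, `predict()` on a `glm` object returns values on the log-odds scale rather than probabilities.

```r
set.seed(1)

# Simulated stand-in for the cleaned data (hypothetical values)
n <- 200
toy <- data.frame(Age = round(runif(n, 18, 80)))
toy$Segmentation <- factor(
  ifelse(runif(n) < plogis(-2 + 0.04 * toy$Age), "A", "B"),
  levels = c("B", "A")
)

model <- glm(Segmentation ~ ., data = toy, family = binomial)

# Predicted probability of Segment A for every row of the training data
probs <- predict(model, newdata = toy, type = "response")
head(probs, 5)
```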

D. Calculate the accuracy of the model for each cutoff probability value. (2 Points)

Cutoff Probability Values : (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
Classify each predicted value as 1 (positive/success) or 0 (negative/failure) using the cutoff probability value as a boundary.
Compare predicted class to expected class for each data point to calculate accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Round to three decimal places (e.g., 0.456).
You will have 7 calculations for accuracy here. One for each cutoff probability value.
NOTE : The cutoff probability value is NOT the same as the p-value when evaluating the significance of coefficients in the logistic regression model. The cutoff probability value is only used to classify predictions.

In[10]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
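
The cutoff loop in part D can be sketched as follows. The vectors `probs` and `actual` are hypothetical stand-ins for the model's predicted probabilities and the true classes (1 = Segment A, 0 = Segment B). Since accuracy is the share of predictions that match the true class, `mean(predicted == actual)` is equivalent to (TP + TN) / (TP + TN + FP + FN).

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
probs  <- c(0.10, 0.22, 0.31, 0.38, 0.47, 0.52, 0.58, 0.66, 0.74, 0.91)
actual <- c(0,    0,    0,    1,    0,    1,    1,    0,    1,    1)

cutoffs <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)

# Classify with one cutoff and return accuracy rounded to three decimals
accuracy_at <- function(cutoff) {
  predicted <- as.integer(probs >= cutoff)
  round(mean(predicted == actual), 3)
}

accuracy <- sapply(cutoffs, accuracy_at)
data.frame(cutoff = cutoffs, accuracy = accuracy)
```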

E. Which cutoff probability value results in the highest accuracy for the model? What is the accuracy as a percentage? (2 Points)

Round to 1 decimal place (e.g., 80.5%).

In[11]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
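
For part E, one way to pick the best cutoff is `which.max()`, as sketched here. The accuracy values are hypothetical, not results from the real data. Note that `which.max()` returns the first maximum, so ties between cutoffs would need to be checked by eye.

```r
cutoffs  <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
accuracy <- c(0.6, 0.7, 0.7, 0.8, 0.6, 0.7, 0.6)  # hypothetical values

best        <- which.max(accuracy)  # index of the highest accuracy
best_cutoff <- cutoffs[best]

# Report the accuracy as a percentage with one decimal place
msg <- sprintf("Best cutoff: %.1f, accuracy: %.1f%%",
               best_cutoff, 100 * accuracy[best])
msg
```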

F. We have now created a model and found the cutoff probability value that produces the highest accuracy. Show a confusion matrix using predictions from the optimal cutoff probability value above. (2 Points)

HINT: You may use the caret package function, confusionMatrix() to automate this process. Make sure to confirm that your 'positive' class matches your 'positive' class for the model.

In[12]:

# SOLUTION BEGINS HERE

# Show a predicted probability vector for optimized p

# Show a results data frame with expected and predicted vectors

# Show confusion matrix

# SOLUTION ENDS HERE
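
The three steps in part F might look like the sketch below. It uses base R's `table()` so that it runs without extra packages, and the probabilities, true classes, and cutoff are hypothetical. Both factors are built with the same level order so that the rows and columns of the table line up.

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
probs  <- c(0.10, 0.22, 0.31, 0.38, 0.47, 0.52, 0.58, 0.66, 0.74, 0.91)
actual <- c(0,    0,    0,    1,    0,    1,    1,    0,    1,    1)
best_cutoff <- 0.5  # assumed optimum for this toy data

# Predicted class vector at the chosen cutoff
predicted <- as.integer(probs >= best_cutoff)

# Results data frame with expected and predicted classes as labelled factors
results <- data.frame(
  expected  = factor(actual,    levels = c(0, 1), labels = c("B", "A")),
  predicted = factor(predicted, levels = c(0, 1), labels = c("B", "A"))
)

# Confusion matrix: rows are predictions, columns are true segments
conf_mat <- table(Predicted = results$predicted, Expected = results$expected)
conf_mat
```

With caret loaded, `confusionMatrix(results$predicted, results$expected, positive = "A")` should report the same counts along with sensitivity and specificity; setting `positive = "A"` is what keeps the positive class consistent with the model.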

G. Report the Accuracy, Sensitivity and Specificity for the chosen model as percentages. (2 Points)

Round to 1 decimal place (e.g., 80.5%).

In[13]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
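
The metrics in part G follow directly from the four confusion-matrix counts, as in this sketch. The counts are hypothetical and Segment A is assumed to be the positive class. Sensitivity is the share of true Segment A customers that are predicted as A, and specificity is the share of true Segment B customers that are predicted as B.

```r
# Hypothetical confusion-matrix counts, with Segment A as the positive class
TP <- 4; FN <- 1; FP <- 1; TN <- 4

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # true A customers correctly labelled A
specificity <- TN / (TN + FP)  # true B customers correctly labelled B

# Format as percentages with one decimal place
metrics <- sprintf("%.1f%%", 100 * c(accuracy, sensitivity, specificity))
names(metrics) <- c("Accuracy", "Sensitivity", "Specificity")
metrics
```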

H. Evaluate your model. (2 Points)

You have shared your results with the auto dealership's leadership team.
They are concerned that too many customers in segment A are classified as segment B (Why is this a problem?).
How would you adjust your model to capture more of the misclassified segment A customers?
Are there any downsides to adjusting your model this way?
Short answer only; you DON'T need to produce any more models, although you may find it helpful to run different models to confirm your answer.

In[14]:

# SOLUTION BEGINS HERE

# SOLUTION ENDS HERE
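
One way to reason about part H: a Segment A customer classified as Segment B is a false negative, so capturing more of them means raising sensitivity, and the usual lever is a lower cutoff probability. The sketch below compares two cutoffs on hypothetical probabilities and classes (not the real data) to show the trade-off: sensitivity rises, but specificity falls because more Segment B customers get labelled A.

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
probs  <- c(0.10, 0.22, 0.31, 0.38, 0.47, 0.52, 0.58, 0.66, 0.74, 0.91)
actual <- c(0,    0,    0,    1,    0,    1,    1,    0,    1,    1)

# Sensitivity and specificity at a given cutoff
rates_at <- function(cutoff) {
  predicted <- as.integer(probs >= cutoff)
  c(sensitivity = mean(predicted[actual == 1] == 1),
    specificity = mean(predicted[actual == 0] == 0))
}

# Compare an accuracy-oriented cutoff with a lower one
rates <- sapply(c(`0.5` = 0.5, `0.3` = 0.3), rates_at)
rates
```

On this toy data the lower cutoff catches every Segment A customer but misclassifies more Segment B customers, which in practice would mean marketing effort spent on less desirable customers and possibly lower overall accuracy.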