Question
Instructions for Q2 (16 Points)
For Q2, you will use logistic regression to segment customers into two classes. This question is adapted from a Kaggle contest to evaluate current customers for an auto dealership that is opening a new location. The dealership wishes to reliably segment customers in order to streamline marketing efforts when their new location opens. The data for these questions is contained in the file, auto_customers_two_segments.csv, which includes the following fields:
- ID : Unique ID created by dealership (Removed from data_Q2)
- Gender : Gender of the customer
- Ever_Married : Marital status of the customer
- Age : Age of the customer
- Graduated : Is the customer a college graduate?
- Profession : Profession of the customer
- Work_Experience : Work experience in years
- Spending_Score : Spending score of the customer
- Family_Size : Number of family members (including customer)
- Segmentation : The marketing segment the customer is in.
In[6]:
# Load necessary library and data
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
library(caret)
# Load data from file
data_Q2 = read.csv('../resource/asnlib/publicdata/auto_customers_two_segments.csv')
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
    lift
A. Follow the steps below to clean the data set and answer the question. (2 Points)
- Remove all rows with n/a values.
- Remove all rows with data points that are blank (these are "" values that have a string length of 0). You can use dplyr or other methods to complete this step. (HINT: the cleaned sample size is 3400)
- What are the customer segment labels (Segmentation column) and how many customers are in each segment?
In[7]:
# SOLUTION BEGINS HERE
# Remove n/a values
# Remove all rows with '' values
# SOLUTION ENDS HERE
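The cleaning steps in part A can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame named `toy` (hypothetical values, not the contents of auto_customers_two_segments.csv); the same pattern should apply to `data_Q2`, though the counts will of course differ.

```r
# Toy stand-in for data_Q2 (hypothetical values, not the real file)
toy <- data.frame(
  Gender       = c("Male", "Female", "", "Male", "Female"),
  Age          = c(34, NA, 51, 28, 45),
  Segmentation = c("A", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Step 1: drop every row that contains at least one NA
no_na <- na.omit(toy)

# Step 2: drop every row in which any field is a zero-length string.
# (no_na == "") is a logical matrix; rowSums() counts the blanks per row.
cleaned <- no_na[rowSums(no_na == "") == 0, ]

# Step 3: segment labels and the number of customers in each
nrow(cleaned)
table(cleaned$Segmentation)
```

Running `na.omit()` first matters here: with NAs still present, the comparison against `""` would itself return NA for those cells.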
B. Show a logistic regression model using Segmentation as the response variable and all other variables as the predictors. (2 Points)
- Segment A represents the most desirable customers
- Segment B represents customers who have made purchases, but are not as likely to have a high or average spending score.
- HINT: You need to specify Segment B as a reference level by setting Segment B to 0 and Segment A to 1.
In[8]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
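One way to set the reference level described in the hint is to order the factor levels so that B comes first. The sketch below uses simulated data (the data frame `toy`, its two predictors, and the simulated relationship are all assumptions made for illustration). For a factor response in a binomial `glm()`, the first level is treated as failure (0) and the remaining level as success (1).

```r
# Simulated stand-in for the cleaned data (hypothetical, for illustration only)
set.seed(1)
n <- 200
toy <- data.frame(
  Age            = round(runif(n, 18, 80)),
  Spending_Score = sample(c("Low", "Average", "High"), n, replace = TRUE)
)
# Assumed relationship: older customers are more often in segment A
p_A <- plogis(-2 + 0.04 * toy$Age)
toy$Segmentation <- ifelse(rbinom(n, 1, p_A) == 1, "A", "B")

# Put B first so that B = 0 (reference) and A = 1 (modelled outcome)
toy$Segmentation <- factor(toy$Segmentation, levels = c("B", "A"))

# "~ ." uses every other column in the data frame as a predictor
model <- glm(Segmentation ~ ., data = toy, family = binomial)
summary(model)
```

An equivalent alternative is to recode the column to numeric 0/1 before fitting; the factor approach keeps the labels readable in later output.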
C. Use the cleaned data set for both training and testing data and make predictions of the probability for each customer. (2 Points)
- We are using the same data here for testing and training only for simplicity of the problem.
- You need to display the first 5 values.
- HINT: You may use the predict() function on the model using the data to create a prediction vector of probabilities.
In[9]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
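A minimal sketch of the prediction step, again on simulated data (the data frame `toy` and the single predictor are assumptions). The key detail is `type = "response"`: without it, `predict()` on a `glm` returns values on the log-odds scale rather than probabilities.

```r
# Simulated stand-in data (hypothetical); Segment is coded 1 = A, 0 = B
set.seed(42)
toy <- data.frame(Age = round(runif(100, 18, 80)))
toy$Segment <- rbinom(100, 1, plogis(-2 + 0.04 * toy$Age))

model <- glm(Segment ~ Age, data = toy, family = binomial)

# Predict on the same data used for fitting, as the question specifies
prob <- predict(model, newdata = toy, type = "response")

# First 5 predicted probabilities of being in segment A
head(prob, 5)
```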
D. Calculate the accuracy of the model for each cutoff probability value. (2 Points)
- Cutoff Probability Values : (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
- Classify each predicted value as 1 (positive/success) or 0 (negative/failure) using the cutoff probability value as a boundary.
- Compare predicted class to expected class for each data point to calculate accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Round to three decimal places (e.g., 0.456).
- You will have 7 calculations for accuracy here. One for each cutoff probability value.
- NOTE : The cutoff probability value is NOT the same as the p-value when evaluating the significance of coefficients in the logistic regression model. The cutoff probability value is only used to classify predictions.
In[10]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
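The accuracy calculation can be sketched by looping over the cutoffs. The vectors `prob` and `actual` below are hypothetical hand-written values, and treating a probability equal to the cutoff as class 1 (`>=`) is an assumption, since the question does not say which side the boundary belongs to. Because every prediction is either correct (TP or TN) or incorrect (FP or FN), the share of matching labels equals (TP + TN) / (TP + TN + FP + FN).

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
prob   <- c(0.10, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90)
actual <- c(0,    0,    0,    1,    1,    0,    1,    1)

cutoffs <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)

accuracy <- sapply(cutoffs, function(cut) {
  predicted <- as.integer(prob >= cut)   # classify using the cutoff
  round(mean(predicted == actual), 3)    # share of correct predictions
})
names(accuracy) <- cutoffs
accuracy
```

With these toy values the seven accuracies are 0.625, 0.75, 0.875, 0.75, 0.625, 0.75 and 0.625.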
E. Which cutoff probability value results in the highest accuracy for the model? What is the accuracy as a percentage? (2 Points)
- Round to 1 decimal place (e.g., 80.5%).
In[11]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
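Picking the best cutoff from a named accuracy vector might look like the sketch below. The `accuracy` values are hypothetical placeholders, not results from the assignment data. Note that `which.max()` returns only the first maximum, so ties between cutoffs should be checked separately.

```r
# Hypothetical accuracies, named by cutoff probability value
accuracy <- c("0.2" = 0.625, "0.3" = 0.750, "0.4" = 0.875, "0.5" = 0.750)

best_cutoff   <- as.numeric(names(accuracy)[which.max(accuracy)])
best_accuracy <- sprintf("%.1f%%", 100 * max(accuracy))

best_cutoff
best_accuracy
```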
F. We have now created a model and found the cutoff probability value that produces the highest accuracy. Show a confusion matrix using predictions from the optimal cutoff probability value above. (2 Points)
- HINT: You may use the caret package function, confusionMatrix() to automate this process. Make sure to confirm that your 'positive' class matches your 'positive' class for the model.
In[12]:
# SOLUTION BEGINS HERE
# Show a predicted probability vector for optimized p
# Show a results data frame with expected and predicted vectors
# Show confusion matrix
# SOLUTION ENDS HERE
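As a cross-check on whatever caret reports, a confusion matrix can also be built with base R's `table()`. The sketch below uses hypothetical `prob` and `actual` vectors and an assumed cutoff of 0.4. When using caret's `confusionMatrix()` instead, both inputs need to be factors with the same levels, and its `positive` argument should name the level that stands for segment A.

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
prob   <- c(0.10, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90)
actual <- c(0,    0,    0,    1,    1,    0,    1,    1)
cutoff <- 0.4   # assumed optimal cutoff for this toy example

# Results data frame with expected and predicted classes
results <- data.frame(
  expected  = actual,
  predicted = as.integer(prob >= cutoff)
)

# Fix the levels on both sides so the table is always 2 x 2,
# even if one class is never predicted
cm <- table(
  Predicted = factor(results$predicted, levels = c(0, 1)),
  Expected  = factor(results$expected,  levels = c(0, 1))
)
cm
```

For these toy values the table holds 4 true positives, 3 true negatives, 1 false positive and 0 false negatives.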
G. Report the Accuracy, Sensitivity and Specificity for the chosen model as percentages. (2 Points)
- Round to 1 decimal place (e.g., 80.5%).
In[13]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
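The three metrics follow directly from the four cells of the confusion matrix. The counts below are hypothetical placeholders chosen for easy arithmetic. Segment A is assumed to be the positive class, so sensitivity is the share of true A customers that the model labels A, and specificity is the share of true B customers that it labels B.

```r
# Hypothetical confusion-matrix counts (positive class = segment A)
TP <- 40   # A predicted as A
FN <- 10   # A predicted as B
TN <- 35   # B predicted as B
FP <- 15   # B predicted as A

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

# Report as percentages rounded to 1 decimal place
report <- sprintf("%.1f%%", 100 * c(accuracy, sensitivity, specificity))
names(report) <- c("Accuracy", "Sensitivity", "Specificity")
report
```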
H. Evaluate your model. (2 Points)
- You have shared your results with the auto dealership's leadership team.
- They are concerned that too many customers in segment A are classified as segment B (Why is this a problem?).
- How would you adjust your model to capture more of the misclassified segment A customers?
- Are there any downsides to adjusting your model this way?
- Short answer only, you DON'T need to produce any more models. You may find it helpful to run different models to confirm your answer.
In[14]:
# SOLUTION BEGINS HERE
# SOLUTION ENDS HERE
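One way to explore the trade-off raised in part H is to compare sensitivity and specificity at two cutoffs. The sketch below uses hypothetical `prob` and `actual` vectors and a helper named `rates()` introduced only for this illustration. It shows the general pattern rather than results for the assignment data: lowering the cutoff labels more customers as segment A, which can recover missed A customers at the cost of mislabelling more B customers.

```r
# Hypothetical predicted probabilities and true classes (1 = A, 0 = B)
prob   <- c(0.10, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90)
actual <- c(0,    0,    1,    0,    1,    0,    1,    1)

# Helper (hypothetical name): sensitivity and specificity at a given cutoff
rates <- function(cut) {
  pred <- as.integer(prob >= cut)
  c(
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0)
  )
}

comparison <- rbind("cutoff 0.5" = rates(0.5), "cutoff 0.3" = rates(0.3))
comparison
```

In this toy example, moving the cutoff from 0.5 to 0.3 raises sensitivity from 0.75 to 1 while specificity falls from 0.75 to 0.5.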