Question

Posted on May 19, 2024

21.3 TAYKO SOFTWARE CATALOGER

Data Link - https://1drv.ms/x/s!Ao_duWhjG7s9hC2CouQBXkre5Gk2?e=fi1FOh

Develop a logistic regression model for classifying a customer as a purchaser or non-purchaser. Partition the data randomly into a training set (60%) and a validation set (40%). Run logistic regression with an L2 penalty, using the LogisticRegressionCV method. Please submit Python code.
Tell a high-level story of the steps taken to get to the end result. Start with the framework, i.e., objective, exploration, and variable selection (PCA, correlation, etc.). Then provide the final results and a comparison analysis of the training vs. validation vs. test data.
Present your findings in PowerPoint format (no more than 5 slides) in terms of steps taken and results.

Things you can add:

Show the shape of the df
Show some records of the df
List data types of the variables in the df
Preliminary exploration - view the data: rename all columns, replacing spaces with underscores
Look at descriptive statistics
Count missing values in each variable
Remove certain variables from the onset (i.e., spending and sequence number)
Count number of unique values in each variable
Create dummy variables if needed
Some visualizations to explore the data
- Histograms, Frequency Distribution, side by side plots with the outcome, scatterplot, pairplot
- Other plots according to your discretion
Correlation table & Heatmap: Comment on high correlations
Conduct a PCA: Discuss how many PCs to use
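
The exploration steps listed above could be sketched roughly as follows. This is only a minimal illustration, not a solution to the assignment. It runs on a small synthetic stand-in for the Tayko file (the real spreadsheet sits behind the data link), and the column names used here are assumptions chosen to resemble the description. The 90% cumulative-variance cutoff for choosing the number of PCs is likewise just one common rule of thumb.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Tayko data; all column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sequence number": np.arange(1, n + 1),
    "Freq": rng.poisson(2, n),
    "last update days ago": rng.integers(1, 4000, n),
    "Web order": rng.integers(0, 2, n),
    "Purchase": rng.integers(0, 2, n),
})
df["Spending"] = df["Purchase"] * rng.gamma(2.0, 50.0, n)

# Shape, a few records, and data types.
print(df.shape)
print(df.head())
print(df.dtypes)

# Rename columns: replace spaces with underscores.
df.columns = [c.strip().replace(" ", "_") for c in df.columns]

# Descriptive statistics, missing-value counts, unique-value counts.
print(df.describe())
print(df.isna().sum())
print(df.nunique())

# Remove the variables excluded from the onset.
df = df.drop(columns=["sequence_number", "Spending"])

# Dummy-code any categorical columns (a no-op here: all columns are numeric).
df = pd.get_dummies(df, drop_first=True)

# Correlation table; high absolute values flag redundant predictors.
corr = df.corr()
print(corr.round(2))

# PCA on standardized predictors; inspect cumulative explained variance.
X = df.drop(columns=["Purchase"])
pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching 90% of the variance (assumed cutoff).
n_pcs = int(np.searchsorted(cum_var, 0.90) + 1)
print(cum_var.round(3), n_pcs)
```

The `corr` table can be passed to a plotting routine such as `seaborn.heatmap` for the heatmap, and `df.hist()` or `seaborn.pairplot` would cover the histogram and pairplot items, if those libraries are available.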

The Logistic Regression:

(don't incorporate PCA or variable reduction through correlations here, run the logistic regression on all variables apart from spending and sequence_number)
Partition the whole data set randomly into a training set (60%) and a validation set (40%)
Run quick descriptive stats for validation and training dataset
Fit a logistic regression (set penalty='l2' and C=1e42 to effectively disable regularization), then predict on the validation dataset with the fitted model
Develop gains and lift chart for test and validation results
Confusion matrix for all sets
Show some use of statsmodels if possible
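
The modelling steps above might look something like the sketch below. It is a hedged illustration on synthetic data: the predictor names and the simulated purchase probabilities are assumptions, not the real Tayko values. `gains_and_lift` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic predictors and outcome; names and coefficients are assumptions.
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "Freq": rng.poisson(2, n),
    "Web_order": rng.integers(0, 2, n),
    "Address_is_res": rng.integers(0, 2, n),
})
logit = -1.5 + 0.8 * X["Freq"] + 0.5 * X["Web_order"] - 0.4 * X["Address_is_res"]
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-logit))), name="Purchase")

# Random 60% / 40% partition into training and validation sets.
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Quick descriptive stats for both partitions.
print(train_X.describe())
print(valid_X.describe())

# L2 penalty with a very large C, so regularization is effectively off.
model = LogisticRegression(penalty="l2", C=1e42, solver="liblinear")
model.fit(train_X, train_y)

# Confusion matrices and accuracy for each partition.
cm_train = confusion_matrix(train_y, model.predict(train_X))
cm_valid = confusion_matrix(valid_y, model.predict(valid_X))
print(cm_train, accuracy_score(train_y, model.predict(train_X)))
print(cm_valid, accuracy_score(valid_y, model.predict(valid_X)))


def gains_and_lift(y_true, proba, bins=10):
    """Hypothetical helper: cumulative gains and per-bin lift.

    Records are sorted by predicted probability, highest first.
    """
    order = np.argsort(-proba)
    sorted_y = np.asarray(y_true)[order]
    cum_gains = np.cumsum(sorted_y)  # purchasers captured so far
    groups = np.array_split(sorted_y, bins)
    lift = np.array([g.mean() for g in groups]) / sorted_y.mean()
    return cum_gains, lift


valid_proba = model.predict_proba(valid_X)[:, 1]
cum_gains, lift = gains_and_lift(valid_y, valid_proba)
print(lift.round(2))
```

Plotting `cum_gains` against the record count gives the gains chart, and a bar chart of `lift` gives the decile lift chart. `LogisticRegressionCV` from the same scikit-learn module can be swapped in when a cross-validated choice of C is wanted, and `Logit` from statsmodels provides a coefficient summary table for the same kind of model.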