Question
Background and Data Dictionary
In this lab assignment, you will analyze data provided on Canvas under the file name "charitydata.xls." A charitable organization has enlisted your expertise to understand patterns and forecast potential donations from their recent marketing initiative. This dataset encompasses information on 8009 prospective donors, detailing twenty predictor variables alongside one outcome variable. The response variable, damt, represents the donation amount in dollars. Your task is to utilize this data to assist the charity in predicting the donation amounts from individuals.
Data Description
- ID: Donor identifier
- reg1: Donor belongs to region 1
- reg2: Donor belongs to region 2
- reg3: Donor belongs to region 3
- reg4: Donor belongs to region 4
- home: 1= homeowner, 0=not a homeowner
- kids: number of children
- hinc: Household income with 7 categories, 1= lowest income category, 7=Highest income category
- genf: Gender (0=Male, 1=Female)
- wrat: Wealth rating with 9 being the highest and 0 being the lowest
- avhv: Average home value in donor's neighborhood in 1,000 USDs.
- incm: Median family income in donor's neighborhood in 1,000 USDs.
- inca: Average family income in donor's neighborhood in 1,000 USDs.
- plow: Percentage categorized as low income in donor's neighborhood
- npro: Lifetime number of promotions received to date
- tgif: Dollar amount of lifetime gifts to date
- lgif: Dollar amount of largest gift to date
- rgif: Dollar amount of most recent gift
- tdon: Number of months since last donation
- tlag: Number of months between first and second gift
- agif: Average dollar amount of gifts to date
- damt: Dollar amount of donation in 1,000 USDs (Target variable)
- Validation: Training= training data, Validation= Validation data (data identifier)
Task 1: Data Preprocessing and Inspection
1A: Verify the Data Types. Implement Python code to ensure the accuracy of each variable's data type. Within the Word document for Task 1, section 1A, document your process (e.g., reg1 was originally an int64, which I converted to a categorical type; Validation was noted as an object, and I altered its type to categorical).
1B: Calculate the percentage of missing values in the charitydata dataset and identify only those variables that have missing data, along with their respective percentages of missingness. Document these findings in the Word document under Task 1, section 1B (e.g., genf has 4% missingness and home has 44% missingness).
1C: Generate a new dataframe named traindata from the charitydata dataset, filtering for rows where Validation equals "Training". Subsequently, remove the ID and Validation columns from traindata. Similarly, create another dataframe named testdata from charitydata, selecting rows where Validation equals "Validation", and then eliminate the ID and Validation columns from testdata. Determine the total count of rows and columns present in both traindata and testdata, and document these figures within the Word document under Task 1, section 1C (e.g., traindata has 3456 rows and 99 columns, testdata has 3245 rows and 99 columns).
1D: Separate both 'traindata' and 'testdata' into their respective predictors and target variables, labeling them as 'X_train' and 'y_train' for the training data, and 'X_test' and 'y_test' for the test data. Determine the total count of rows and columns present in both 'X_train' and 'X_test', and document these figures within the Word document under Task 1, section 1D (e.g., X_train has 3456 rows and 99 columns, X_test has 3245 rows and 99 columns).
1E: Next, utilize Scikit-learn to preprocess the data in the following manner: For both 'X_train' and 'X_test', fill in missing values using the median for numerical variables and the most common value for categorical variables. Following this, implement one-hot encoding to transform the categorical variables. Please use OneHotEncoder(drop='first') to drop the first category from each categorical feature. Save the preprocessed data as X_train_processed and X_test_processed. Determine the total count of rows and columns present in both 'X_train_processed' and 'X_test_processed', and document these figures within the Word document under Task 1, section 1E (e.g., X_train_processed has 3456 rows and 99 columns, X_test_processed has 3245 rows and 99 columns).
Place all the code related to Task 1 within the following code block:
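The Task 1 steps above can be sketched as follows. This is a minimal sketch, not the full solution: a tiny synthetic frame with a few hypothetical columns stands in for charitydata.xls so the pipeline runs end to end; with the real file you would instead start from `pd.read_excel("charitydata.xls")` and cast/impute all twenty predictors.

```python
# Minimal Task 1 sketch: type fixes, missingness report, train/test split,
# and a median/mode imputation + one-hot encoding pipeline.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# charitydata = pd.read_excel("charitydata.xls")   # real data load
charitydata = pd.DataFrame({                        # stand-in rows (hypothetical values)
    "ID": [1, 2, 3, 4],
    "home": [1.0, 0.0, 1.0, np.nan],                # categorical, one missing value
    "kids": [1, 2, 0, 3],
    "avhv": [302.0, np.nan, 295.0, 114.0],          # numeric, one missing value
    "damt": [0.0, 15.0, 17.0, 12.0],
    "Validation": ["Training", "Training", "Validation", "Training"],
})

# 1A: cast flag-like columns to categorical
for col in ["home", "Validation"]:
    charitydata[col] = charitydata[col].astype("category")

# 1B: percentage of missing values, only for columns that have any
miss = charitydata.isna().mean() * 100
print(miss[miss > 0])

# 1C: split on the Validation flag, then drop ID and Validation
traindata = charitydata[charitydata["Validation"] == "Training"].drop(columns=["ID", "Validation"])
testdata = charitydata[charitydata["Validation"] == "Validation"].drop(columns=["ID", "Validation"])

# 1D: predictors vs. target
X_train, y_train = traindata.drop(columns=["damt"]), traindata["damt"]
X_test, y_test = testdata.drop(columns=["damt"]), testdata["damt"]

# 1E: median-impute numerics; mode-impute then one-hot encode categoricals
num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.select_dtypes(include="category").columns
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(drop="first"))]), cat_cols),
])
X_train_processed = pre.fit_transform(X_train)   # fit on training data only
X_test_processed = pre.transform(X_test)         # reuse training statistics
print(X_train_processed.shape, X_test_processed.shape)
```

Fitting the preprocessor on `X_train` and only transforming `X_test` keeps the test data out of the imputation and encoding statistics, which is the point of the pipeline design.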
Task 2
Employ the X_train_processed dataset to identify the optimal linear regression model by applying two distinct methods: forward stepwise selection and backward stepwise selection, each evaluated using Mallows' Cp with 5-fold cross-validation. Students can use the SequentialFeatureSelector module from the mlxtend package to implement subset selection, as shown in the Module 4 practice lab. Set the cross-validation parameter to n_splits=5 and use random_state=5410 to have comparable results.
2A: Within the Word document under Task 2, section 2A, list the features that are selected based on forward selection method.
2B: Within the Word document under Task 2, section 2B, list the features that are selected based on the backward selection method.
2C: Discuss the implications for the bias-variance trade-off among the model generated through forward/backward selection methods and the full model with all potential features. Specifically, contemplate whether there's a reduction or an increase in bias and variance. Provide your answer in a brief paragraph under Task 2, section 2C in your Word Document template.
2D: Estimate the expected test error for both the forward stepwise and backward stepwise regression models previously determined in Task 2. It's important to note that this estimation must be derived exclusively from the X_train_processed dataset. Document your findings within Task 2, section 2D in your Word Document template.
Place all the code related to Task 2 within the following code block:
Task 3
In this task, we will explore whether regularization can enhance our model's performance.
Adjust the X_train_processed and X_test_processed datasets by incorporating standardization for the numerical variables within your pipeline, utilizing the StandardScaler() from Scikit-learn. Label the resulting datasets as X_train_scaled_processed and X_test_scaled_processed, respectively.
3A: Fit a ridge regression model on X_train_scaled_processed using Scikit-learn's RidgeCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 600 points, distributed logarithmically between 10^-1 and 10^1 (i.e., lambdas=np.logspace(-1, 1, 600)). Enter the optimal lambda under Task 3, section 3A in your Word Document template.
3B: Fit a lasso regression model on X_train_scaled_processed using Scikit-learn's LassoCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 1000 points, distributed logarithmically between 10^-4 and 10^4 (i.e., lambdas=np.logspace(-4, 4, 1000)). Use random_state=5410 to get comparable results. Enter the optimal lambda under Task 3, section 3B in your Word Document template.
3C: Identify the features selected via Lasso regression by listing the variables that have non-zero coefficient estimates from the trained model. Document these variables in your Word Document template under Task 3, section 3C.
3D: Fit an ElasticNet regression model on X_train_scaled_processed using Scikit-learn's ElasticNetCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 600 points, distributed logarithmically between 10^-1 and 10^1 (i.e., lambdas=np.logspace(-1, 1, 600)). Also, explore a finer mix between L1 and L2 regularization by using the following range: l1_ratios = np.linspace(0.1, 0.9, 9). Enter the optimal lambda and optimal l1_ratio under Task 3, section 3D in your Word Document template.
Place all the code related to Task 3 within the following code block:
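The regularized fits in Task 3 can be sketched as follows, using the alpha grids the assignment specifies. A synthetic dataset stands in for X_train_scaled_processed (with the real data, the StandardScaler step would sit inside the ColumnTransformer so only numerical columns are scaled); note that scikit-learn calls the penalty strength `alpha`, which is the assignment's lambda.

```python
# RidgeCV / LassoCV / ElasticNetCV with 5-fold CV and the assignment's grids.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled, processed training data
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=5410)
X = StandardScaler().fit_transform(X)            # standardize the features
cv = KFold(n_splits=5, shuffle=True, random_state=5410)

# 3A: ridge over 600 log-spaced lambdas in [10^-1, 10^1]
ridge = RidgeCV(alphas=np.logspace(-1, 1, 600), cv=cv).fit(X, y)
print("ridge lambda:", ridge.alpha_)

# 3B: lasso over 1000 log-spaced lambdas in [10^-4, 10^4]
lasso = LassoCV(alphas=np.logspace(-4, 4, 1000), cv=cv,
                random_state=5410).fit(X, y)
print("lasso lambda:", lasso.alpha_)

# 3C: features the lasso keeps = non-zero coefficients
kept = np.flatnonzero(lasso.coef_)
print("lasso keeps feature indices:", kept)

# 3D: elastic net over the ridge grid crossed with nine l1 ratios
enet = ElasticNetCV(alphas=np.logspace(-1, 1, 600),
                    l1_ratio=np.linspace(0.1, 0.9, 9),
                    cv=cv, random_state=5410).fit(X, y)
print("enet lambda:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
```

The fitted `alpha_` (and `l1_ratio_` for ElasticNetCV) attributes are the optimal values to report in sections 3A, 3B, and 3D; the non-zero entries of `lasso.coef_`, mapped back to column names, answer 3C.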
Task 4
4A: Now it's time to assess our models. As a benchmark, take the full linear regression model, which takes all predictors as input, trained on the X_train_processed dataset. Then calculate the Mean Squared Prediction Error (MSPE) on the test data and enter your finding under Task 4, section 4A.
4B: Calculate the Mean Squared Prediction Error (MSPE) on the test data for the model identified through forward selection in Task 2. Document your findings under Task 4, section 4B.
4C: Calculate the Mean Squared Prediction Error (MSPE) on the test data for the model identified through backward selection in Task 2. Document your findings under Task 4, section 4C.
4D: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda value obtained from Ridge Regression in Task 3. Record your analysis in Task 4, section 4D.
4F: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda value obtained from Lasso Regression in Task 3. Record your analysis in Task 4, section 4F.
4G: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda and l1_ratio values obtained from ElasticNet Regression in Task 3. Record your analysis in Task 4, section 4G.
4H: Evaluate which model performed best on the test dataset. Are the differences in MSPEs significant? As a data scientist aiming to maximize donations for the charity, decide which of the models explored in this lab you would recommend for production use. Provide detailed justification for your choice to fully address this question.
Place all the code related to Task 4 within the following code block:
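The MSPE computation that every Task 4 section repeats can be sketched once, as below. A synthetic train/test split stands in for the processed charity datasets; for 4B through 4G you would substitute the predictions of each fitted model from Tasks 2 and 3 in place of the full linear model shown here.

```python
# MSPE (mean squared prediction error) of a fitted model on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed train/test split
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=5410)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5410)

# 4A benchmark: full linear regression on all predictors
full = LinearRegression().fit(X_train, y_train)
mspe = mean_squared_error(y_test, full.predict(X_test))
print("full-model MSPE:", mspe)
# 4B-4G: repeat mean_squared_error(y_test, model.predict(X_test))
# for each stepwise / ridge / lasso / elastic-net model.
```

Comparing these MSPE values side by side, together with a judgment on whether the gaps are practically meaningful, is what 4H asks for.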
ID | reg1 | reg2 | reg3 | reg4 | home | kids | hinc | genf | wrat | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag | agif | donr | damt | Validation |
---|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------------|
1 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 8 | 302 | 76 | 82 | 0 | 20 | 81 | 81 | 19 | 17 | 6 | 21.05 | 0 | 0 | Training |
2 | 0 | 0 | 1 | 0 | 1 | 2 | 4 | 0 | 8 | 262 | 130 | 130 | 1 | 95 | 156 | 16 | 17 | 19 | 3 | 13.26 | 1 | 15 | Training |
5 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 4 | 295 | 39 | 71 | 14 | 85 | 132 | 15 | 10 | 10 | 6 | 12.07 | 1 | 17 | Validation |
6 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 0 | 9 | 114 | 17 | 25 | 44 | 83 | 131 | 5 | 3 | 13 | 4 | 4.12 | 1 | 12 | Training |
7 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 0 | 8 | 145 | 39 | 42 | 10 | 50 | 74 | 6 | 5 | 22 | 3 | 6.5 | 0 | 0 | Training |
8 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 0 | 5 | 165 | 34 | 35 | 19 | 11 | 41 | 4 | 2 | 20 | 7 | 3.45 | 0 | 0 | Training |
10 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 1 | 7 | 200 | 38 | 58 | 5 | 42 | 63 | 12 | 10 | 19 | 3 | 9.42 | 0 | 0 | Training |
11 | 0 | 0 | 1 | 0 | 1 | 3 | 2 | 1 | 8 | 152 | 46 | 46 | 20 | 100 | 414 | 25 | 14 | 39 | 7 | 10.12 | 0 | 0 | Training |
12 | 0 | 0 | 0 | 1 | 1 | 3 | 4 | 1 | 6 | 272 | 69 | 69 | 0 | 98 | 169 | 29 | 36 | 23 | 7 | 8.97 | 1 | 17 | Training |
13 | 0 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 9 | 207 | 54 | 54 | 14 | 13 | 34 | 9 | 7 | 19 | 11 | 6.28 | 1 | 12 | Training |
14 | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 0 | 8 | 21 | 36 | 32 | 54 | 117 | 5 | 4 | 15 | 9 | 5.11 | 1 | 15 | Validation |
15 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 1 | 8 | 196 | 57 | 82 | 1 | 16 | 71 | 36 | 20 | 22 | 8 | 13.44 | 1 | 18 | Training |
17 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 9 | 138 | 19 | 41 | 25 | 92 | 201 | 12 | 10 | 8 | 7 | 11.15 | 1 | 13 | Training |
18 | 0 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 9 | 200 | 44 | 48 | 5 | 58 | 61 | 23 | 13 | 15 | 4 | 12.84 | 1 | 14 | Validation |
19 | 1 | 0 | 0 | 0 | 1 | 0 | 4 | 1 | 1 | 278 | 79 | 83 | 0 | 87 | 89 | 29 | 16 | 16 | 6 | 8.14 | 1 | 12 | Training |
21 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 9 | 158 | 26 | 33 | 16 | 78 | 117 | 21 | 6 | 19 | 8 | 8.68 | 1 | 15 | Training |
24 | 0 | 0 | 0 | 1 | 1 | 4 | 6 | 1 | 5 | 141 | 32 | 37 | 21 | 65 | 79 | 5 | 5 | 21 | 6 | 3.69 | 0 | 0 | Training |
25 | 0 | 1 | 0 | 0 | 1 | 2 | 4 | 0 | 6 | 142 | 19 | 40 | 21 | 39 | 94 | 12 | 13 | 19 | 3 | 6.94 | 1 | 14 | Training |
26 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 | 1 | 224 | 55 | 66 | 6 | 134 | 437 | 14 | 14 | 19 | 8 | 9.37 | 0 | 0 | Training |
30 | 1 | 0 | 0 | 0 | 1 | 1 | 5 | 1 | 9 | 282 | 69 | 87 | 0 | 62 | 70 | 16 | 11 | 22 | 10 | 11.61 | 1 | 18 | Training |
32 | 1 | 0 | 0 | 0 | 1 | 2 | 4 | 1 | 6 | 213 | 70 | 70 | 6 | 59 | 104 | 17 | 11 | 20 | 9 | 12.84 | 1 | 15 | Validation |
33 | 0 | 0 | 1 | 0 | 0 | 1 | 4 | 0 | 2 | 128 | 29 | 35 | 15 | 69 | 78 | 60 | 47 | 18 | 4 | 23.81 | 0 | 0 | Training |
34 | 0 | 1 | 0 | 0 | 1 | 2 | 5 | 0 | 8 | 263 | 40 | 67 | 4 | 76 | 105 | 24 | 21 | 13 | 10 | 19.43 | 0 | 0 | Training |
35 | 1 | 0 | 0 | 0 | 1 | 2 | 7 | 0 | 7 | 119 | 23 | 57 | 19 | 39 | 74 | 12 | 13 | 18 | 10 | 11.77 | 0 | 0 | Training |
38 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 9 | 204 | 79 | 79 | 2 | 59 | 94 | 9 | 8 | 13 | 3 | 5.2 | 1 | 13 | Training |
39 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 9 | 61 | 4 | 18 | 73 | 25 | 69 | 15 | 9 | 14 | 10 | 6.85 | 0 | 0 | Training |
40 | 0 | 1 | 0 | 0 | 1 | 2 | 3 | 1 | 2 | 93 | 29 | 31 | 18 | 94 | 136 | 9 | 4 | 38 | 7 | 5.19 | 0 | 0 | Training |
42 | 1 | 0 | 0 | 0 | 1 | 3 | 5 | 0 | 9 | 174 | 40 | 47 | 11 | 68 | 99 | 16 | 8 | 17 | 10 | 10.5 | 0 | 0 | Training |
43 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 7 | 179 | 59 | 59 | 3 | 73 | 146 | 12 | 4 | 17 | 6 | 10.44 | 1 | 15 | Training |
44 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 6 | 284 | 43 | 82 | 4 | 14 | 40 | 14 | 7 | 12 | 5 | 9.8 | 1 | 15 | Training |
45 | 0 | 0 | 1 | 0 | 0 | 3 | 5 | 1 | 0 | 209 | 66 | 66 | 6 | 62 | 149 | 68 | 20 | 15 | 7 | 23.31 | 0 | 0 | Training |
46 | 1 | 0 | 0 | 0 | 1 | 2 | 4 | 1 | 3 | 163 | 36 | 44 | 9 | 50 | 74 | 9 | 5 | 16 | 11 | 6.3 | 0 | 0 | Training |
48 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 8 | 124 | 20 | 43 | 26 | 43 | 75 | 17 | 13 | 28 | 6 | 14.51 | 0 | 0 | Training |
51 | 0 | 0 | 0 | 0 | 1 | 0 | 7 | 0 | 8 | 324 | 75 | 78 | 0 | 90 | 177 | 6 | 7 | 21 | 5 | 8.04 | 1 | 14 | Training |
52 | 0 | 1 | 0 | 0 | 1 | 2 | 3 | 0 | 9 | 167 | 73 | 73 | 2 | 34 | 63 | 10 | 6 | 20 | 5 | 8.77 | 1 | 11 | Training |
53 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 1 | 9 | 94 | 30 | 30 | 28 | 72 | 133 | 52 | 41 | 16 | 6 | 14.39 | 1 | 14 | Training |
54 | 0 | 1 | 0 | 0 | 1 | 4 | 4 | 2 | 9 | 231 | 55 | 58 | 3 | 66 | 112 | 20 | 16 | 15 | 6 | 19.18 | 1 | 13 | Training |
55 | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 1 | 8 | 158 | 38 | 38 | 22 | 55 | 80 | 20 | 20 | 14 | 3 | 10.22 | 1 | 16 | Training |
56 | 0 | 0 | 0 | 0 | 1 | 2 | 5 | 1 | 6 | 427 | 78 | 114 | 0 | 78 | 131 | 9 | 4 | 24 | 6 | 6.13 | 1 | 13 | Training |
57 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 1 | 7 | 124 | 41 | 41 | 12 | 24 | 56 | 53 | 56 | 13 | 4 | 22.5 | 1 | 17 | Training |
58 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 1 | 8 | 172 | 34 | 55 | 15 | 62 | 73 | 24 | 21 | 19 | 7 | 12.32 | 0 | 0 | Validation |
Can you give me Python code, with descriptions of the answers, for Task 1, Task 2, Task 3, and Task 4 (a, b, c, d, e, f)?