Question
Background and Data Dictionary
In this lab assignment, you will analyze data provided on Canvas under the file name "charitydata.xls." A charitable organization has enlisted your expertise to understand patterns and forecast potential donations from their recent marketing initiative. This dataset encompasses information on 8009 prospective donors, detailing twenty predictor variables alongside one outcome variable. The response variable, damt, represents the donation amount in dollars. Your task is to utilize this data to assist the charity in predicting the donation amounts from individuals.
Data Description
- ID: Donor identifier
- reg1: Donor belongs to region 1
- reg2: Donor belongs to region 2
- reg3: Donor belongs to region 3
- reg4: Donor belongs to region 4
- home: 1= homeowner, 0=not a homeowner
- kids: number of children
- hinc: Household income with 7 categories, 1= lowest income category, 7=Highest income category
- genf: Gender (0=Male, 1=Female)
- wrat: Wealth rating with 9 being the highest and 0 being the lowest
- avhv: Average home value in donor's neighborhood in 1,000 USDs.
- incm: Median family income in donor's neighborhood in 1,000 USDs.
- inca: Average family income in donor's neighborhood in 1,000 USDs.
- plow: Percentage categorized as low income in donor's neighborhood
- npro: Lifetime number of promotions received to date
- tgif: Dollar amount of lifetime gifts to date
- lgif: Dollar amount of largest gift to date
- rgif: Dollar amount of most recent gift
- tdon: Number of months since last donation
- tlag: Number of months between first and second gift
- agif: Average dollar amount of gifts to date
- damt: Dollar amount of donation in 1,000 USDs (Target variable)
- Validation: Training= training data, Validation= Validation data (data identifier)
Task 1: Data Preprocessing and Inspection
1A: Verify the Data Types. Implement Python code to ensure the accuracy of each variable's data type. Within the Word document for Task 1, section 1A, document your process (e.g., reg1 was originally an int64, which I converted to a categorical type; Validation was noted as an object, and I altered its type to categorical).
1B: Calculate the percentage of missing values in the charitydata dataset and identify only those variables that have missing data, along with their respective percentages of missingness. Document these findings in the Word document under Task 1, section 1B (e.g., genf has 4% missingness and home has 44% missingness).
1C: Generate a new dataframe named traindata from the charitydata dataset, filtering for rows where Validation equals "Training". Subsequently, remove the ID and Validation columns from traindata. Similarly, create another dataframe named testdata from charitydata, selecting rows where Validation equals "Validation", and then eliminate the ID and Validation columns from testdata. Determine the total count of rows and columns present in both traindata and testdata, and document these figures within the Word document under Task 1, section 1C (e.g., traindata has 3456 rows and 99 columns, testdata has 3245 rows and 99 columns).
1D: Separate both 'traindata' and 'testdata' into their respective predictors and target variables, labeling them as 'X_train' and 'y_train' for the training data, and 'X_test' and 'y_test' for the test data. Determine the total count of rows and columns present in both 'X_train' and 'X_test', and document these figures within the Word document under Task 1, section 1D (e.g., X_train has 3456 rows and 99 columns, X_test has 3245 rows and 99 columns).
1E: Next, utilize Scikit-learn to preprocess the data in the following manner: For both 'X_train' and 'X_test', fill in missing values using the median for numerical variables and the most common value for categorical variables. Following this, implement one-hot encoding to transform the categorical variables. Please use OneHotEncoder(drop='first') to drop the first category from each categorical feature. Save the preprocessed data as X_train_processed and X_test_processed. Determine the total count of rows and columns present in both 'X_train_processed' and 'X_test_processed', and document these figures within the Word document under Task 1, section 1E (e.g., X_train_processed has 3456 rows and 99 columns, X_test_processed has 3245 rows and 99 columns).
Place all the code related to Task 1 within the following code block:
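The Task 1 steps above can be sketched as follows. This is a minimal sketch, not the full solution: a tiny synthetic frame with a few hypothetical columns stands in for charitydata.xls so the pipeline runs end to end; with the real file you would instead start from `pd.read_excel("charitydata.xls")` and cast/impute all twenty predictors.

```python
# Minimal Task 1 sketch: type fixes, missingness report, train/test split,
# and a median/mode imputation + one-hot encoding pipeline.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# charitydata = pd.read_excel("charitydata.xls")   # real data load
charitydata = pd.DataFrame({                        # stand-in rows (hypothetical values)
    "ID": [1, 2, 3, 4],
    "home": [1.0, 0.0, 1.0, np.nan],                # categorical, one missing value
    "kids": [1, 2, 0, 3],
    "avhv": [302.0, np.nan, 295.0, 114.0],          # numeric, one missing value
    "damt": [0.0, 15.0, 17.0, 12.0],
    "Validation": ["Training", "Training", "Validation", "Training"],
})

# 1A: cast flag-like columns to categorical
for col in ["home", "Validation"]:
    charitydata[col] = charitydata[col].astype("category")

# 1B: percentage of missing values, only for columns that have any
miss = charitydata.isna().mean() * 100
print(miss[miss > 0])

# 1C: split on the Validation flag, then drop ID and Validation
traindata = charitydata[charitydata["Validation"] == "Training"].drop(columns=["ID", "Validation"])
testdata = charitydata[charitydata["Validation"] == "Validation"].drop(columns=["ID", "Validation"])

# 1D: predictors vs. target
X_train, y_train = traindata.drop(columns=["damt"]), traindata["damt"]
X_test, y_test = testdata.drop(columns=["damt"]), testdata["damt"]

# 1E: median-impute numerics; mode-impute then one-hot encode categoricals
num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.select_dtypes(include="category").columns
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(drop="first"))]), cat_cols),
])
X_train_processed = pre.fit_transform(X_train)   # fit on training data only
X_test_processed = pre.transform(X_test)         # reuse training statistics
print(X_train_processed.shape, X_test_processed.shape)
```

Fitting the preprocessor on `X_train` and only transforming `X_test` keeps the test data out of the imputation and encoding statistics, which is the point of the pipeline design.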
Task 2
Employ the X_train_processed dataset to identify the optimal linear regression model by applying two distinct methods: forward stepwise selection and backward stepwise selection, each evaluated using Mallows' Cp with 5-fold cross-validation. Students can use the SequentialFeatureSelector module from the mlxtend package to implement subset selection, as shown in the Module 4 practice lab. Set the cross-validation parameter to n_splits=5 and use random_state=5410 to have comparable results.
2A: Within the Word document under Task 2, section 2A, list the features that are selected based on forward selection method.
2B: Within the Word document under Task 2, section 2B, list the features that are selected based on the backward selection method.
2C: Discuss the implications for the bias-variance trade-off among the model generated through forward/backward selection methods and the full model with all potential features. Specifically, contemplate whether there's a reduction or an increase in bias and variance. Provide your answer in a brief paragraph under Task 2, section 2C in your Word Document template.
2D: Estimate the expected test error for both the forward stepwise and backward stepwise regression models previously determined in Task 2. It's important to note that this estimation must be derived exclusively from the X_train_processed dataset. Document your findings within Task 2, section 2D in your Word Document template.
Place all the code related to Task 2 within the following code block:
Task 3
In this task, we will explore whether regularization can enhance our model's performance.
Adjust the X_train_processed and X_test_processed datasets by incorporating standardization for the numerical variables within your pipeline, utilizing the StandardScaler() from Scikit-learn. Label the resulting datasets as X_train_scaled_processed and X_test_scaled_processed, respectively.
3A: Fit a ridge regression model on X_train_scaled_processed using Scikit-learn's RidgeCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 600 points, distributed logarithmically between 10^-1 and 10^1 (i.e., lambdas=np.logspace(-1, 1, 600)). Enter the optimal lambda under Task 3, section 3A in your Word Document template.
3B: Fit a lasso regression model on X_train_scaled_processed using Scikit-learn's LassoCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 1000 points, distributed logarithmically between 10^-4 and 10^4 (i.e., lambdas=np.logspace(-4, 4, 1000)). Use random_state=5410 to get comparable results. Enter the optimal lambda under Task 3, section 3B in your Word Document template.
3C: Identify the features selected via Lasso regression by listing the variables that have non-zero coefficient estimates from the trained model. Document these variables in your Word Document template under Task 3, section 3C.
3D: Fit an ElasticNet regression model on X_train_scaled_processed using Scikit-learn's ElasticNetCV, selecting the parameter that minimizes the error from 5-fold cross-validation. Choose the values from a range spanning 600 points, distributed logarithmically between 10^-1 and 10^1 (i.e., lambdas=np.logspace(-1, 1, 600)). Also, explore a finer mix between L1 and L2 regularization by using the following range: l1_ratios = np.linspace(0.1, 0.9, 9). Enter the optimal lambda and optimal l1_ratio under Task 3, section 3D in your Word Document template.
Place all the code related to Task 3 within the following code block:
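The regularized fits in Task 3 can be sketched as follows, using the alpha grids the assignment specifies. A synthetic dataset stands in for X_train_scaled_processed (with the real data, the StandardScaler step would sit inside the ColumnTransformer so only numerical columns are scaled); note that scikit-learn calls the penalty strength `alpha`, which is the assignment's lambda.

```python
# RidgeCV / LassoCV / ElasticNetCV with 5-fold CV and the assignment's grids.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled, processed training data
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=5410)
X = StandardScaler().fit_transform(X)            # standardize the features
cv = KFold(n_splits=5, shuffle=True, random_state=5410)

# 3A: ridge over 600 log-spaced lambdas in [10^-1, 10^1]
ridge = RidgeCV(alphas=np.logspace(-1, 1, 600), cv=cv).fit(X, y)
print("ridge lambda:", ridge.alpha_)

# 3B: lasso over 1000 log-spaced lambdas in [10^-4, 10^4]
lasso = LassoCV(alphas=np.logspace(-4, 4, 1000), cv=cv,
                random_state=5410).fit(X, y)
print("lasso lambda:", lasso.alpha_)

# 3C: features the lasso keeps = non-zero coefficients
kept = np.flatnonzero(lasso.coef_)
print("lasso keeps feature indices:", kept)

# 3D: elastic net over the ridge grid crossed with nine l1 ratios
enet = ElasticNetCV(alphas=np.logspace(-1, 1, 600),
                    l1_ratio=np.linspace(0.1, 0.9, 9),
                    cv=cv, random_state=5410).fit(X, y)
print("enet lambda:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
```

The fitted `alpha_` (and `l1_ratio_` for ElasticNetCV) attributes are the optimal values to report in sections 3A, 3B, and 3D; the non-zero entries of `lasso.coef_`, mapped back to column names, answer 3C.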
Task 4
4A: Now it's time to assess our models. As a benchmark, take the full linear regression model, which takes all predictors as input, trained on the X_train_processed dataset. Then calculate the Mean Squared Prediction Error (MSPE) on the test data and enter your finding under Task 4, section 4A.
4B: Calculate the Mean Squared Prediction Error (MSPE) on the test data for the model identified through forward selection in Task 2. Document your findings under Task 4, section 4B.
4C: Calculate the Mean Squared Prediction Error (MSPE) on the test data for the model identified through backward selection in Task 2. Document your findings under Task 4, section 4C.
4D: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda value obtained from Ridge Regression in Task 3. Record your analysis in Task 4, section 4D.
4F: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda value obtained from Lasso Regression in Task 3. Record your analysis in Task 4, section 4F.
4G: Compute the Mean Squared Prediction Error (MSPE) for the test dataset using the model refined with the optimal lambda and l1_ratio values obtained from ElasticNet Regression in Task 3. Record your analysis in Task 4, section 4G.
4H: Evaluate which model performed best on the test dataset. Are the differences in MSPEs significant? As a data scientist aiming to maximize donations for the charity, decide which of the models explored in this lab you would recommend for production use. Provide detailed justification for your choice to fully address this question.
Place all the code related to Task 4 within the following code block:
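The MSPE computation that every Task 4 section repeats can be sketched once, as below. A synthetic train/test split stands in for the processed charity datasets; for 4B through 4G you would substitute the predictions of each fitted model from Tasks 2 and 3 in place of the full linear model shown here.

```python
# MSPE (mean squared prediction error) of a fitted model on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed train/test split
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=5410)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5410)

# 4A benchmark: full linear regression on all predictors
full = LinearRegression().fit(X_train, y_train)
mspe = mean_squared_error(y_test, full.predict(X_test))
print("full-model MSPE:", mspe)
# 4B-4G: repeat mean_squared_error(y_test, model.predict(X_test))
# for each stepwise / ridge / lasso / elastic-net model.
```

Comparing these MSPE values side by side, together with a judgment on whether the gaps are practically meaningful, is what 4H asks for.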
ID | reg1 | reg2 | reg3 | reg4 | home | kids | hinc | genf | wrat | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag | agif | donr | damt | Validation |
---|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------------|
1 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 8 | 302 | 76 | 82 | 0 | 20 | 81 | 81 | 19 | 17 | 6 | 21.05 | 0 | 0 | Training |
2 | 0 | 0 | 1 | 0 | 1 | 2 | 4 | 0 | 8 | 262 | 130 | 130 | 1 | 95 | 156 | 16 | 17 | 19 | 3 | 13.26 | 1 | 15 | Training |
5 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 4 | 295 | 39 | 71 | 14 | 85 | 132 | 15 | 10 | 10 | 6 | 12.07 | 1 | 17 | Validation |
6 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 0 | 9 | 114 | 17 | 25 | 44 | 83 | 131 | 5 | 3 | 13 | 4 | 4.12 | 1 | 12 | Training |
7 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 0 | 8 | 145 | 39 | 42 | 10 | 50 | 74 | 6 | 5 | 22 | 3 | 6.5 | 0 | 0 | Training |
8 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 0 | 5 | 165 | 34 | 35 | 19 | 11 | 41 | 4 | 2 | 20 | 7 | 3.45 | 0 | 0 | Training |
10 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 1 | 7 | 200 | 38 | 58 | 5 | 42 | 63 | 12 | 10 | 19 | 3 | 9.42 | 0 | 0 | Training |
11 | 0 | 0 | 1 | 0 | 1 | 3 | 2 | 1 | 8 | 152 | 46 | 46 | 20 | 100 | 414 | 25 | 14 | 39 | 7 | 10.12 | 0 | 0 | Training |
12 | 0 | 0 | 0 | 1 | 1 | 3 | 4 | 1 | 6 | 272 | 69 | 69 | 0 | 98 | 169 | 29 | 36 | 23 | 7 | 8.97 | 1 | 17 | Training |
13 | 0 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 9 | 207 | 54 | 54 | 14 | 13 | 34 | 9 | 7 | 19 | 11 | 6.28 | 1 | 12 | Training |
14 | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 0 | 8 | 21 | 36 | 32 | 54 | 117 | 5 | 4 | 15 | 9 | 5.11 | 1 | 15 | Validation |
15 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 1 | 8 | 196 | 57 | 82 | 1 | 16 | 71 | 36 | 20 | 22 | 8 | 13.44 | 1 | 18 | Training |
17 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 9 | 138 | 19 | 41 | 25 | 92 | 201 | 12 | 10 | 8 | 7 | 11.15 | 1 | 13 | Training |
18 | 0 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 9 | 200 | 44 | 48 | 5 | 58 | 61 | 23 | 13 | 15 | 4 | 12.84 | 1 | 14 | Validation |
19 | 1 | 0 | 0 | 0 | 1 | 0 | 4 | 1 | 1 | 278 | 79 | 83 | 0 | 87 | 89 | 29 | 16 | 16 | 6 | 8.14 | 1 | 12 | Training |
21 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 9 | 158 | 26 | 33 | 16 | 78 | 117 | 21 | 6 | 19 | 8 | 8.68 | 1 | 15 | Training |
24 | 0 | 0 | 0 | 1 | 1 | 4 | 6 | 1 | 5 | 141 | 32 | 37 | 21 | 65 | 79 | 5 | 5 | 21 | 6 | 3.69 | 0 | 0 | Training |
25 | 0 | 1 | 0 | 0 | 1 | 2 | 4 | 0 | 6 | 142 | 19 | 40 | 21 | 39 | 94 | 12 | 13 | 19 | 3 | 6.94 | 1 | 14 | Training |
26 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 | 1 | 224 | 55 | 66 | 6 | 134 | 437 | 14 | 14 | 19 | 8 | 9.37 | 0 | 0 | Training |
30 | 1 | 0 | 0 | 0 | 1 | 1 | 5 | 1 | 9 | 282 | 69 | 87 | 0 | 62 | 70 | 16 | 11 | 22 | 10 | 11.61 | 1 | 18 | Training |
32 | 1 | 0 | 0 | 0 | 1 | 2 | 4 | 1 | 6 | 213 | 70 | 70 | 6 | 59 | 104 | 17 | 11 | 20 | 9 | 12.84 | 1 | 15 | Validation |
33 | 0 | 0 | 1 | 0 | 0 | 1 | 4 | 0 | 2 | 128 | 29 | 35 | 15 | 69 | 78 | 60 | 47 | 18 | 4 | 23.81 | 0 | 0 | Training |
34 | 0 | 1 | 0 | 0 | 1 | 2 | 5 | 0 | 8 | 263 | 40 | 67 | 4 | 76 | 105 | 24 | 21 | 13 | 10 | 19.43 | 0 | 0 | Training |
35 | 1 | 0 | 0 | 0 | 1 | 2 | 7 | 0 | 7 | 119 | 23 | 57 | 19 | 39 | 74 | 12 | 13 | 18 | 10 | 11.77 | 0 | 0 | Training |
38 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 9 | 204 | 79 | 79 | 2 | 59 | 94 | 9 | 8 | 13 | 3 | 5.2 | 1 | 13 | Training |
39 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 9 | 61 | 4 | 18 | 73 | 25 | 69 | 15 | 9 | 14 | 10 | 6.85 | 0 | 0 | Training |
40 | 0 | 1 | 0 | 0 | 1 | 2 | 3 | 1 | 2 | 93 | 29 | 31 | 18 | 94 | 136 | 9 | 4 | 38 | 7 | 5.19 | 0 | 0 | Training |
42 | 1 | 0 | 0 | 0 | 1 | 3 | 5 | 0 | 9 | 174 | 40 | 47 | 11 | 68 | 99 | 16 | 8 | 17 | 10 | 10.5 | 0 | 0 | Training |
43 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 7 | 179 | 59 | 59 | 3 | 73 | 146 | 12 | 4 | 17 | 6 | 10.44 | 1 | 15 | Training |
44 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 6 | 284 | 43 | 82 | 4 | 14 | 40 | 14 | 7 | 12 | 5 | 9.8 | 1 | 15 | Training |
45 | 0 | 0 | 1 | 0 | 0 | 3 | 5 | 1 | 0 | 209 | 66 | 66 | 6 | 62 | 149 | 68 | 20 | 15 | 7 | 23.31 | 0 | 0 | Training |
46 | 1 | 0 | 0 | 0 | 1 | 2 | 4 | 1 | 3 | 163 | 36 | 44 | 9 | 50 | 74 | 9 | 5 | 16 | 11 | 6.3 | 0 | 0 | Training |
48 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 8 | 124 | 20 | 43 | 26 | 43 | 75 | 17 | 13 | 28 | 6 | 14.51 | 0 | 0 | Training |
51 | 0 | 0 | 0 | 0 | 1 | 0 | 7 | 0 | 8 | 324 | 75 | 78 | 0 | 90 | 177 | 6 | 7 | 21 | 5 | 8.04 | 1 | 14 | Training |
52 | 0 | 1 | 0 | 0 | 1 | 2 | 3 | 0 | 9 | 167 | 73 | 73 | 2 | 34 | 63 | 10 | 6 | 20 | 5 | 8.77 | 1 | 11 | Training |
53 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 1 | 9 | 94 | 30 | 30 | 28 | 72 | 133 | 52 | 41 | 16 | 6 | 14.39 | 1 | 14 | Training |
54 | 0 | 1 | 0 | 0 | 1 | 4 | 4 | 2 | 9 | 231 | 55 | 58 | 3 | 66 | 112 | 20 | 16 | 15 | 6 | 19.18 | 1 | 13 | Training |
55 | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 1 | 8 | 158 | 38 | 38 | 22 | 55 | 80 | 20 | 20 | 14 | 3 | 10.22 | 1 | 16 | Training |
56 | 0 | 0 | 0 | 0 | 1 | 2 | 5 | 1 | 6 | 427 | 78 | 114 | 0 | 78 | 131 | 9 | 4 | 24 | 6 | 6.13 | 1 | 13 | Training |
57 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 1 | 7 | 124 | 41 | 41 | 12 | 24 | 56 | 53 | 56 | 13 | 4 | 22.5 | 1 | 17 | Training |
58 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 1 | 8 | 172 | 34 | 55 | 15 | 62 | 73 | 24 | 21 | 19 | 7 | 12.32 | 0 | 0 | Validation |
Can you give me Python code, with descriptions of the answers, for Task 1, Task 2, Task 3, and Task 4 (a, b, c, d, e, f)?