
Question



Problem 2: Census Dataset

In Problem 2, you will be using census data from 1994 to attempt to predict whether or not a person has an annual salary greater than $50,000 based on other information provided in the census. You can find a description of the dataset here: Census Dataset.

Load the data stored in the tab-delimited file census.txt into a DataFrame named census. Use head() to display the first 10 rows of this DataFrame.

We will now check to see how many rows and columns there are in the DataFrame. Print the shape of the census DataFrame.

The last column is named salary. Each entry in this column is a string equal to either '<=50K' or '>50K'. Our goal is to create and compare several classification models for the purpose of predicting to which of these two classes an individual belongs based on the values of the other columns, which will be used as features in our models.

Before creating any models, we will check the distribution of values in our target variable. Without creating any new DataFrame variables, select the salary column, and then call its value_counts() method. Display the result.

We will now prepare our data by encoding the categorical features and splitting into training, validation, and test sets. Add a markdown cell with a level 3 header that reads: "Prepare the Data".

We will start by separating the categorical and numerical features into different arrays. Note that the following 8 features are categorical in nature: workclass, education, marital_status, occupation, relationship, race, sex, and native_country. The remaining 6 features are numerical. Perform the following steps in a single code cell:

1. Create a 2D array named X2_num by selecting the columns of census that represent numerical features.
2. Create a 2D array named X2_cat by selecting the columns of census that represent categorical features.
3. Create a 1D array named y2 by selecting the salary column.
4. Print the shapes of all three of these arrays with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

    Numerical Feature Array Shape:   xxxx
    Categorical Feature Array Shape: xxxx
    Label Array Shape:               xxxx

Note: The variables created here should be arrays, and not DataFrames or Series. You will need to use .values.

We will now perform one-hot encoding on the categorical variables. Perform the following steps in a single code cell:

1. Create a OneHotEncoder() object, setting sparse=False.
2. Fit the encoder to the categorical features.
3. Use the encoder to encode the categorical features, storing the result in a variable named X2_enc.
4. Print the shape of X2_enc with a message as shown below.

    Encoded Feature Array Shape: xxxx

We will now combine the numerical feature array with the encoded categorical feature array. Perform the following steps in a single code cell:

1. Use np.hstack to combine X2_num and X2_enc into a single array named X2.
2. Print the shape of X2 with a message as shown below.

    Feature Array Shape: xxxx
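A minimal sketch of the data-preparation steps up to this point is shown below. It assumes census.txt sits in the working directory as a tab-delimited file with a header row and column names matching the dataset description; adjust the path and column lists if your copy differs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load the tab-delimited census file and take a first look at the data.
census = pd.read_csv('census.txt', sep='\t')
print(census.head(10))                    # first 10 rows
print(census.shape)                       # (rows, columns)
print(census['salary'].value_counts())    # distribution of the target classes

# Separate categorical and numerical features into arrays (note .values).
cat_cols = ['workclass', 'education', 'marital_status', 'occupation',
            'relationship', 'race', 'sex', 'native_country']
num_cols = [c for c in census.columns if c not in cat_cols + ['salary']]

X2_num = census[num_cols].values
X2_cat = census[cat_cols].values
y2 = census['salary'].values

print('Numerical Feature Array Shape:  ', X2_num.shape)
print('Categorical Feature Array Shape:', X2_cat.shape)
print('Label Array Shape:              ', y2.shape)

# One-hot encode the categorical features. sparse=False follows the assignment;
# newer scikit-learn releases use sparse_output=False instead.
encoder = OneHotEncoder(sparse=False)
encoder.fit(X2_cat)
X2_enc = encoder.transform(X2_cat)
print('Encoded Feature Array Shape:', X2_enc.shape)

# Combine the numerical and encoded categorical features into one array.
X2 = np.hstack([X2_num, X2_enc])
print('Feature Array Shape:', X2.shape)
```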
We will now split the data into training, validation, and test sets, using a 70/15/15 split. Perform the following steps in a single code cell:

1. Use train_test_split() to split the data into training and holdout sets using a 70/30 split. Name the resulting arrays X2_train, X2_hold, y2_train, and y2_hold. Set random_state=1. Use stratified sampling.
2. Use train_test_split() to split the holdout data into validation and test sets using a 50/50 split. Name the resulting arrays X2_valid, X2_test, y2_valid, and y2_test. Set random_state=1. Use stratified sampling.
3. Print the shapes of X2_train, X2_valid, and X2_test with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

    Training Features Shape:   xxxx
    Validation Features Shape: xxxx
    Test Features Shape:       xxxx

We will now create and evaluate a logistic regression model. Add a markdown cell with a level 3 header that reads: "Logistic Regression Model". Perform the following steps in a single code cell:

1. Create a logistic regression model named lr_mod, setting solver='lbfgs' and max_iter=1000. Set penalty='none', unless that results in an error, in which case set C=10e1000.
2. Fit your model to the training data.
3. Calculate the training and validation accuracy with messages as shown below. Add spacing to ensure that the accuracy scores are left-aligned. Round the scores to 4 decimal places.

    Training Accuracy:   xxxx
    Validation Accuracy: xxxx

We will now create and evaluate several decision tree models. We will use the validation score for these models to perform hyperparameter tuning. Add a markdown cell with a level 3 header that reads: "Decision Tree Models". Perform the following steps in a single code cell:

1. Create empty lists named dt_train_acc and dt_valid_acc. These lists will store the accuracy scores that we calculate for each model.
2. Create a range variable named depth_range to represent a sequence of integers from 2 to 30.
3. Loop over the values in depth_range. Every time the loop executes, perform the following steps:
   a. Use NumPy to set a random seed of 1. This should be done inside the loop.
   b. Create a decision tree model named temp_tree with max_depth equal to the current value from depth_range that is being considered.
   c. Fit the model to the training data.
   d. Calculate the training and validation accuracy for temp_tree, appending the resulting values to the appropriate lists.
4. Use np.argmax to determine the index of the maximum value in dt_valid_acc. Store the result in dt_idx.
5. Use dt_idx and depth_range to find the optimal value for the max_depth hyperparameter. Store the result in dt_opt_depth.
6. Use dt_idx with the lists dt_train_acc and dt_valid_acc to determine the training and validation accuracies for the optimal model found.
7. Display the values found in Steps 5 and 6 with messages as shown below. Add spacing to ensure that the values replacing the xxxx symbols are left-aligned. Round the accuracy scores to 4 decimal places.

    Optimal value for max_depth:           xxxx
    Training Accuracy for Optimal Model:   xxxx
    Validation Accuracy for Optimal Model: xxxx

We will now plot the validation and training curves as a function of the max_depth parameter. Create a figure with two line plots on the same set of axes. One line plot should plot values of dt_train_acc against depth_range and the other should plot values of dt_valid_acc against depth_range. The x-axis should be labeled "Max Depth" and the y-axis should be labeled "Accuracy". The plot should contain a legend with two items that read "Training" and "Validation".
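A sketch of the 70/15/15 split and the logistic regression model might look like the following, assuming X2 and y2 were built as in the earlier sketch. The penalty='none' spelling is accepted by older scikit-learn releases; newer releases spell it penalty=None, which is the kind of error the assignment's fallback (C=10e1000) is meant to work around.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 70/30 split into training and holdout sets, stratified on the labels.
X2_train, X2_hold, y2_train, y2_hold = train_test_split(
    X2, y2, test_size=0.30, random_state=1, stratify=y2)

# 50/50 split of the holdout set into validation and test sets.
X2_valid, X2_test, y2_valid, y2_test = train_test_split(
    X2_hold, y2_hold, test_size=0.50, random_state=1, stratify=y2_hold)

print('Training Features Shape:  ', X2_train.shape)
print('Validation Features Shape:', X2_valid.shape)
print('Test Features Shape:      ', X2_test.shape)

# Unregularized logistic regression fit on the training set.
lr_mod = LogisticRegression(solver='lbfgs', max_iter=1000, penalty='none')
lr_mod.fit(X2_train, y2_train)

print('Training Accuracy:  ', round(lr_mod.score(X2_train, y2_train), 4))
print('Validation Accuracy:', round(lr_mod.score(X2_valid, y2_valid), 4))
```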

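The decision tree tuning loop and the accuracy curves could be sketched as follows, assuming the training and validation arrays from the previous step and reading "integers from 2 to 30" as inclusive of 30.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

dt_train_acc = []
dt_valid_acc = []
depth_range = range(2, 31)        # max_depth values 2 through 30

for depth in depth_range:
    np.random.seed(1)             # seed set inside the loop, per the instructions
    temp_tree = DecisionTreeClassifier(max_depth=depth)
    temp_tree.fit(X2_train, y2_train)
    dt_train_acc.append(temp_tree.score(X2_train, y2_train))
    dt_valid_acc.append(temp_tree.score(X2_valid, y2_valid))

dt_idx = np.argmax(dt_valid_acc)          # index of the best validation accuracy
dt_opt_depth = depth_range[dt_idx]        # corresponding max_depth value

print('Optimal value for max_depth:          ', dt_opt_depth)
print('Training Accuracy for Optimal Model:  ', round(dt_train_acc[dt_idx], 4))
print('Validation Accuracy for Optimal Model:', round(dt_valid_acc[dt_idx], 4))

# Training and validation accuracy as a function of max_depth.
plt.plot(depth_range, dt_train_acc, label='Training')
plt.plot(depth_range, dt_valid_acc, label='Validation')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```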