Question

Make use of the scikit-learn (sklearn) Python package in your function implementations.

Complete the Following Functions in task4.py:

calculate_naive_metrics

Given a train dataframe, test dataframe, target_col and naive assumption, split the target column out of the training and test dataframes to create a feature dataframe and a target series for each. Then calculate accuracy, recall, precision and f1 score (each rounded to 4 decimal places) using the sklearn metric functions, the train and test target values, and the naive assumption.
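Every function starts by separating features from the target. A minimal sketch of that split (variable names are illustrative, not required by the template):

```python
# Separate the target series from the feature columns
y_train = train_dataset[target_col]
X_train = train_dataset.drop(columns=[target_col])
y_test = test_dataset[target_col]
X_test = test_dataset.drop(columns=[target_col])
```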

calculate_logistic_regression_metrics

Given a train dataframe, test dataframe, target_col and logreg_kwargs, split the target column out of the training and test dataframes to create a feature dataframe and a target series for each. Then train a logistic regression model (initialized with the kwargs) on the training data and predict, both binary predictions and probability estimates, on the training and test data. Using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, f1 score, false positive rate, false negative rate and area under the receiver operating characteristic curve (using the probabilities for roc auc) for both the training and test datasets.
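sklearn has no ready-made scorer for the false positive and false negative rates, so one common approach (a sketch, not necessarily the graded one) derives them from the confusion matrix and feeds the positive-class probabilities to roc_auc_score; `model` here stands for any fitted classifier, and `X_train`/`y_train` follow the split sketch above:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

preds = model.predict(X_train)              # binary predictions
probs = model.predict_proba(X_train)[:, 1]  # positive-class probability estimates

tn, fp, fn, tp = confusion_matrix(y_train, preds).ravel()
fpr = round(fp / (fp + tn), 4)  # false positive rate
fnr = round(fn / (fn + tp), 4)  # false negative rate
roc_auc = round(roc_auc_score(y_train, probs), 4)
```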

For feature importance, use the top 10 features selected by RFE, sorted by the absolute value of the coefficient from largest to smallest. Make sure you use the same feature and importance column names that ModelMetrics defines in feat_name_col and imp_col, and that the index runs 0-9, which you can get with `df.reset_index(drop=True)`.
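One plausible reading of that step (a sketch under the assumption that the importances come from the estimator RFE refits on the 10 surviving features; `model`, `X_train` and `y_train` follow the split sketch above):

```python
import pandas as pd
from sklearn.feature_selection import RFE

# Recursively eliminate features until only 10 remain
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X_train, y_train)

# Coefficients of the estimator refit on the selected features,
# ranked by absolute value, largest first, with a fresh 0-9 index
importance = pd.DataFrame({
    "Feature": X_train.columns[rfe.support_],
    "Importance": rfe.estimator_.coef_[0],
})
importance = importance.reindex(
    importance["Importance"].abs().sort_values(ascending=False).index
).reset_index(drop=True)
```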

calculate_random_forest_metrics

Given a train dataframe, test dataframe, target_col and rf_kwargs, split the target column out of the training and test dataframes to create a feature dataframe and a target series for each. Then train a random forest model (initialized with the kwargs) on the training data and predict, both binary predictions and probability estimates, on the training and test data. Using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, f1 score, false positive rate, false negative rate and area under the receiver operating characteristic curve (using the probabilities for roc auc) for both the training and test datasets.

For feature importance, use the top 10 features from the model's built-in feature_importances_ attribute, sorted from largest to smallest. Make sure you use the same feature and importance column names that ModelMetrics defines in feat_name_col and imp_col, and that the index runs 0-9, which you can get with `df.reset_index(drop=True)`.
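A minimal sketch of that step; the same pattern works for the gradient boosting model described next, since both estimators expose a feature_importances_ attribute:

```python
import pandas as pd

# Built-in impurity-based importances, largest first, top 10 only
importance = (
    pd.DataFrame({
        "Feature": X_train.columns,
        "Importance": model.feature_importances_,
    })
    .sort_values("Importance", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
```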

calculate_gradient_boosting_metrics

Given a train dataframe, test dataframe, target_col and gb_kwargs, split the target column out of the training and test dataframes to create a feature dataframe and a target series for each. Then train a gradient boosting model (initialized with the kwargs) on the training data and predict, both binary predictions and probability estimates, on the training and test data. Using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, f1 score, false positive rate, false negative rate and area under the receiver operating characteristic curve (using the probabilities for roc auc) for both the training and test datasets.

For feature importance, use the top 10 features from the model's built-in feature_importances_ attribute, sorted from largest to smallest, exactly as in the random forest sketch above (same Feature and Importance column names from ModelMetrics, index 0-9 via `df.reset_index(drop=True)`).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import *
from sklearn.linear_model import *
from sklearn.ensemble import *
from sklearn.feature_selection import RFE
```

```python
class ModelMetrics:
    def __init__(self, model_type: str, train_metrics: dict, test_metrics: dict, feature_importance_df: pd.DataFrame):
        self.model_type = model_type
        self.train_metrics = train_metrics
        self.test_metrics = test_metrics
        self.feat_imp_df = feature_importance_df
        self.feat_name_col = "Feature"
        self.imp_col = "Importance"

    def add_train_metric(self, metric_name: str, metric_val: float):
        self.train_metrics[metric_name] = metric_val

    def add_test_metric(self, metric_name: str, metric_val: float):
        self.test_metrics[metric_name] = metric_val

    def __str__(self):
        output_str = f"MODEL TYPE: {self.model_type}\n"
        output_str += f"TRAINING METRICS:\n"
        for key in sorted(self.train_metrics.keys()):
            output_str += f"  - {key} : {self.train_metrics[key]:.4f}\n"
        output_str += f"TESTING METRICS:\n"
        for key in sorted(self.test_metrics.keys()):
            output_str += f"  - {key} : {self.test_metrics[key]:.4f}\n"
        if self.feat_imp_df is not None:
            output_str += f"FEATURE IMPORTANCES:\n"
            for i in self.feat_imp_df.index:
                output_str += f"  - {self.feat_imp_df[self.feat_name_col][i]} : {self.feat_imp_df[self.imp_col][i]:.4f}\n"
        return output_str
```
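For reference, a ModelMetrics instance might be built and printed like this (the metric values are placeholders for illustration, not expected outputs):

```python
# Placeholder values, for illustration only
mm = ModelMetrics("Naive",
                  train_metrics={"accuracy": 0.5, "fscore": 0.6667},
                  test_metrics={"accuracy": 0.5, "fscore": 0.6667},
                  feature_importance_df=None)
mm.add_train_metric("recall", 1.0)
mm.add_test_metric("recall", 1.0)
print(mm)
```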

```python
def calculate_naive_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, naive_assumption: int) -> ModelMetrics:
    # TODO: Write the necessary code to calculate accuracy, recall, precision and fscore given a train and test dataframe
    # and a train and test target series and naive assumption
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    naive_metrics = ModelMetrics("Naive", train_metrics, test_metrics, None)
    return naive_metrics
```
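One way the TODO above could be filled in, relying on the imports at the top of task4.py and assuming the target is a binary 0/1 column (a sketch, not the official solution; `_naive` is a hypothetical helper name):

```python
def calculate_naive_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, naive_assumption: int) -> ModelMetrics:
    y_train = train_dataset[target_col]
    y_test = test_dataset[target_col]

    # The naive model ignores the features and always predicts naive_assumption
    train_preds = np.full(len(y_train), naive_assumption)
    test_preds = np.full(len(y_test), naive_assumption)

    def _naive(y_true, y_pred):
        # Note: precision may emit an UndefinedMetricWarning (and return 0)
        # when the assumed class never appears in the predictions' true positives
        return {
            "accuracy": round(accuracy_score(y_true, y_pred), 4),
            "recall": round(recall_score(y_true, y_pred), 4),
            "precision": round(precision_score(y_true, y_pred), 4),
            "fscore": round(f1_score(y_true, y_pred), 4),
        }

    return ModelMetrics("Naive", _naive(y_train, train_preds), _naive(y_test, test_preds), None)
```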

```python
def calculate_logistic_regression_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    # TODO: Write the necessary code to train a logistic regression binary classification model and calculate
    # accuracy, recall, precision, fscore, false positive rate, false negative rate and area under the
    # receiver operating characteristic curve given a train and test dataframe and train and test target series
    # and keyword arguments for the logistic regression model
    model = LogisticRegression()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    # TODO: Use RFE to select the top 10 features
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by ascending ranking then descending absolute value of Importance
    log_reg_importance = pd.DataFrame()
    log_reg_metrics = ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance)

    return log_reg_metrics, model
```
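A possible completion under the same assumptions; `_binary_metrics` is a hypothetical helper, not part of the template, and the RFE handling follows the importance sketch from the question above:

```python
def _binary_metrics(model, X, y) -> dict:
    """Accuracy/recall/precision/fscore/fpr/fnr/roc_auc, each rounded to 4 places."""
    preds = model.predict(X)
    probs = model.predict_proba(X)[:, 1]  # positive-class probabilities for roc_auc
    tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
    return {
        "accuracy": round(accuracy_score(y, preds), 4),
        "recall": round(recall_score(y, preds), 4),
        "precision": round(precision_score(y, preds), 4),
        "fscore": round(f1_score(y, preds), 4),
        "fpr": round(fp / (fp + tn), 4),
        "fnr": round(fn / (fn + tp), 4),
        "roc_auc": round(roc_auc_score(y, probs), 4),
    }

def calculate_logistic_regression_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    y_train = train_dataset[target_col]
    X_train = train_dataset.drop(columns=[target_col])
    y_test = test_dataset[target_col]
    X_test = test_dataset.drop(columns=[target_col])

    model = LogisticRegression(**logreg_kwargs)
    model.fit(X_train, y_train)

    # RFE down to 10 features, then rank the refit coefficients by |value|
    rfe = RFE(model, n_features_to_select=10)
    rfe.fit(X_train, y_train)
    log_reg_importance = pd.DataFrame({
        "Feature": X_train.columns[rfe.support_],
        "Importance": rfe.estimator_.coef_[0],
    })
    log_reg_importance = log_reg_importance.reindex(
        log_reg_importance["Importance"].abs().sort_values(ascending=False).index
    ).reset_index(drop=True)

    log_reg_metrics = ModelMetrics("Logistic Regression",
                                   _binary_metrics(model, X_train, y_train),
                                   _binary_metrics(model, X_test, y_test),
                                   log_reg_importance)
    return log_reg_metrics, model
```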

```python
def calculate_random_forest_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, rf_kwargs) -> tuple[ModelMetrics, RandomForestClassifier]:
    # TODO: Write the necessary code to train a random forest binary classification model and calculate
    # accuracy, recall, precision, fscore, false positive rate, false negative rate and area under the
    # receiver operating characteristic curve given a train and test dataframe and train and test target series
    # and keyword arguments for the random forest model
    model = RandomForestClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    # TODO: Reminder: DON'T use RFE for rf_importance
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by descending absolute value of Importance
    rf_importance = pd.DataFrame()
    rf_metrics = ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance)

    return rf_metrics, model
```
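A possible completion, reusing the hypothetical `_binary_metrics` helper from the logistic regression sketch; note the importances come from the built-in attribute, not RFE:

```python
def calculate_random_forest_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, rf_kwargs) -> tuple[ModelMetrics, RandomForestClassifier]:
    y_train = train_dataset[target_col]
    X_train = train_dataset.drop(columns=[target_col])
    y_test = test_dataset[target_col]
    X_test = test_dataset.drop(columns=[target_col])

    model = RandomForestClassifier(**rf_kwargs)
    model.fit(X_train, y_train)

    # Built-in impurity-based importances, largest first, top 10 only
    rf_importance = (
        pd.DataFrame({"Feature": X_train.columns,
                      "Importance": model.feature_importances_})
        .sort_values("Importance", ascending=False)
        .head(10)
        .reset_index(drop=True)
    )

    rf_metrics = ModelMetrics("Random Forest",
                              _binary_metrics(model, X_train, y_train),
                              _binary_metrics(model, X_test, y_test),
                              rf_importance)
    return rf_metrics, model
```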

```python
def calculate_gradient_boosting_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, gb_kwargs) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    # TODO: Write the necessary code to train a gradient boosting binary classification model and calculate
    # accuracy, recall, precision, fscore, false positive rate, false negative rate and area under the
    # receiver operating characteristic curve given a train and test dataframe and train and test target series
    # and keyword arguments for the gradient boosting model
    model = GradientBoostingClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    # TODO: Reminder: DON'T use RFE for gb_importance
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by descending absolute value of Importance
    gb_importance = pd.DataFrame()
    gb_metrics = ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance)

    return gb_metrics, model
```
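The gradient boosting version mirrors the random forest sketch; only the estimator and the labels change:

```python
def calculate_gradient_boosting_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame, target_col: str, gb_kwargs) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    y_train = train_dataset[target_col]
    X_train = train_dataset.drop(columns=[target_col])
    y_test = test_dataset[target_col]
    X_test = test_dataset.drop(columns=[target_col])

    model = GradientBoostingClassifier(**gb_kwargs)
    model.fit(X_train, y_train)

    # Built-in impurity-based importances, largest first, top 10 only
    gb_importance = (
        pd.DataFrame({"Feature": X_train.columns,
                      "Importance": model.feature_importances_})
        .sort_values("Importance", ascending=False)
        .head(10)
        .reset_index(drop=True)
    )

    gb_metrics = ModelMetrics("Gradient Boosting",
                              _binary_metrics(model, X_train, y_train),
                              _binary_metrics(model, X_test, y_test),
                              gb_importance)
    return gb_metrics, model
```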
