Question
Can you also explain how to call P1 from P2 and use the functions created in P1 in P2?
P1
Make use of the scikit-learn (sklearn) Python package in your function implementations.
Complete the Following Functions in task4.py:
calculate_naive_metrics
Given a train dataframe, test dataframe, target_col, and naive assumption, split the target column out of the training and test dataframes to create feature dataframes and target series, then calculate (rounded to 4 decimal places) accuracy, recall, precision, and F1 score using the sklearn functions, the train and test target values, and the naive assumption.
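As a minimal sketch (the helper name `naive_metrics_for` and the `zero_division=0` guard are illustrative choices, not part of the template), the naive baseline can be scored by predicting the assumed constant label for every row:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def naive_metrics_for(y_true, naive_assumption: int) -> dict:
    # Predict the same (assumed) label for every row, then score it
    y_pred = np.full(len(y_true), naive_assumption)
    return {
        "accuracy": round(accuracy_score(y_true, y_pred), 4),
        "recall": round(recall_score(y_true, y_pred, zero_division=0), 4),
        "precision": round(precision_score(y_true, y_pred, zero_division=0), 4),
        "fscore": round(f1_score(y_true, y_pred, zero_division=0), 4),
    }
```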
calculate_logistic_regression_metrics
Given a train dataframe, test dataframe, target_col, and logreg_kwargs, split the target column out of the training and test dataframes to create feature dataframes and target series. Then train a logistic regression model (initialized using the kwargs) on the training data and predict (both binary predictions and probability estimates) on the training and test data. Then, using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, F1 score, false positive rate, false negative rate, and area under the receiver operating characteristic curve (using probabilities for ROC AUC) for both the training and test datasets.
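One way the less common metrics could be derived (a sketch for the test split only; the function name and the confusion-matrix route to FPR/FNR are my own choices, and the same pattern applies to the training split):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

def test_split_rates_sketch(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame,
                            target_col: str, logreg_kwargs: dict) -> dict:
    # Split the target column out of each dataframe
    X_train, y_train = train_dataset.drop(columns=[target_col]), train_dataset[target_col]
    X_test, y_test = test_dataset.drop(columns=[target_col]), test_dataset[target_col]

    model = LogisticRegression(**logreg_kwargs).fit(X_train, y_train)
    y_pred = model.predict(X_test)              # binary predictions
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "fpr": round(fp / (fp + tn), 4),                     # false positive rate
        "fnr": round(fn / (fn + tp), 4),                     # false negative rate
        "roc_auc": round(roc_auc_score(y_test, y_prob), 4),  # AUC from probabilities
    }
```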
For feature importance, use the top 10 features selected by RFE, sorted by absolute value of the coefficient from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col and imp_col, and that the index is 0-9; you can do that with `df.reset_index(drop=True)`).
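A sketch of that selection step (assuming the same logistic regression model is passed to RFE as its estimator; `rfe_importance_sketch` is an illustrative name):

```python
import pandas as pd
from sklearn.feature_selection import RFE

def rfe_importance_sketch(X_train: pd.DataFrame, y_train, model) -> pd.DataFrame:
    # Keep the 10 highest-ranked features; RFE refits the estimator on them
    rfe = RFE(model, n_features_to_select=10).fit(X_train, y_train)
    selected = X_train.columns[rfe.support_]
    coefs = rfe.estimator_.coef_[0]  # coefficients of the refit estimator

    imp = pd.DataFrame({"Feature": selected, "Importance": coefs})
    # Sort by absolute coefficient, descending, then reset to a 0-9 index
    imp = imp.reindex(imp["Importance"].abs().sort_values(ascending=False).index)
    return imp.reset_index(drop=True)
```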
calculate_random_forest_metrics
Given a train dataframe, test dataframe, target_col, and rf_kwargs, split the target column out of the training and test dataframes to create feature dataframes and target series. Then train a random forest model (initialized using the kwargs) on the training data and predict (both binary predictions and probability estimates) on the training and test data. Then, using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, F1 score, false positive rate, false negative rate, and area under the receiver operating characteristic curve (using probabilities for ROC AUC) for both the training and test datasets.
For feature importance, use the top 10 features from the built-in feature importance attribute, sorted from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col and imp_col, and that the index is 0-9; you can do that with `df.reset_index(drop=True)`).
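A sketch of pulling the built-in importances (this works for both RandomForestClassifier and the GradientBoostingClassifier used later, since both expose `feature_importances_` after fitting; the helper name is illustrative):

```python
import pandas as pd

def builtin_importance_sketch(fitted_model, feature_names) -> pd.DataFrame:
    # feature_importances_ is available once the model has been fit
    imp = pd.DataFrame({
        "Feature": feature_names,
        "Importance": fitted_model.feature_importances_,
    })
    # Importances are non-negative, so a plain descending sort suffices
    imp = imp.sort_values("Importance", ascending=False).head(10)
    return imp.reset_index(drop=True)  # 0-9 index as required
```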
calculate_gradient_boosting_metrics
Given a train dataframe, test dataframe, target_col, and gb_kwargs, split the target column out of the training and test dataframes to create feature dataframes and target series. Then train a gradient boosting model (initialized using the kwargs) on the training data and predict (both binary predictions and probability estimates) on the training and test data. Then, using those predictions and estimates along with the target values, calculate (rounded to 4 decimal places) accuracy, recall, precision, F1 score, false positive rate, false negative rate, and area under the receiver operating characteristic curve (using probabilities for ROC AUC) for both the training and test datasets.
For feature importance, use the top 10 features from the built-in feature importance attribute, sorted from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col and imp_col, and that the index is 0-9; you can do that with `df.reset_index(drop=True)`); the sketch shown above for the random forest applies here as well.
```python
import numpy as np
import pandas as pd
from sklearn.metrics import *
from sklearn.linear_model import *
from sklearn.ensemble import *
from sklearn.feature_selection import RFE


class ModelMetrics:
    def __init__(self, model_type: str, train_metrics: dict, test_metrics: dict,
                 feature_importance_df: pd.DataFrame):
        self.model_type = model_type
        self.train_metrics = train_metrics
        self.test_metrics = test_metrics
        self.feat_imp_df = feature_importance_df
        self.feat_name_col = "Feature"
        self.imp_col = "Importance"

    def add_train_metric(self, metric_name: str, metric_val: float):
        self.train_metrics[metric_name] = metric_val

    def add_test_metric(self, metric_name: str, metric_val: float):
        self.test_metrics[metric_name] = metric_val

    def __str__(self):
        output_str = f"MODEL TYPE: {self.model_type}\n"
        output_str += "TRAINING METRICS:\n"
        for key in sorted(self.train_metrics.keys()):
            output_str += f"  - {key} : {self.train_metrics[key]:.4f}\n"
        output_str += "TESTING METRICS:\n"
        for key in sorted(self.test_metrics.keys()):
            output_str += f"  - {key} : {self.test_metrics[key]:.4f}\n"
        if self.feat_imp_df is not None:
            output_str += "FEATURE IMPORTANCES:\n"
            for i in self.feat_imp_df.index:
                output_str += f"  - {self.feat_imp_df[self.feat_name_col][i]} : {self.feat_imp_df[self.imp_col][i]:.4f}\n"
        return output_str


def calculate_naive_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame,
                            target_col: str, naive_assumption: int) -> ModelMetrics:
    # TODO: Calculate accuracy, recall, precision and fscore given a train and test
    # dataframe, a train and test target series, and a naive assumption
    train_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0}
    test_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0}
    naive_metrics = ModelMetrics("Naive", train_metrics, test_metrics, None)
    return naive_metrics


def calculate_logistic_regression_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame,
                                          target_col: str, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    # TODO: Train a logistic regression binary classification model and calculate
    # accuracy, recall, precision, fscore, false positive rate, false negative rate
    # and area under the receiver operating characteristic curve given a train and
    # test dataframe, train and test target series, and keyword arguments for the
    # logistic regression model
    model = LogisticRegression()
    train_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                     "fpr": 0, "fnr": 0, "roc_auc": 0}
    test_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                    "fpr": 0, "fnr": 0, "roc_auc": 0}
    # TODO: Use RFE to select the top 10 features
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by ascending ranking then descending absolute value of Importance
    log_reg_importance = pd.DataFrame()
    log_reg_metrics = ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance)
    return log_reg_metrics, model


def calculate_random_forest_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame,
                                    target_col: str, rf_kwargs) -> tuple[ModelMetrics, RandomForestClassifier]:
    # TODO: Train a random forest binary classification model and calculate accuracy,
    # recall, precision, fscore, false positive rate, false negative rate and area
    # under the receiver operating characteristic curve given a train and test
    # dataframe, train and test target series, and keyword arguments for the
    # random forest model
    model = RandomForestClassifier()
    train_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                     "fpr": 0, "fnr": 0, "roc_auc": 0}
    test_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                    "fpr": 0, "fnr": 0, "roc_auc": 0}
    # TODO: Reminder: DON'T use RFE for rf_importance
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by descending absolute value of Importance
    rf_importance = pd.DataFrame()
    rf_metrics = ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance)
    return rf_metrics, model


def calculate_gradient_boosting_metrics(train_dataset: pd.DataFrame, test_dataset: pd.DataFrame,
                                        target_col: str, gb_kwargs) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    # TODO: Train a gradient boosting binary classification model and calculate
    # accuracy, recall, precision, fscore, false positive rate, false negative rate
    # and area under the receiver operating characteristic curve given a train and
    # test dataframe, train and test target series, and keyword arguments for the
    # gradient boosting model
    model = GradientBoostingClassifier()
    train_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                     "fpr": 0, "fnr": 0, "roc_auc": 0}
    test_metrics = {"accuracy": 0, "recall": 0, "precision": 0, "fscore": 0,
                    "fpr": 0, "fnr": 0, "roc_auc": 0}
    # TODO: Reminder: DON'T use RFE for gb_importance
    # make sure the column of feature names is named Feature
    # and the column of importances is named Importance
    # and the dataframe is sorted by descending absolute value of Importance
    gb_importance = pd.DataFrame()
    gb_metrics = ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance)
    return gb_metrics, model
```
P2
Now that you have written functions for the different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (do any tuning locally or in a notebook, i.e. don't tune your model in Gradescope, since the autograder will likely time out). It will take in the CLAMP training data, train a model, then predict on a test set and output values from 0 to 1 for each row. Our autograder will compare your predictions with the correct answers; to get credit you will need a ROC AUC score of 0.9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). This is essentially a simulation of how your model would perform in a production system using batch inference.
Deliverables:
Make use of any of the techniques covered in this project to train a model and return predicted probabilities for each row of the test set as a DataFrame with two columns: index (the same index as the input test df) and malware_score (the predicted probabilities).
Complete the train_model_return_scores function in task5.py
```python
import numpy as np
import pandas as pd


def train_model_return_scores(train_df_path, test_df_path) -> pd.DataFrame:
    # TODO: Load and preprocess the train and test dfs
    # Train a sklearn model using the training data at train_df_path
    # Use any sklearn model and return the test index and model scores

    # TODO: the output dataframe should have 2 columns
    #   index : the row index of the test df
    #   malware_score : your model's output for that row of the test df
    test_scores = pd.DataFrame()
    return test_scores
```
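On the original question of calling P1 from P2: since both files live in the same directory, task5.py can simply import the P1 functions from task4.py. Below is a minimal sketch; the target column name "class", the rf_kwargs, the CSV loading, and the validation split are illustrative assumptions, not requirements of the assignment:

```python
# task5.py -- sketch of reusing the P1 (task4.py) functions in P2
import pandas as pd
from sklearn.model_selection import train_test_split
from task4 import calculate_random_forest_metrics  # P1 function, imported directly

def train_model_return_scores(train_df_path, test_df_path) -> pd.DataFrame:
    train_df = pd.read_csv(train_df_path)
    test_df = pd.read_csv(test_df_path)

    # The P1 helper expects a target column in both frames, so hold out
    # part of the training data as a labeled validation set
    tr, val = train_test_split(train_df, test_size=0.2, random_state=42)
    metrics, model = calculate_random_forest_metrics(
        tr, val, "class", {"n_estimators": 200}  # target name and kwargs are assumptions
    )
    print(metrics)  # ModelMetrics.__str__ gives a quick sanity check

    # Score the unlabeled test set with the fitted model returned by P1
    # (assumes test_df has the same feature columns as the training features)
    scores = model.predict_proba(test_df)[:, 1]
    return pd.DataFrame({"index": test_df.index, "malware_score": scores})
```

Because the P1 function both fits the model and reports metrics, importing it lets P2 reuse that reporting before scoring the unlabeled test set.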