Question
This includes well-commented code as well as a README.txt. If documentation is poor for any part of the project. Use Python.
Helper Functions (30 points)
Implement the following functions in a file titled helpers.py. You must implement the following functions:
load_data(fname) - This function should take a filename and load the data in the file into a Pandas DataFrame. The DataFrame should be returned from the function.
clean_data(df) - This function should take a Pandas DataFrame and either remove or replace all NaN/Inf values. If you replace the NaN values, you must choose how to replace them (with mean, median, fixed value, etc.). Your choice should be clearly indicated in your documentation. This function should also remove any columns of the data that are not numerical features. This function will return a cleaned DataFrame.
split_data(df) - This function should take a Pandas DataFrame and split the DataFrame into training and testing data. This function should split the data into 80% for training and 20% for testing. You can do this randomly or use the first 80% for training and the remaining for testing. Make your choice clear in the documentation. This function will return four DataFrames: X_train, y_train, X_test, and y_test.
Multi-Class Classification (70 points)
Implement the following functions in a file titled multiclass_classification.py. In this assignment, you will perform multi-class classification using the network traffic data. We want you to do this in two ways:
Direct Multi-Class Classification (30 points)
Directly use our previous methods for multi-class classification (including Decision Trees and KNN) to predict multiple classes. Implement the following functions:
direct_multiclass_train(model_name, X_train, y_train) - This function should take the model_name (dt, knn, mlp, rf) as input along with the training data (two DataFrames) and return a trained model.
direct_multiclass_test(model, X_test, y_test) - This function should take a trained model and evaluate the model on the test data, returning an accuracy value.
Direct Multi-Class Classification with Resampling (20 points)
Perform data resampling to handle the unbalanced data distribution, and then conduct multi-class classification using MLP and random forest. Implement the following new function:
data_resampling(df, sampling_strategy) - This function should take the DataFrame as input, undersample it using sampling_strategy, and return the resampled df.
Hierarchical Multi-Class Classification (30 points)
Perform binary classification first (benign vs. malicious) using MLP. Once a sample has been identified as malicious, perform multi-class classification to identify what kind of malicious activity is occurring using random forest. Implement the following new functions:
improved_data_split(df) - This function should split the original data df into train and test sets that both contain all the categories. It returns the train and test DataFrames: df_train and df_test.
get_binary_dataset(df) - This function should convert df into a binary dataset (benign vs. malicious) and return it.
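The helper functions and the direct multi-class train/test pair described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions that the question does not state: the data file is a CSV, the class column is named "Label" (a hypothetical name, kept even though it is non-numeric), NaN/Inf values are replaced with the column median, the 80/20 split is random with a fixed seed, and the model hyperparameters are placeholder choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

LABEL_COL = "Label"  # assumption: name of the class column in the traffic data


def load_data(fname):
    """Load the file into a DataFrame (assumes CSV format)."""
    return pd.read_csv(fname)


def clean_data(df, label_col=LABEL_COL):
    """Keep numeric feature columns plus the label; replace NaN/Inf with medians."""
    # Drop every non-numeric feature column (the label is re-attached below).
    features = df.drop(columns=[label_col]).select_dtypes(include=[np.number])
    # Treat +/-Inf as missing so one rule handles both cases.
    features = features.replace([np.inf, -np.inf], np.nan)
    # A column with no valid values has no median, so drop it.
    features = features.dropna(axis=1, how="all")
    # Chosen replacement strategy: per-column median.
    features = features.fillna(features.median())
    cleaned = features.copy()
    cleaned[label_col] = df[label_col]
    return cleaned


def split_data(df, label_col=LABEL_COL, seed=0):
    """Random 80/20 split; y_train and y_test come back as pandas Series."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    return X_train, y_train, X_test, y_test


# Map each model_name from the assignment to a scikit-learn estimator factory.
MODELS = {
    "dt": lambda: DecisionTreeClassifier(random_state=0),
    "knn": lambda: KNeighborsClassifier(n_neighbors=5),
    "mlp": lambda: MLPClassifier(max_iter=300, random_state=0),
    "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}


def direct_multiclass_train(model_name, X_train, y_train):
    """Fit the requested classifier directly on all classes."""
    if model_name not in MODELS:
        raise ValueError(f"unknown model_name: {model_name!r}")
    model = MODELS[model_name]()
    model.fit(X_train, y_train)
    return model


def direct_multiclass_test(model, X_test, y_test):
    """Return the accuracy of the trained model on the test data."""
    return model.score(X_test, y_test)
```

A typical call chain would be load_data, then clean_data, then split_data, followed by direct_multiclass_train with one of dt, knn, mlp, or rf and direct_multiclass_test on the held-out 20%. Whichever replacement and split strategies are actually used should be stated in the README.txt, as the assignment requires.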