
Question

1 Approved Answer


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.utils import resample
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the dataset
df = pd.read_csv("dataset.csv")
# (A) Calculate the number of cases of manipulators versus non-manipulators in the dataset and draw a bar plot.
# Count the number of manipulators and non-manipulators
manipulator_counts = df['MANIPULATOR'].value_counts()
# Plot the bar plot
plt.bar(manipulator_counts.index, manipulator_counts.values)
plt.xlabel('Manipulator')
plt.ylabel('Count')
plt.title('Manipulator vs Non-Manipulator Counts')
plt.xticks([0,1],['Non-Manipulator', 'Manipulator'])
plt.show()
# (B) Create an 80:20 partition and find the number of positives in the test data.
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('MANIPULATOR', axis=1), df['MANIPULATOR'], test_size=0.2, random_state=42)
# Count the number of positives in the test data
positives_in_test = y_test.sum()
print("Number of positives in the test data:", positives_in_test)
# (C) Upsample the dataset to create a balanced dataset.
# Separate majority and minority classes
majority_class = df[df['MANIPULATOR']==0]
minority_class = df[df['MANIPULATOR']==1]
# Upsample minority class
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)
# Combine majority class with upsampled minority class
balanced_df = pd.concat([majority_class, minority_upsampled])
# Check the class distribution in the balanced dataset
print(balanced_df['MANIPULATOR'].value_counts())
# (D) Build models using this balanced dataset.
# Define features and target variable
X_balanced = balanced_df.drop('MANIPULATOR', axis=1)
y_balanced = balanced_df['MANIPULATOR']
# Initialize models
models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),  # raise max_iter so the solver converges
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier()
}
# (E) Comment on which metric should be given preference for this dataset.
# Because manipulators are rare, accuracy is misleading: a model that always predicts
# "non-manipulator" still scores high accuracy while catching no manipulators.
# Prefer precision, recall, and F1-score on the positive (manipulator) class, and
# ROC AUC as a threshold-independent summary of ranking performance.
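# Quick illustration (a minimal sketch, assuming y_test from the part (B) split):
# a trivial majority-class baseline gets high accuracy but zero recall.
y_baseline = np.zeros_like(y_test)  # always predict "non-manipulator"
print("Baseline accuracy:", accuracy_score(y_test, y_baseline))  # high when positives are rare
print("Baseline recall:", recall_score(y_test, y_baseline))  # 0.0 -- no manipulator detected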
# (F) Finalize the model for each technique after hyperparameter tuning using GridSearchCV, based on the selected metric.
# First pass: empty grids, i.e. each model with its default hyperparameters as a baseline.
# Initialize results dictionary to store evaluation metrics
results = {}
# Loop through each model
for name, model in models.items():
    # An empty param_grid makes GridSearchCV fit the model once with its defaults
    grid_search = GridSearchCV(model, param_grid={}, scoring='f1')
    grid_search.fit(X_balanced, y_balanced)
    # Predict on the test data from the original 80:20 split
    # (caveat: X_balanced may contain upsampled copies of test rows, which leaks information;
    # the revisited loop below splits the balanced data before fitting)
    y_pred = grid_search.predict(X_test)
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)
    # Store evaluation metrics in the results dictionary
    results[name] = {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1, 'ROC AUC': roc_auc}
# (G) Compare the model performances with respect to the different evaluation metrics.
results_df = pd.DataFrame(results).T  # transpose so each row is a model, each column a metric
print(results_df)
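# Optional visual comparison (a sketch; assumes results_df from above, models as rows):
results_df.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Score')
plt.title('Model Comparison Across Evaluation Metrics')
plt.tight_layout()
plt.show()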
# (H) Comment on the most important features for predicting the manipulators.
# Tree-based models such as Random Forest and XGBoost expose feature importance scores
# (the feature_importances_ attribute); ranking features by these scores, as sketched
# below, shows which variables drive the manipulator predictions.
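# A minimal sketch: fit a Random Forest on the balanced data and rank the columns
# of X_balanced by their importance scores.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_balanced, y_balanced)
importances = pd.Series(rf.feature_importances_, index=X_balanced.columns)
print(importances.sort_values(ascending=False))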
# (I) Downsample the dataset to create a balanced dataset.
# Downsampling mirrors the upsampling in (C): keep the minority class and sample the
# majority class down to the same size (see the sketch below).
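# A minimal sketch of the downsampling step, reusing resample() from part (C)
majority_downsampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)
balanced_down_df = pd.concat([majority_downsampled, minority_class])
print(balanced_down_df['MANIPULATOR'].value_counts())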
# (F, revisited) Finalize the model for each technique after hyperparameter tuning using GridSearchCV, based on the selected metric.
# Split the balanced data into train and test sets so the held-out rows never enter training
X_train_balanced, X_test_balanced, y_train_balanced, y_test_balanced = train_test_split(X_balanced, y_balanced, test_size=0.2, random_state=42)
# Initialize results dictionary to store evaluation metrics
results_tuned = {}
# Loop through each model, using a model-specific grid where one is defined
for name, model in models.items():
    if name == "SVM":
        param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'linear']}
    elif name == "Random Forest":
        param_grid = {'n_estimators': [100, 200, 300], 'max_features': ['sqrt', 'log2'], 'max_depth': [10, 20, 30, 40, 50]}
    else:
        param_grid = {}  # default hyperparameters for the remaining models
    grid_search = GridSearchCV(model, param_grid=param_grid, scoring='f1')
    grid_search.fit(X_train_balanced, y_train_balanced)
    print(name, "best params:", grid_search.best_params_)
    # Predict on the held-out portion of the balanced data
    y_pred = grid_search.predict(X_test_balanced)
    # Store evaluation metrics for the tuned model
    results_tuned[name] = {'Accuracy': accuracy_score(y_test_balanced, y_pred), 'Precision': precision_score(y_test_balanced, y_pred), 'Recall': recall_score(y_test_balanced, y_pred), 'F1 Score': f1_score(y_test_balanced, y_pred), 'ROC AUC': roc_auc_score(y_test_balanced, y_pred)}
print(pd.DataFrame(results_tuned).T)
