Question

Posted on Sep 30, 2024

Using the Auto data set and the scikit-learn library:

2. Create and add a binary variable column called mpg_high_low to the dataset that is set to High if mpg is above 30, and Low if mpg is less than or equal to 30. Make sure the mpg_high_low column is of type category.

3. Check whether the auto data is imbalanced with respect to mpg_high_low. Report the percentage of the data that belongs to each of the two classes (High and Low).

4. Split the dataset into 75% training and 25% test, and use 10-fold cross-validation for the models below.

5. Fit a logistic regression model to the training set to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Predict mpg_high_low using the test dataset and report the Accuracy, Precision, Recall, Specificity, and F1 measure.

6. Alter the threshold for classifying a Low to 0.6 and report the changes in the test performance metrics from those reported in Question 5.

7. Find the optimal threshold by drawing the ROC curve. Change the threshold to the optimal value you found from the ROC curve and report the changes in the test performance metrics from those reported in Question 5.

8. Fit a Naïve Bayes model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Predict mpg_high_low using the test dataset. Plot the ROC curve and report the best threshold on the ROC curve plot. Report the AUC on the curve plot as well. Report the accuracy, precision, recall, specificity, and F1 score.

9. Fit a KNN model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Use a grid search between 3 and 10 to find the best value of k. Report the accuracy, precision, recall, specificity, F1 score, and AUC.

10. Fit an LDA model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Report the accuracy, precision, recall, specificity, and F1 score.

11. Summarize the performance of all the above models by creating a dataframe with 6 columns: Model_Name, Accuracy, Precision, Recall, Specificity, and F1 Score. The dataframe should contain one row for each model you built above, with each column filled in with the appropriate metric. Print out the dataframe. Which model performed best from an accuracy point of view, and which model performed best from a recall point of view, without adjusting the threshold?
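
Several of the steps above share the same mechanics: the High/Low labelling rule, the class percentages, applying a custom threshold for Low, and the five metrics (specificity has no dedicated scikit-learn function). The following is a minimal pure-Python sketch of those mechanics, not a solution to the assignment. The function names and the sample values are hypothetical and made up for illustration, and treating High as the positive class is an assumption, since the question does not say which class is positive. With scikit-learn, the probabilities would typically come from a fitted model's `predict_proba` and the counts from `sklearn.metrics.confusion_matrix`.

```python
from collections import Counter


def label_mpg(mpg, cutoff=30):
    """Labelling rule from step 2: High if mpg is above the cutoff, else Low."""
    return "High" if mpg > cutoff else "Low"


def class_percentages(labels):
    """Step 3: percentage of rows that fall in each class."""
    counts = Counter(labels)
    return {cls: 100 * n / len(labels) for cls, n in counts.items()}


def predict_with_low_threshold(p_low, threshold=0.5):
    """Predict Low when the estimated P(Low) is at or above the threshold."""
    return ["Low" if p >= threshold else "High" for p in p_low]


def binary_metrics(y_true, y_pred, positive="High"):
    """Accuracy, precision, recall, specificity and F1 from confusion counts.

    Assumption: `positive` names the positive class (High by default).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "Accuracy": ratio(tp + tn, len(pairs)),
        "Precision": ratio(tp, tp + fp),
        "Recall": ratio(tp, tp + fn),
        "Specificity": ratio(tn, tn + fp),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up values for illustration only; not taken from the Auto data set.
sample_mpg = [18.0, 36.1, 30.0, 31.5, 24.0, 33.0, 27.2, 15.0]
sample_p_low = [0.95, 0.20, 0.55, 0.45, 0.80, 0.58, 0.52, 0.99]

y_true = [label_mpg(m) for m in sample_mpg]
print(class_percentages(y_true))

for threshold in (0.5, 0.6):
    y_pred = predict_with_low_threshold(sample_p_low, threshold)
    print(threshold, binary_metrics(y_true, y_pred))
```

Raising the Low threshold from 0.5 to 0.6 means fewer rows are predicted Low, so recall for High can only stay the same or rise while specificity can only stay the same or fall. That is the trade-off steps 6 and 7 ask you to report.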