Random Forest classification in Python: interpret the code and prepare a report and presentation


Question


Posted on Jul 31, 2024


# Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=123)
# 30% of the given data is used as testing data; the remaining 70% is training data.
# The rows are assigned to each part at random.
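The 70/30 split described above can be checked on stand-in data. This is a minimal sketch, assuming a hypothetical 100-row array in place of `X_new` and `y`; the `stratify` argument is not part of the original call and is shown only because it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 4 features, 80/20 class imbalance.
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.array([0] * 80 + [1] * 20)

# Same test_size and random_state as the original call; stratify is an added assumption.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])
print("share of class 1 in train:", y_tr.mean(), "in test:", y_te.mean())
```

Without `stratify`, the row counts are the same but the class shares in the two parts can drift apart, which matters when one class is rare.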

# %%
from sklearn.ensemble import RandomForestClassifier

rf_default = RandomForestClassifier(random_state=123)
rf_default.fit(X_train, y_train)
y_predict_rf = rf_default.predict(X_test)

# %%
from sklearn import metrics
import seaborn as sns

def evaluate_model(y_predict, y_test):
    # Evaluate the performance of the model using the test data.
    # Use accuracy score, precision, recall and confusion matrix as performance metrics.
    confusion_matrix_ = metrics.confusion_matrix(y_test, y_predict)
    sns.heatmap(confusion_matrix_, annot=True, fmt="d")
    print("Accuracy:", "{:.2f}".format(metrics.accuracy_score(y_test, y_predict)),
          "Precision:", "{:.2f}".format(metrics.precision_score(y_test, y_predict)),
          "Recall:", "{:.2f}".format(metrics.recall_score(y_test, y_predict)),
          "Confusion Matrix:")
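To make the metrics printed by `evaluate_model` concrete, here is a small sketch that derives accuracy, precision and recall from the four confusion-matrix counts by hand. The helper `binary_metrics` and the demo labels are hypothetical and only illustrate the definitions; the original code relies on `sklearn.metrics` instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        # Rows are the true class, columns the predicted class.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels: 4 positives and 4 negatives.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1, 1, 1]
scores = binary_metrics(y_true_demo, y_pred_demo)
print(scores)
```

Note that `precision_score` and `recall_score` in scikit-learn treat the label `1` as the positive class by default, so string labels would need an explicit `pos_label`.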

# %%
evaluate_model(y_predict_rf, y_test)
# We obtain the highest level of accuracy, precision and recall. However, we can use grid-search cross-validation to check the model's performance again.
# According to Breiman (2001), who proposed Random Forest, max_features and n_estimators are the most important parameters of Random Forest. We can try to optimize them.
# In addition, we may try to balance the class weights to overcome the imbalanced-data problem.
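The class-weight balancing mentioned above can be illustrated with the formula that scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count_of_class)`. The sketch below is a hand-rolled version on hypothetical labels; `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced labels: 80 of one class, 20 of the other.
labels_demo = ["e"] * 80 + ["p"] * 20
weights = balanced_class_weights(labels_demo)
print(weights)
```

The rarer class receives the larger weight. The `"balanced_subsample"` option used later in the code applies the same formula, but recomputes the weights on the bootstrap sample drawn for each tree.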

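# %%
# Note: "auto" was removed from max_features in scikit-learn 1.3;
# for classifiers it was equivalent to "sqrt".
params = {
    'max_features': ["auto", "sqrt", "log2"],
    'n_estimators': [300, 500, 700, 1000]
}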

# %%
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rf_default = RandomForestClassifier(class_weight="balanced_subsample", random_state=123)
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
grid_search = GridSearchCV(rf_default, params, n_jobs=-1, cv=stratified_kfold, verbose=2)
grid_search_results = grid_search.fit(X_new, y.values.ravel())
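The reason for passing a `StratifiedKFold` object as `cv` is that each validation fold keeps the class ratio of the full data. A minimal sketch on hypothetical labels (80 negatives, 20 positives) shows this; only the fold settings are taken from the code above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; the features do not influence how the folds are drawn.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

fold_sizes = []
fold_positive_counts = []
for _, test_idx in skf.split(X_demo, y_demo):
    fold_sizes.append(len(test_idx))
    fold_positive_counts.append(int(y_demo[test_idx].sum()))

print("fold sizes:", fold_sizes)
print("positives per fold:", fold_positive_counts)
```

Every fold holds 10 rows with 2 positives, so each cross-validation score is measured on the same 80/20 mix as the whole data set.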

# %%
target = 'class'
X = mushroom_data.drop(columns=[target])
y = mushroom_data[target]
print(f'Y shape = {y.shape}')
print(f'X shape = {X.shape}')

# %%
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'shape of X Train = {X_train.shape}')
print(f'shape of X Test = {X_test.shape}')
print(f'shape of Y Train = {y_train.shape}')
print(f'shape of Y Test = {y_test.shape}')

# %%
acc_baseline = y_train.value_counts(normalize=True).max()
print(f'Accuracy of baseline = {acc_baseline}')
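The baseline computed above is the accuracy of a model that always predicts the most frequent training class. The following sketch, on hypothetical label counts, shows that the `value_counts(normalize=True).max()` expression matches this reading and how the same rule scores on held-out labels.

```python
import pandas as pd

# Hypothetical label series standing in for y_train and y_test.
y_train_demo = pd.Series(["e"] * 52 + ["p"] * 48)
y_test_demo = pd.Series(["e"] * 11 + ["p"] * 9)

# Share of the most frequent class in the training labels.
baseline_train = y_train_demo.value_counts(normalize=True).max()

# Accuracy of always predicting that class on the test labels.
majority_class = y_train_demo.value_counts().idxmax()
baseline_test = (y_test_demo == majority_class).mean()

print(majority_class, baseline_train, baseline_test)
```

Any trained classifier should beat this number; if it does not, the model has learned nothing beyond the class frequencies.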

# %%
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

clf = make_pipeline(OrdinalEncoder(), RandomForestClassifier(random_state=42))
params = {
    'randomforestclassifier__n_estimators': range(25, 100, 25),
    'randomforestclassifier__max_depth': range(10, 70, 10)
}
params
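The keys in `params` follow the `step__parameter` naming that pipelines use: `make_pipeline` names each step after its lowercased class name, and a double underscore separates the step name from the parameter. A short sketch, using the same two estimators, lists those names; `demo_clf` is an illustrative variable name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

demo_clf = make_pipeline(OrdinalEncoder(), RandomForestClassifier(random_state=42))

# Step names are generated automatically from the class names.
step_names = list(demo_clf.named_steps)

# Every tunable forest parameter is exposed under the "randomforestclassifier__" prefix.
forest_params = sorted(
    key for key in demo_clf.get_params() if key.startswith("randomforestclassifier__")
)

print(step_names)
print(forest_params[:5])
```

A misspelled key in the grid raises an error at fit time, so printing the available names first is a cheap check.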

# %%
# summarize results
print("Best: %f using %s" % (grid_search_results.best_score_, grid_search_results.best_params_))

# %%
from sklearn.model_selection import GridSearchCV

model = GridSearchCV(
    clf,
    param_grid=params,
    cv=5,
    n_jobs=-1,
    verbose=1
)
model
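It helps to know how much work this grid search implies before running it. The sketch below expands the two `range` objects from the parameter grid and counts the candidate settings and cross-validation fits, assuming the `cv=5` setting shown above; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the pipeline cell, copied here so the sketch is self-contained.
param_grid_demo = {
    "randomforestclassifier__n_estimators": range(25, 100, 25),  # 25, 50, 75
    "randomforestclassifier__max_depth": range(10, 70, 10),      # 10, 20, ..., 60
}

# Every combination of the listed values is one candidate model.
candidates = [
    dict(zip(param_grid_demo, combo)) for combo in product(*param_grid_demo.values())
]

n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

The grid search then refits the best candidate once more on the whole training set, so the actual number of model fits is one higher than `total_fits`.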

# %%
model.fit(X_train, y_train)

# %%
import pandas as pd

cv_results = pd.DataFrame(model.cv_results_)
cv_results.sort_values(by='rank_test_score')

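# %%
#evaluate_model(y_pred, y)
cv_results.sort_values(by='rank_test_score')
# Note: rf_model is defined here but never fitted; the lines below refit rf_default instead.
# max_features='auto' is rejected by scikit-learn 1.3 and later ("sqrt" is the equivalent for classifiers).
rf_model = RandomForestClassifier(class_weight="balanced_subsample", max_features='auto', n_estimators=300, random_state=123)
rf_default.fit(X, y)
y_pred = rf_default.predict(X)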

# %%
import matplotlib.pyplot as plt

features = X_test.columns
importances = model.best_estimator_.named_steps['randomforestclassifier'].feature_importances_
feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.tail().plot(kind='barh')
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance")
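The feature-importance step can be tried end to end on a toy table. This is a sketch under stated assumptions: the two columns `odor` and `cap` and their values are made up, with `odor` built to separate the classes perfectly and `cap` carrying no signal; only the pipeline layout mirrors the code above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data: "odor" determines the label, "cap" does not.
toy = pd.DataFrame({
    "odor": ["n", "n", "f", "f", "n", "f", "n", "f"],
    "cap":  ["x", "b", "x", "b", "x", "b", "b", "x"],
})
labels = ["e", "e", "p", "p", "e", "p", "e", "p"]

pipe = make_pipeline(
    OrdinalEncoder(),
    RandomForestClassifier(n_estimators=25, random_state=42),
)
pipe.fit(toy, labels)

# Impurity-based (Gini) importances, one value per input column.
forest = pipe.named_steps["randomforestclassifier"]
toy_importances = pd.Series(forest.feature_importances_, index=toy.columns).sort_values()
print(toy_importances)
```

The importances are normalized to sum to 1, so the bar chart shows each feature's share of the total impurity reduction rather than an absolute effect size.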

The Python code given above is related to Random Forest classification in the Data Science course.

Please interpret this code and prepare a report and presentation according to the subjects and the code.
