Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

import numpy as np; from collections import Counter; from sklearn import datasets, model_selection  # No other libraries will be imported; load the

import numpy as np
from collections import Counter
from sklearn import datasets, model_selection
# No other libraries will be imported
# Load the Iris Dataset, which contains 150 samples.
# Each sample has 4 features.
# The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
iris = datasets.load_iris()
X = np.array(iris.data)  # features, numeric attributes. [Sepal length, Sepal Width, Petal length, Petal width]
Y = np.array(iris.target)  # labels: class-0, class-1, class-2
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.25, random_state=0)
print("Train Shape:", X_train.shape)
# bug fix: this line previously printed the label "Train Shape:" for the TEST split
print("Test Shape:", X_test.shape)
3. Calculate the Information Gain for each (numeric) attribute, and show the feature that should be used first when building a decision tree.
step-1: find the best cutpoint for each attribute (the value at which to split the data).
step-2: calculate the information gain for each attribute (this decides the order of attributes when building the decision tree).
#-------------------- Some helper functions ------------------------------
def entropy(probabilities: list) -> float:
    """Return the Shannon entropy H(X), in bits, of a probability distribution.

    Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
    """
    total = 0.0
    for p in probabilities:
        if p > 0:
            total -= p * np.log2(p)
    return total
def class_probabilities(labels: list) -> list:
    """Return the empirical probability P(Y=c) for each class present in *labels*.

    Order follows Counter insertion order (first occurrence of each class).
    An empty label list yields an empty result.
    """
    n = len(labels)
    counts = Counter(labels)
    return [count / n for count in counts.values()]
def data_entropy(labels: list) -> float:
    """Return the entropy H(Y), in bits, of a list of class labels.

    Equivalent to entropy(class_probabilities(labels)): class frequencies
    from a Counter are normalized to probabilities and fed to the
    Shannon-entropy sum (counts are always positive, so no term is skipped).
    """
    n = len(labels)
    frequencies = (count / n for count in Counter(labels).values())
    return sum(-p * np.log2(p) for p in frequencies if p > 0)
def split_data(data: np.array, feature_idx: int, feature_val: float) -> tuple:
    """Split *data* into two sub-groups on one attribute.

    Rows with data[:, feature_idx] < feature_val go to group1;
    all remaining rows (>= feature_val) go to group2.
    Returns (group1, group2).
    """
    below = data[:, feature_idx] < feature_val
    return data[below], data[~below]
def partition_entropy(g1_labels: list, g2_labels: list) -> float:
    """Return the weighted entropy H(Y | split) of a two-way partition.

    Each group's entropy is weighted by the fraction of samples it holds.
    """
    n1, n2 = len(g1_labels), len(g2_labels)
    n = n1 + n2
    weighted = (n1 / n) * data_entropy(g1_labels)
    weighted += (n2 / n) * data_entropy(g2_labels)
    return weighted
#-----------------------------------------------------------------------------------------
#---------------------------- Examples to use the Helper functions -----------------------
# H(Y) for the train and test label sets:
print(data_entropy(Y_train))
print(data_entropy(Y_test))
## attach labels as a final column so every split carries them along:
train_data = np.column_stack((X_train, Y_train))  # same as concatenating a reshaped Y_train
print(train_data.shape)
# split the data into two subgroups on feature 1 at threshold 3
group1, group2 = split_data(train_data, feature_idx=1, feature_val=3)
print(group1.shape)
print(group2.shape)
# weighted entropy of this split (labels live in the last column):
print(partition_entropy(group1[:, -1], group2[:, -1]))
#-----------------------------------------------------------------------------------------
#-------------------------------- Your implementation ------------------------------------
# For each attribute: scan candidate cutpoints (midpoints between consecutive
# sorted unique feature values), keep the cut with the highest information gain
#   IG(feature, cut) = H(Y_train) - H(Y_train | split at cut),
# then report the best cut and gain per attribute.
base_entropy = data_entropy(Y_train)  # H(Y) before any split
n_features = X_train.shape[1]
best_cutpoints = [None] * n_features  # best threshold found for each attribute
best_gains = [0.0] * n_features       # information gain at that threshold
for feature_idx in range(n_features):
    values = np.unique(X_train[:, feature_idx])
    # midpoints between consecutive distinct values cover every possible split
    candidates = (values[:-1] + values[1:]) / 2
    for feature_val in candidates:
        g1, g2 = split_data(train_data, feature_idx, feature_val)
        if len(g1) == 0 or len(g2) == 0:
            continue  # degenerate split: one side empty, no information
        gain = base_entropy - partition_entropy(g1[:, -1], g2[:, -1])
        if gain > best_gains[feature_idx]:
            best_gains[feature_idx] = gain
            best_cutpoints[feature_idx] = feature_val
#----------------------------------- Printing --------------------------------------------
# print the calculated cutpoint [feature_val] and information gain for each attribute.
for feature_idx in range(n_features):
    print(f"feature {feature_idx}: best cutpoint = {best_cutpoints[feature_idx]}, "
          f"information gain = {best_gains[feature_idx]:.4f}")
# print the feature that should be used first when building the decision tree.
best_feature = int(np.argmax(best_gains))
print(f"feature {best_feature} should be used first when building the decision tree")
Please help me complete this code

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Concepts of Database Management

Authors: Philip J. Pratt, Mary Z. Last

8th edition

1285427106, 978-1285427102

More Books

Students also viewed these Databases questions