Question

Posted on Aug 21, 2024

Solve this with Python

"Your task:

We have learned how to implement the k-means clustering algorithm in class. Now you are asked to write a variant of this clustering algorithm specifically designed to cluster categorical features. Note that k-means makes use of the Euclidean distance between points; however, Euclidean distance is not useful for categorical features. Our new algorithm will make use of the Hamming distance. The Hamming distance between two data objects is the number of categorical features that differ between the two instances. Suppose you are given two points $x_1$ and $x_2$ that have $p$ dimensions; then the distance between these two points is computed by

$$d(x_1, x_2) = \sum_{j=1}^{p} \delta(x_{1j}, x_{2j}), \qquad \delta(x_{1j}, x_{2j}) = \begin{cases} 0 & \text{if } x_{1j} = x_{2j} \\ 1 & \text{if } x_{1j} \neq x_{2j} \end{cases}$$

For example, let $x_1$ and $x_2$ be two points with three features that agree on features 1 and 2; then the Hamming distance is 1, since there is one difference (feature 3 is C for $x_1$ and D for $x_2$).
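The distance defined above can be sketched in a few lines of NumPy. This is only an illustration: the function name `hamming_distance` is not from the assignment template, and the values A and B used for the first two features of the example points are placeholders chosen so that only feature 3 differs.

```python
import numpy as np

def hamming_distance(x1, x2):
    """Count the features on which two categorical points differ."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    # Element-wise comparison gives True wherever the categories differ.
    return int(np.sum(x1 != x2))

# Placeholder points: identical in features 1 and 2, different in feature 3.
a = ["A", "B", "C"]
b = ["A", "B", "D"]
print(hamming_distance(a, b))  # 1
```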

This algorithm works very similarly to the k-means algorithm. However, instead of computing the mean value of each feature to determine the centroids, the most frequent value of each feature is selected to define the cluster; the resulting point is called the cluster core. Initially, k sample points are randomly selected as the cluster cores, and then each point is assigned to its closest core to form the clusters. Please be aware that closeness is now defined by the Hamming distance. After the assignment, new cluster cores are computed by determining the most frequent value of each feature within each cluster. For instance, let's say we have the following 5 observations that are clustered together.

Point   Feature 1   Feature 2
1       A           B
2       B           C
3       C           A
4       A           C
5       A           D

The new core is mostfrequent(feature 1) = A, mostfrequent(feature 2) = C, i.e., the core = (A, C).
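The core computation for the five observations above can be reproduced with a short sketch. The helper name `compute_core` is hypothetical, and the tie-breaking rule (the alphabetically first value wins when several values are equally frequent) is an assumption, since the assignment text does not say how ties should be resolved.

```python
import numpy as np

def compute_core(points):
    """Return the most frequent value of each feature (column)."""
    points = np.asarray(points)
    core = []
    for j in range(points.shape[1]):
        # np.unique returns the sorted distinct values and their counts.
        values, counts = np.unique(points[:, j], return_counts=True)
        # argmax picks the first maximum, so ties go to the smallest value.
        core.append(values[np.argmax(counts)])
    return np.array(core)

cluster = np.array([
    ["A", "B"],
    ["B", "C"],
    ["C", "A"],
    ["A", "C"],
    ["A", "D"],
])
print(compute_core(cluster))  # ['A' 'C']
```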

In order to quantify how well the clustering algorithm performed, we use a different cost function from k-means clustering. We compute the total within-cluster Hamming distance of all points from their core,

$$W(C_k) = \sum_{x_i \in C_k} d(x_i, c_k),$$

where $C_k$ is the $k$-th cluster, $c_k$ is its core, and $d$ is the Hamming distance. The cost function is defined as the sum of these values over all $K$ clusters:

$$\sum_{k=1}^{K} W(C_k)$$
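As a rough sketch, the total cost can be computed in one vectorized comparison. The name `total_cost` and the convention that cluster labels are integers 0, ..., K-1 indexing into the array of cores are assumptions made for illustration, not part of the assignment template.

```python
import numpy as np

def total_cost(X, labels, cores):
    """Sum of Hamming distances between every point and the core of its cluster."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    cores = np.asarray(cores)
    # cores[labels] lines each point up with the core of its own cluster,
    # so counting mismatches gives the sum of the within-cluster distances.
    return int(np.sum(X != cores[labels]))

# The five observations from the example, all in cluster 0 with core (A, C).
X = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
labels = np.zeros(5, dtype=int)
cores = np.array([["A", "C"]])
print(total_cost(X, labels, cores))  # 5
```

The per-point distances here are 1, 1, 2, 0 and 1, which add up to 5.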

The algorithm can be summarized as follows:

1. K initial data points are randomly selected as the cores.
2. Assign each data point to the closest core (use the Hamming distance).
3. Compute the new cores of the clusters by using the most frequent values of all the data points that belong to each cluster.
4. Repeat steps 2-3 until there is no change in the total cost $\sum_{k=1}^{K} W(C_k)$, the sum of the within-cluster Hamming distances (it needs to be computed with the Hamming distance).

The above procedure needs to be replicated several times with random initial starts, and the one that gives the minimum total cost $\sum_{k=1}^{K} W(C_k)$ is chosen to be the output.
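Step 2 is the part that is easiest to get wrong, so here is a minimal sketch of the assignment step using NumPy broadcasting. The function name `assign_to_cores` is hypothetical, and sending ties to the core with the lowest index is an assumption; the assignment text does not specify a tie-breaking rule.

```python
import numpy as np

def assign_to_cores(X, cores):
    """Assign every point to its closest core under the Hamming distance."""
    X = np.asarray(X)
    cores = np.asarray(cores)
    # distances[i, k] = number of features on which point i differs from core k.
    distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
    # argmin returns the first minimum, so ties go to the lowest core index.
    return distances.argmin(axis=1), distances

X = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
cores = np.array([["A", "B"], ["B", "C"]])
labels, distances = assign_to_cores(X, cores)
print(labels)  # [0 1 0 0 0]
```

Points 3 and 4 are equally far from both cores in this example, which is why they end up in cluster 0 under the tie-breaking assumption.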

Function Parameters:

X: The data set; it needs to be a numpy array.

number_cluster: The number of clusters.

replication_number: The number of times the algorithm is replicated (in the class example we had 10).

epsilon: A small number used to stop a replication when the difference between the previous cost and the current cost falls below it (remember that we selected 0.01 in class).

Function Outputs:

best_cost: The minimum cost among the replications.

best_cluster: The best cluster assignment among the replications; it needs to be a numpy array.

best_core: The best cores among the replications; they need to be a numpy array.
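Putting the pieces together, one possible shape for the function is sketched below. It is not the template solution: the assignment requires the template's `initial_cores` function and its fixed random seed, neither of which is shown here, so `_stand_in_initial_cores` and the `seed` argument are hypothetical placeholders. The tie-breaking rules and the choice to keep the old core when a cluster becomes empty are also assumptions.

```python
import numpy as np

def _stand_in_initial_cores(X, number_cluster, rng):
    """Hypothetical placeholder for the template's initial_cores function."""
    idx = rng.choice(X.shape[0], size=number_cluster, replace=False)
    return X[idx].copy()

def _most_frequent(column):
    """Most frequent value of one feature; ties go to the smallest value."""
    values, counts = np.unique(column, return_counts=True)
    return values[np.argmax(counts)]

def k_core_clustering(X, number_cluster, replication_number=10, epsilon=0.01, seed=0):
    X = np.asarray(X)
    rng = np.random.default_rng(seed)  # assumption: the template sets its own seed
    best_cost, best_cluster, best_core = np.inf, None, None

    for _ in range(replication_number):
        cores = _stand_in_initial_cores(X, number_cluster, rng)
        previous_cost = np.inf
        while True:
            # Step 2: assign each point to the closest core (Hamming distance).
            distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
            labels = distances.argmin(axis=1)
            # Step 3: recompute each core from the most frequent feature values.
            for k in range(number_cluster):
                members = X[labels == k]
                if len(members) > 0:  # assumption: an empty cluster keeps its core
                    cores[k] = [_most_frequent(members[:, j]) for j in range(X.shape[1])]
            # Total cost: within-cluster Hamming distances summed over clusters.
            cost = int((X != cores[labels]).sum())
            # Step 4: stop once the cost no longer improves by at least epsilon.
            if previous_cost - cost < epsilon:
                break
            previous_cost = cost
        if cost < best_cost:
            best_cost, best_cluster, best_core = cost, labels.copy(), cores.copy()

    return best_cost, best_cluster, best_core

# Small made-up data set with two visible groups (not one of the HW test sets).
X_demo = np.array([["A", "X"], ["A", "X"], ["A", "Y"],
                   ["B", "Z"], ["B", "Z"], ["B", "W"]])
cost, clusters, found_cores = k_core_clustering(X_demo, number_cluster=2)
print(cost, clusters, found_cores, sep="\n")
```

Because the initial cores come from a stand-in, labels and costs from this sketch should not be expected to match the sample solutions; swapping in the template's `initial_cores` and seed is what makes results comparable.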

Instructions:

Two sample data sets are provided as HW6TestSet1.csv and HW6TestSet2.csv, and sample solutions are given below. Please note that the labels representing the clusters might be a bit different from what you see here, but the cost should be similar. Please use the template uploaded to Ninova, read the comments in the template, and strictly follow the instructions. Do the necessary testing of your function in a Python file other than the one in which your function is written. Use the initial_cores function provided in the template (it takes number_cluster as its second argument) to make your initial cluster assignments! Do not write your own initialization function, as it will produce different results. Also, do not change the random number seed set in the k_core_clustering function."
