
Question


Posted on Sep 21, 2024


Your task:

We have learned how to implement the k-means clustering algorithm in class. Now, you are asked to write a variant of this clustering algorithm specifically designed to cluster categorical features. Note that k-means makes use of the Euclidean distance between the points; however, Euclidean distance is not useful for categorical features. Our new algorithm will make use of the Hamming distance. The Hamming distance between two data objects is the number of categorical features that differ between the two instances. Suppose you are given two points $x_1$ and $x_2$ that have $p$ dimensions; then the distance between these two points is computed by

$$d(x_1, x_2) = \sum_{j=1}^{p} \delta(x_{1j}, x_{2j}),$$

$$\delta(x_{1j}, x_{2j}) = \begin{cases} 0 & \text{if } x_{1j} = x_{2j} \\ 1 & \text{if } x_{1j} \neq x_{2j} \end{cases}$$

For example, let $x_1$ and $x_2$ be two points with three features that agree on features 1 and 2; then the Hamming distance is 1, since there is one difference (feature 3 is C for $x_1$ and D for $x_2$).
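As a minimal sketch of the distance just described, the function below counts mismatching features with NumPy. The function name `hamming_distance` is hypothetical, and the values "A" and "B" used for the first two features are illustrative assumptions; the text only fixes feature 3 (C versus D).

```python
import numpy as np

def hamming_distance(x1, x2):
    """Count the features on which two equal-length categorical points differ."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    if x1.shape != x2.shape:
        raise ValueError("points must have the same number of features")
    # Element-wise comparison yields True wherever the two points disagree.
    return int(np.sum(x1 != x2))

# Assumed example points: equal on features 1 and 2, different on feature 3.
x1 = ["A", "B", "C"]
x2 = ["A", "B", "D"]
print(hamming_distance(x1, x2))  # 1
```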

This algorithm works very similarly to the k-means algorithm. However, instead of computing the mean value of each feature to determine the centroids, the most frequent value of each feature is selected to define the cluster center, which is called the cluster core. Initially, k sample points are randomly selected as the cluster cores, then each point is assigned to the core closest to it to form the clusters. Please be aware that closeness is now defined by the Hamming distance. After the assignment, new cluster cores are computed by determining the most frequent value of each feature within each cluster. For instance, let's say we have the following 5 observations that are clustered together.

Point   Feature 1   Feature 2
1       A           B
2       B           C
3       C           A
4       A           C
5       A           D

The new core is mostfrequent(feature 1) = A, mostfrequent(feature 2) = C, i.e., the core = (A, C).
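The core update can be sketched as a per-column mode. The helper name `compute_core` is hypothetical, and the tie-breaking rule (the smallest value in sorted order wins) is an assumption, since the text does not say how ties should be resolved.

```python
import numpy as np

def compute_core(cluster_points):
    """Return the most frequent value of each feature (column) in a cluster."""
    cluster_points = np.asarray(cluster_points)
    core = []
    for column in cluster_points.T:
        values, counts = np.unique(column, return_counts=True)
        # np.unique sorts the values, so argmax breaks ties by the smallest value
        # (an assumed rule, not one stated in the task).
        core.append(values[np.argmax(counts)])
    return np.array(core)

# The five observations from the table above.
points = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
print(compute_core(points))  # ['A' 'C']
```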

In order to quantify how well the clustering algorithm performed, we use a different cost function from k-means clustering. We compute the total within-cluster Hamming distance of all points from their core:

$$W(C_k) = \sum_{x_i \in C_k} d(x_i, c_k),$$

where $C_k$ is the $k$-th cluster and $c_k$ is its core. The cost function is defined as the sum of these values over all clusters:

$$\sum_{k=1}^{K} W(C_k)$$
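A short sketch of this cost, assuming the cluster assignment is stored as an integer label per point and the cores as one row per cluster. The names `total_cost`, `labels`, and `cores` are hypothetical.

```python
import numpy as np

def total_cost(X, labels, cores):
    """Sum of Hamming distances from every point to the core of its cluster."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    cores = np.asarray(cores)
    # cores[labels] lines up each point with its own core, row by row, so
    # counting mismatches gives the sum over clusters of within-cluster distances.
    return int(np.sum(X != cores[labels]))

# The five tabulated observations, all in cluster 0 with core (A, C).
X = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
labels = np.zeros(len(X), dtype=int)
cores = np.array([["A", "C"]])
print(total_cost(X, labels, cores))  # 5
```

The per-point distances here are 1, 1, 2, 0, and 1, which add up to 5.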

The algorithm can be summarized as follows:

1. K initial data points are randomly selected as the cores.
2. Assign each data point to the closest core (use Hamming distance).
3. Compute the new cores for the clusters by using the most frequent values among all the data points that belong to each cluster.
4. Repeat steps 2-3 until there is no change in the total cost $\sum_{k=1}^{K} W(C_k)$ (which needs to be computed with the Hamming distance).
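The four steps above can be sketched as a single run from one random start. This is an illustrative sketch, not a reference solution: the name `run_once`, the `rng` and `max_iter` arguments, and the rule that an empty cluster keeps its old core are all assumptions that the task statement does not specify.

```python
import numpy as np

def run_once(X, number_cluster, epsilon=0.01, rng=None, max_iter=100):
    """One run of steps 1-4 from a single random start (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    # Step 1: pick k distinct data points as the initial cores.
    cores = X[rng.choice(len(X), size=number_cluster, replace=False)].copy()
    previous_cost = np.inf
    for _ in range(max_iter):
        # Step 2: Hamming distance from every point to every core, shape (n, k).
        distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each core as the per-feature most frequent value.
        for k in range(number_cluster):
            members = X[labels == k]
            if len(members) == 0:
                continue  # assumption: an empty cluster keeps its old core
            for j in range(X.shape[1]):
                values, counts = np.unique(members[:, j], return_counts=True)
                cores[k, j] = values[counts.argmax()]
        # Step 4: stop once the total cost changes by less than epsilon.
        cost = int((X != cores[labels]).sum())
        if abs(previous_cost - cost) < epsilon:
            break
        previous_cost = cost
    return cost, labels, cores

X = np.array([["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"]])
cost, labels, cores = run_once(X, number_cluster=2, rng=np.random.default_rng(0))
print(cost, labels, cores)
```

Because the start is random, a single run can settle on a poor clustering (for instance when both initial cores are identical points), which is why the result of one run should not be trusted on its own.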

The above procedure needs to be replicated several times with random initial starts, and the run that gives the minimum total cost $\sum_{k=1}^{K} W(C_k)$ is chosen as the output.

Function Parameters:

X: The data set, which needs to be a numpy array.

number_cluster: Number of clusters.

replication_number: The number of times the algorithm is replicated (in the class example we had 10).

epsilon: A small number used to stop the replication when the difference between the previous cost and the current cost falls below it (remember that we selected 0.01 in class).

Function Outputs:

best_cost: The minimum cost among all replications.

best_cluster: The best cluster assignment among all replications; needs to be a numpy array.

best_core: The best cores among all replications; needs to be a numpy array.
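One possible shape for a function with these parameters and outputs is sketched below. It is a sketch under stated assumptions rather than the expected solution: the function name `hamming_kmeans`, the helper `_mode_per_feature`, and the extra `seed` and `max_iter` arguments are not part of the task statement, and ties and empty clusters are handled by assumed rules noted in the comments.

```python
import numpy as np

def _mode_per_feature(points):
    """Most frequent value of each column; ties go to the smallest value (assumed)."""
    core = []
    for column in points.T:
        values, counts = np.unique(column, return_counts=True)
        core.append(values[counts.argmax()])
    return np.array(core)

def hamming_kmeans(X, number_cluster, replication_number=10, epsilon=0.01,
                   seed=None, max_iter=100):
    """Hypothetical name; returns (best_cost, best_cluster, best_core)."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    best_cost, best_cluster, best_core = np.inf, None, None
    for _replication in range(replication_number):
        # Random start: k distinct data points act as the initial cores.
        cores = X[rng.choice(len(X), size=number_cluster, replace=False)].copy()
        previous_cost = np.inf
        for _iteration in range(max_iter):
            # Assign every point to its closest core by Hamming distance.
            distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
            labels = distances.argmin(axis=1)
            # Recompute cores; an empty cluster keeps its old core (assumption).
            for k in range(number_cluster):
                if np.any(labels == k):
                    cores[k] = _mode_per_feature(X[labels == k])
            cost = int((X != cores[labels]).sum())
            if abs(previous_cost - cost) < epsilon:
                break
            previous_cost = cost
        # Keep the replication with the lowest total cost.
        if cost < best_cost:
            best_cost, best_cluster, best_core = cost, labels.copy(), cores.copy()
    return best_cost, best_cluster, best_core

X = np.array([["A", "B"], ["A", "B"], ["A", "C"], ["D", "E"], ["D", "E"], ["D", "F"]])
best_cost, best_cluster, best_core = hamming_kmeans(X, number_cluster=2, seed=0)
print(best_cost)
print(best_cluster)
print(best_core)
```

Passing a fixed `seed` makes the random starts reproducible, which helps when comparing runs; leaving it as `None` gives a fresh random sequence on each call.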
