
Question

Your task:
We have learned how to implement the k-means clustering algorithm in class. Now, you are asked to write
a variant of this clustering algorithm specifically designed to cluster categorical features. Note that
k-means makes use of the Euclidean distance between the points; however, Euclidean distance is not
useful for categorical features. Our new algorithm will make use of the Hamming distance. The
Hamming distance between two data objects is the number of categorical features that differ between
the two instances. Suppose you are given two points $x_1$ and $x_2$ that have $d$ dimensions; then the
distance between these two points is computed by

$$d_H(x_1, x_2) = \sum_{j=1}^{d} \delta(x_{1j}, x_{2j}), \qquad \delta(x_{1j}, x_{2j}) = \begin{cases} 0 & \text{if } x_{1j} = x_{2j} \\ 1 & \text{if } x_{1j} \neq x_{2j} \end{cases}$$

For example, if $x_1$ and $x_2$ have three features and agree on the first two, then the Hamming
distance is 1, since there is one difference (feature 3 is C for $x_1$ and D for $x_2$).
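
The definition above can be sketched in a few lines of NumPy. This is only an illustration: the function name `hamming_distance` is hypothetical, and the values of the first two features in the sample points are assumed (the statement only fixes feature 3 as C and D).

```python
import numpy as np

def hamming_distance(x1, x2):
    """Count the features on which two categorical points differ."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    return int(np.sum(x1 != x2))

# Assumed sample values: the first two features agree, the third differs.
p1 = ["A", "B", "C"]
p2 = ["A", "B", "D"]
print(hamming_distance(p1, p2))  # 1
```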
This algorithm works very similarly to the k-means algorithm. However, instead of computing the mean
value of each feature to determine the centroids, the most frequent value of each feature is selected
to define the cluster center, which is called the cluster core. Initially, k sample points are randomly
selected as the cluster cores; then each point is assigned to its closest core to form the clusters.
Note that closeness is now defined by the Hamming distance. After the assignment, new cluster cores
are computed by determining the most frequent value of each feature within each cluster. For instance,
let's say we have the following 5 observations that are clustered together.
Point   Feature 1   Feature 2
1       A           B
2       B           C
3       C           A
4       A           C
5       A           D
The new core is mostfrequent(feature 1) = A, mostfrequent(feature 2) = C, i.e., the core = (A, C). In order to
quantify how well the clustering algorithm performed, we use a different cost function from k-means
clustering. We compute the total within-cluster Hamming distance of all points from their core,

$$J(C_k) = \sum_{x_i \in C_k} d_H(x_i, c_k),$$

where $C_k$ is the $k$-th cluster, $c_k$ is its core, and $d_H$ is the Hamming distance. The cost
function is defined as the sum of these values over all $K$ clusters:

$$\sum_{k=1}^{K} J(C_k)$$
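
As a quick check of the core and the within-cluster cost, here is a minimal sketch applied to the five observations listed above. The helper names `cluster_core` and `cluster_cost` are hypothetical, and the tie-breaking rule (the first value in sorted order wins) is an assumption, since the statement does not say how ties are resolved.

```python
import numpy as np

def cluster_core(points):
    """Most frequent value of each feature (column) among the cluster's points."""
    core = []
    for column in points.T:
        values, counts = np.unique(column, return_counts=True)
        # Assumption: on a tie, the first value in sorted order is used.
        core.append(values[np.argmax(counts)])
    return np.array(core)

def cluster_cost(points, core):
    """Total Hamming distance from every point in the cluster to the core."""
    return int(np.sum(points != core))

cluster = np.array([["A", "B"],
                    ["B", "C"],
                    ["C", "A"],
                    ["A", "C"],
                    ["A", "D"]])
core = cluster_core(cluster)
print(core)                         # ['A' 'C']
print(cluster_cost(cluster, core))  # 5
```

The per-point distances to the core (A, C) are 1, 1, 2, 0 and 1, which add up to the within-cluster cost of 5.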
The algorithm can be summarized as follows:
1. K initial data points are randomly selected as the cores.
2. Assign each data point to the closest core (use the Hamming distance).
3. Compute the new cores for the clusters by using the most frequent values of all the data
points that belong to each cluster.
4. Repeat steps 2-3 until there is no change in the total cost $\sum_{k=1}^{K} J(C_k)$, where
$J(C_k)$ is the within-cluster cost of cluster $k$ (it needs to be computed with the Hamming
distance).
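
The assignment in step 2 can be done for all points at once with NumPy broadcasting. The following is a sketch under assumptions: `assign_to_cores` is a hypothetical helper, the two cores are picked by hand for illustration, and ties are broken in favour of the lowest cluster index because that is what `argmin` does.

```python
import numpy as np

def assign_to_cores(X, cores):
    """Return the index of the closest core for every point, plus all distances."""
    # distances[i, k] = Hamming distance between point i and core k
    distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
    labels = distances.argmin(axis=1)  # ties go to the lowest cluster index
    return labels, distances

X = np.array([["A", "B"],
              ["B", "C"],
              ["C", "A"],
              ["A", "C"],
              ["A", "D"]])
cores = np.array([["A", "C"],   # hand-picked cores, for illustration only
                  ["C", "A"]])
labels, distances = assign_to_cores(X, cores)
print(labels)  # [0 0 1 0 0]
```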
The above procedure needs to be replicated several times with random initial starts, and the
replication that gives the minimum total cost $\sum_{k=1}^{K} J(C_k)$ is chosen as the output.
Function Parameters:
X: This is the data set, which needs to be a numpy array.
number_cluster: Number of clusters
replication_number: This is the parameter that tells the number of times the algorithm is replicated
(in the class example we had 10).
epsilon: A small number used to stop a replication when the difference between the previous cost and
the current cost is below epsilon (remember that we selected 0.01 in class).
Function Outputs:
best_cost: This is the minimum cost among many replications.
best_cluster: This is the best cluster among many replications, needs to be a numpy array.
best_core: These are the best cores among many replications; needs to be a numpy array.
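
One possible way to put the pieces together with the parameters and outputs listed above is sketched below. It is not a reference solution. The function name `categorical_kmeans`, the extra `max_iter` and `seed` arguments, the choice to keep the old core of an empty cluster, and returning `best_cluster` as one integer label per point are all assumptions that the statement does not specify.

```python
import numpy as np

def categorical_kmeans(X, number_cluster, replication_number=10, epsilon=0.01,
                       max_iter=100, seed=None):
    """Sketch of k-means for categorical data using the Hamming distance."""
    rng = np.random.default_rng(seed)  # seed is an added convenience (assumption)
    n_points, n_features = X.shape
    best_cost, best_cluster, best_core = np.inf, None, None

    for _rep in range(replication_number):
        # Step 1: pick K distinct data points as the initial cores.
        start = rng.choice(n_points, size=number_cluster, replace=False)
        cores = X[start].copy()
        previous_cost = np.inf

        for _it in range(max_iter):  # max_iter is a safety guard (assumption)
            # Step 2: assign every point to its closest core.
            distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
            labels = distances.argmin(axis=1)

            # Step 3: recompute each core as the per-feature most frequent value.
            for k in range(number_cluster):
                members = X[labels == k]
                if len(members) == 0:
                    continue  # assumption: an empty cluster keeps its old core
                for j in range(n_features):
                    values, counts = np.unique(members[:, j], return_counts=True)
                    cores[k, j] = values[np.argmax(counts)]

            # Total cost: Hamming distance of every point to its cluster's core.
            cost = int((X != cores[labels]).sum())

            # Step 4: stop when the cost no longer changes by at least epsilon.
            if abs(previous_cost - cost) < epsilon:
                break
            previous_cost = cost

        if cost < best_cost:
            best_cost, best_cluster, best_core = cost, labels.copy(), cores.copy()

    return best_cost, best_cluster, best_core

# Usage on a small made-up data set with two obvious groups.
data = np.array([["A", "A", "A"]] * 3 + [["B", "B", "B"]] * 3)
cost, cluster, core = categorical_kmeans(data, number_cluster=2,
                                         replication_number=30, seed=0)
print(cost)
print(cluster)
print(core)
```

Because the cost is an integer that never increases from one iteration to the next, any positive `epsilon` below 1 stops a replication as soon as the cost repeats.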
