Question
Your task:
We have learned how to implement the k-means clustering algorithm in class. Now you are asked to write
a variant of this clustering algorithm specifically designed to cluster categorical features. Note that
k-means uses the Euclidean distance between points; however, Euclidean distance is not
useful for categorical features. Our new algorithm will use the Hamming distance instead. The
Hamming distance between two data objects is the number of categorical features that differ between
the two instances. Suppose you are given two points $x$ and $y$ that each have $d$ dimensions; then the
distance between these two points is computed by
$$d_H(x, y) = \sum_{j=1}^{d} \mathbb{1}[x_j \neq y_j]$$
For example, if two points agree on every feature except one, where the first point has the value C and
the second has D, then the Hamming distance is 1, since exactly one feature differs.
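The distance defined above can be sketched in a few lines of NumPy. This is only an illustration: the function name `hamming_distance` and the two sample points are assumptions made here, not values from the assignment.

```python
import numpy as np

def hamming_distance(x, y):
    """Count the positions at which two equal-length category vectors differ."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape != y.shape:
        raise ValueError("points must have the same number of features")
    return int(np.sum(x != y))

# Hypothetical points for illustration: they differ only in the last feature.
p = np.array(["A", "B", "C"])
q = np.array(["A", "B", "D"])
print(hamming_distance(p, q))  # 1
```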
This algorithm works very similarly to the k-means algorithm. However, instead of computing the mean
values for each feature to determine the centroids, the most frequent value is selected to define each
cluster's representative, which is called the cluster core. Initially, k sample points are randomly
selected as the cluster cores; then each point is associated with its closest core to form the clusters.
Please be aware that closeness is now defined by the Hamming distance. After the assignment, new cluster
cores are computed by determining the most frequent value of each feature within each cluster. For
instance, let's say we have the following observations that are clustered together.
Point
(A, B)
(B, C)
(C, A)
(A, C)
(A, D)
The new core is (mostfrequent(A, B, C, A, A), mostfrequent(B, C, A, C, D)), i.e., the core is (A, C).
In order to quantify how well the clustering algorithm performed, we use a different cost function from
k-means clustering. We compute the total within-cluster Hamming distance of all points in cluster $C_k$
from their core $c_k$, where $d_H$ denotes the Hamming distance:
$$W_k = \sum_{x \in C_k} d_H(x, c_k)$$
The cost function is defined as the sum of these values over all clusters:
$$J = \sum_{k=1}^{K} W_k$$
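As a rough check of the core and cost definitions, the sketch below recomputes the core for the five observations listed above and sums their Hamming distances to it. The helper names `cluster_core` and `within_cluster_cost` are assumptions, and the tie-breaking rule (the smallest value in sorted order wins) is a choice made here, since the assignment does not specify one.

```python
import numpy as np

def cluster_core(points):
    """Return the most frequent value of each feature (column) in a cluster."""
    core = []
    for column in points.T:
        values, counts = np.unique(column, return_counts=True)
        # Assumption: on ties, argmax keeps the first (smallest sorted) value.
        core.append(values[np.argmax(counts)])
    return np.array(core)

def within_cluster_cost(points, core):
    """Sum of Hamming distances from every point in the cluster to the core."""
    return int(np.sum(points != core))

cluster = np.array([["A", "B"],
                    ["B", "C"],
                    ["C", "A"],
                    ["A", "C"],
                    ["A", "D"]])
core = cluster_core(cluster)
print(core)                                # ['A' 'C']
print(within_cluster_cost(cluster, core))  # 5
```

The per-point distances to the core (A, C) are 1, 1, 2, 0 and 1, which gives the within-cluster total of 5.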
The algorithm can be summarized as follows:
1. K initial data points are randomly selected as the cores.
2. Assign each data point to the closest core (use Hamming distance).
3. Compute the new cores for the clusters by using the most frequent observations of all the data
points that belong to each cluster.
4. Repeat steps 2 and 3 until there is no change in the cost (the cost needs to be computed with
Hamming distance).
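The assignment step in the list above can be done for all points at once with NumPy broadcasting. The following is a minimal sketch under assumptions: the function name `assign_to_cores` and the two example cores are hypothetical, and ties are broken in favour of the lowest core index.

```python
import numpy as np

def assign_to_cores(X, cores):
    """Return each point's closest core index and the total Hamming cost."""
    # distances[i, k] = number of features where point i differs from core k
    distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
    labels = distances.argmin(axis=1)  # ties go to the lowest core index
    cost = int(distances[np.arange(len(X)), labels].sum())
    return labels, cost

X = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
cores = np.array([["A", "C"], ["C", "A"]])  # hypothetical cores for illustration
labels, cost = assign_to_cores(X, cores)
print(labels, cost)  # [0 0 1 0 0] 3
```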
The above procedure needs to be replicated several times with random initial starts, and the replication
that gives the minimum cost is chosen to be the output.
Function Parameters:
X: The data set; it needs to be a NumPy array.
numbercluster: The number of clusters.
replicationnumber: The number of times the algorithm is replicated (recall the value used in the
class example).
epsilon: A small number used to stop iterating when the difference between the previous cost and the
current cost is below epsilon. Recall the value we selected in class.
Function Outputs:
bestcost: The minimum cost among all replications.
bestcluster: The cluster assignment from the best replication; it needs to be a NumPy array.
bestcore: The cores from the best replication; they need to be a NumPy array.
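Putting the pieces together, one possible sketch of a function with the parameters and outputs listed above is shown below. It is not the official solution. The function name `categorical_kmeans`, the optional `seed` argument, the handling of empty clusters, and stopping when the improvement is at most `epsilon` (rather than strictly below it, so that `epsilon = 0` also terminates) are all assumptions made for this illustration.

```python
import numpy as np

def categorical_kmeans(X, numbercluster, replicationnumber, epsilon, seed=None):
    """Cluster categorical rows of X using Hamming distance and per-feature modes."""
    rng = np.random.default_rng(seed)  # `seed` is an added convenience, not in the task
    n, d = X.shape
    bestcost, bestcluster, bestcore = np.inf, None, None

    for _ in range(replicationnumber):
        # Step 1: pick k distinct data points as the initial cores.
        cores = X[rng.choice(n, size=numbercluster, replace=False)].copy()
        previous_cost = np.inf
        while True:
            # Step 2: assign each point to the closest core by Hamming distance.
            distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
            labels = distances.argmin(axis=1)
            cost = distances[np.arange(n), labels].sum()
            # Step 4: stop once the cost no longer improves by more than epsilon.
            if previous_cost - cost <= epsilon:
                break
            previous_cost = cost
            # Step 3: recompute each core as the per-feature most frequent value.
            for k in range(numbercluster):
                members = X[labels == k]
                if len(members) == 0:
                    continue  # assumption: an empty cluster keeps its old core
                for j in range(d):
                    values, counts = np.unique(members[:, j], return_counts=True)
                    cores[k, j] = values[counts.argmax()]
        # Keep the replication with the smallest total cost.
        if cost < bestcost:
            bestcost, bestcluster, bestcore = int(cost), labels.copy(), cores.copy()

    return bestcost, bestcluster, bestcore

# Hypothetical toy data: two well-separated groups of categorical rows.
X = np.array([["A", "B"], ["A", "B"], ["A", "B"],
              ["C", "D"], ["C", "D"], ["C", "D"]])
bestcost, bestcluster, bestcore = categorical_kmeans(X, 2, 20, 1e-4, seed=0)
print(bestcost, bestcluster, bestcore)
```

Because every replication starts from randomly chosen cores, individual runs can end in different local minima; keeping the lowest-cost run is what the replication loop is for.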