
Question

Solve this with Python
"Your task:
We have learned how to implement the k-means clustering algorithm in class. Now, you are asked to write
a variant of this clustering algorithm specifically designed to cluster categorical features. Note that
k-means makes use of the Euclidean distance between the points; however, Euclidean distance is not
useful for categorical features. Our new algorithm will make use of the Hamming distance. The
Hamming distance between two data objects is the number of categorical features that differ between
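the two instances. Suppose you are given two points $x_1$ and $x_2$ that have $m$ dimensions; then the distance
between these two points is computed by

$$d_H(x_1, x_2) = \sum_{j=1}^{m} \delta(x_{1j}, x_{2j}),$$

$$\delta(x_{1j}, x_{2j}) = \begin{cases} 0 & \text{if } x_{1j} = x_{2j} \\ 1 & \text{if } x_{1j} \neq x_{2j} \end{cases}$$

For example, let $x_1 = [A, B, C]$ and $x_2 = [A, B, D]$; then the Hamming distance is 1, since there is one
difference (feature 3 is C for $x_1$ and D for $x_2$).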
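The distance defined above can be illustrated with a minimal Python sketch. The function name `hamming_distance` and the sample values are illustrative assumptions, not part of the assignment template; the two sample points differ only in their third feature (C versus D).

```python
import numpy as np


def hamming_distance(x1, x2):
    """Count the features on which two categorical points disagree."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    if x1.shape != x2.shape:
        raise ValueError("points must have the same number of features")
    # Element-wise comparison gives True wherever the feature values differ
    return int(np.sum(x1 != x2))


# Illustrative points: identical in features 1 and 2, different in feature 3
print(hamming_distance(["A", "B", "C"], ["A", "B", "D"]))  # 1
```

Comparing the arrays element-wise avoids an explicit loop over the features, which matters once the distance has to be evaluated for every point and every core.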
This algorithm works very similarly to the k-means algorithm. However, instead of computing the mean
values for each feature to determine the centroids, the most frequent value of each feature is selected to
define the cluster's representative, which is called the cluster core. Initially, k sample points are randomly
selected as the cluster cores; each point is then associated with its closest core to form the clusters. Please be
aware that closeness is now defined by the Hamming distance. After the assignment, new cluster cores
are computed by determining the most frequent value of each feature within each cluster. For instance,
let's say we have the following 5 observations that are
clustered together.
Point  Feature 1  Feature 2
1      A          B
2      B          C
3      C          A
4      A          C
5      A          D
The new core is mostfrequent(feature 1) = A, mostfrequent(feature 2) = C, i.e., the core = (A, C). In order to
quantify how well the clustering algorithm performed, we use a different cost function from k-means
clustering. We compute the total within-cluster Hamming distance of all points from their cores:

$$J(C_k) = \sum_{x_i \in C_k} d_H(x_i, c_k),$$

where $c_k$ is the core of cluster $C_k$. The cost function is defined as the sum of these values over all clusters:

$$\sum_{k=1}^{K} J(C_k)$$
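As a rough illustration of the core computation and the within-cluster cost, the sketch below reproduces the five-observation example. The helper names `most_frequent`, `compute_core`, and `cluster_cost` are hypothetical, and ties between equally frequent values are broken by sorted order, which is an assumption the assignment does not specify.

```python
import numpy as np


def most_frequent(values):
    """Return the most frequent value (ties go to the smallest in sorted order)."""
    uniques, counts = np.unique(values, return_counts=True)
    return uniques[np.argmax(counts)]


def compute_core(points):
    """Build the core from the most frequent value of each feature."""
    points = np.asarray(points)
    return np.array([most_frequent(points[:, j]) for j in range(points.shape[1])])


def cluster_cost(points, core):
    """Sum of Hamming distances from every point in the cluster to its core."""
    return int(np.sum(np.asarray(points) != np.asarray(core)))


# The five observations clustered together in the example
cluster = np.array([["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"], ["A", "D"]])
core = compute_core(cluster)
print(core)                         # ['A' 'C']
print(cluster_cost(cluster, core))  # 5
```

The cost of 5 comes from the per-point distances 1, 1, 2, 0, and 1 to the core (A, C).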
The algorithm can be summarized as follows:
1. K initial data points are randomly selected as the cores.
2. Assign each data point to the closest core (use the Hamming distance).
3. Compute the new cores for the clusters by using the most frequent observations of all the data
points that belong to each cluster.
4. Repeat steps 2-3 until there is no change in $\sum_{k=1}^{K} J(C_k)$ (which needs to be computed with the Hamming
distance).
The above procedure needs to be replicated several times with random initial starts, and the one that
gives the minimum $\sum_{k=1}^{K} J(C_k)$ is chosen to be the output.
Function Parameters:
X: This is the data set that needs to be a numpy array.
number_cluster: Number of clusters
replication_number: This is the parameter that tells the number of times the algorithm is replicated
(in class example we had 10).
epsilon: A small number used to stop a replication when the difference between the previous cost and the current
cost falls below epsilon (remember that we selected 0.01 in class).
Function Outputs:
best_cost: This is the minimum cost among many replications.
best_cluster: This is the best cluster among many replications, needs to be a numpy array.
best_core: These are the best cores among many replications; needs to be a numpy array.
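One possible shape for a function with these parameters and outputs is sketched below. It is not the course template: `initial_cores_stub` is a hypothetical stand-in for the template's `initial_cores(X, number_cluster)`, the `seed` and `max_iter` arguments are illustrative extras, and keeping the old core for an empty cluster is an assumption. Results from this sketch will therefore differ from the sample solutions.

```python
import numpy as np


def initial_cores_stub(X, number_cluster, rng):
    """HYPOTHETICAL stand-in for the template's initial_cores(X, number_cluster)."""
    rows = rng.choice(X.shape[0], size=number_cluster, replace=False)
    return X[rows].copy()


def assign_points(X, cores):
    """Label each point with the index of its closest core (Hamming distance)."""
    # distances[i, k] = number of features on which point i and core k differ
    distances = (X[:, None, :] != cores[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)


def update_cores(X, labels, cores):
    """Recompute each core as the per-feature most frequent value of its cluster."""
    new_cores = cores.copy()
    for k in range(cores.shape[0]):
        members = X[labels == k]
        if members.shape[0] == 0:
            continue  # assumption: an empty cluster keeps its previous core
        for j in range(X.shape[1]):
            values, counts = np.unique(members[:, j], return_counts=True)
            new_cores[k, j] = values[np.argmax(counts)]
    return new_cores


def total_cost(X, labels, cores):
    """Sum of Hamming distances from every point to the core of its cluster."""
    return int((X != cores[labels]).sum())


def k_core_clustering(X, number_cluster, replication_number, epsilon,
                      seed=0, max_iter=100):
    # seed and max_iter are illustrative extras, not part of the stated interface
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    best_cost, best_cluster, best_core = np.inf, None, None
    for _ in range(replication_number):
        cores = initial_cores_stub(X, number_cluster, rng)
        previous_cost = np.inf
        for _iteration in range(max_iter):
            labels = assign_points(X, cores)          # step 2
            cores = update_cores(X, labels, cores)    # step 3
            cost = total_cost(X, labels, cores)
            if abs(previous_cost - cost) < epsilon:   # step 4: cost stopped changing
                break
            previous_cost = cost
        if cost < best_cost:                          # keep the best replication
            best_cost, best_cluster, best_core = cost, labels, cores
    return best_cost, best_cluster, best_core


# Toy data: two groups of identical points, so a perfect clustering has zero cost
X = np.array([["A", "B", "C"]] * 4 + [["X", "Y", "Z"]] * 4)
best_cost, best_cluster, best_core = k_core_clustering(X, 2, 30, 0.01)
print(best_cost)
print(best_cluster)
print(best_core)
```

Because each assignment step and each core update can only lower the total cost or leave it unchanged, the inner loop stops after finitely many iterations; `max_iter` is only a safety guard.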
Instructions:
Two sample data sets are provided as HW6TestSet1.csv and HW6TestSet2.csv, and sample solutions
are given below. Please note that the labels representing the clusters might be a bit different from what
you see here, but the cost should be similar. Please use the template uploaded to Ninova,
read the comments in the template, and strictly follow the instructions. Do the necessary testing of your
function in a Python file other than the one in which your function is written. Use the
initial_cores(X,number_cluster) function provided in the template to make your initial cluster
assignments! Do not write your own initialization function, as it will produce different results. Also,
do not change the random number seed set in the k_core_clustering function."
