For this final project, you will use what we have learned this semester to implement a simple but popular machine learning technique known as K-means clustering (KMC). KMC is designed to solve problems similar to the one inherent in the Iris dataset. You have a number of measurements that correspond to features of a phenomenon, and you want to detect structure in this data by identifying different clusters within the data. Presumably these different clusters correspond to different types or categories of the phenomenon, in the way that different groups of measurements in the Iris dataset correspond to different types of flower. The K-means clustering algorithm essentially uses repeated averaging of subsets of the data to try to learn the different groupings present, thereby identifying distinct types within the data corresponding to these groupings.
More specifically, KMC works as follows. Start the process by pulling an arbitrary set of $K$ points from the dataset, and assume those correspond to the centers of $K$ different clusters that each correspond to a particular type/category in the data (they don't, of course, at least not initially, but that's what the learning is for). Call these centers $c_1, c_2, \dots, c_K$. The learning process updates these initial center locations as follows: 1) find all the data points that are closer to $c_1$ than to any other center, and compute the average of these data points; call this $\bar{c}_1$. Repeat this for each of the other centers, thereby calculating $\bar{c}_2, \dots, \bar{c}_K$. Then 2) replace the old $c_i$ centers with the corresponding computed $\bar{c}_i$.
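The two-step update above can be sketched as follows. This is a minimal illustration rather than the assignment's intended solution: it uses `std::vector<double>` as a stand-in for the course's Vector class, the names `Point`, `squaredDistance`, `nearestCenter`, and `updateCenters` are hypothetical, and it assumes that a center with no assigned points simply keeps its old location.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // stand-in for the course's Vector class

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double d = a[j] - b[j];
        sum += d * d;
    }
    return sum;
}

// Index of the center closest to p.
std::size_t nearestCenter(const Point& p, const std::vector<Point>& centers) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < centers.size(); ++k) {
        const double d = squaredDistance(p, centers[k]);
        if (d < bestDist) {
            bestDist = d;
            best = k;
        }
    }
    return best;
}

// One training cycle: step 1 averages the points nearest each center,
// step 2 replaces each old center with that average.
void updateCenters(const std::vector<Point>& data, std::vector<Point>& centers) {
    const std::size_t dim = centers.empty() ? 0 : centers[0].size();
    std::vector<Point> sums(centers.size(), Point(dim, 0.0));
    std::vector<std::size_t> counts(centers.size(), 0);

    for (const Point& p : data) {
        const std::size_t k = nearestCenter(p, centers);
        for (std::size_t j = 0; j < dim; ++j) sums[k][j] += p[j];
        ++counts[k];
    }
    for (std::size_t k = 0; k < centers.size(); ++k) {
        if (counts[k] == 0) continue;  // assumption: empty cluster keeps its old center
        for (std::size_t j = 0; j < dim; ++j) {
            centers[k][j] = sums[k][j] / static_cast<double>(counts[k]);
        }
    }
}
```

Comparing squared distances avoids a square root and picks the same nearest center as comparing the norms themselves.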
This averaging/updating process is repeated either for a specified number of training steps, or else until the difference between the old centers and the newly computed averages is sufficiently small (the latter means that the most recent learning cycle produced little to no change in the center locations). After this, the learned center locations can be used to classify previously unseen data points by finding the center each one is closest to.
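One way to express the "sufficiently small" test is to measure how far each center moved in the last cycle and compare the largest move against a tolerance. The sketch below assumes that reading; `maxCenterShift`, `hasConverged`, and the tolerance parameter `tol` are hypothetical names, and `std::vector<double>` again stands in for the course's Vector class.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // stand-in for the course's Vector class

// Largest Euclidean distance any single center moved between two cycles.
double maxCenterShift(const std::vector<Point>& oldCenters,
                      const std::vector<Point>& newCenters) {
    double worst = 0.0;
    for (std::size_t k = 0; k < oldCenters.size(); ++k) {
        double sum = 0.0;
        for (std::size_t j = 0; j < oldCenters[k].size(); ++j) {
            const double d = newCenters[k][j] - oldCenters[k][j];
            sum += d * d;
        }
        worst = std::max(worst, std::sqrt(sum));
    }
    return worst;
}

// True when no center moved by tol or more (tol is a user-chosen threshold).
bool hasConverged(const std::vector<Point>& oldCenters,
                  const std::vector<Point>& newCenters, double tol) {
    return maxCenterShift(oldCenters, newCenters) < tol;
}
```

A training loop could call `hasConverged` after each cycle and stop early, while still capping the total number of cycles.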
And that is it. Simple, but surprisingly effective at detecting groupings within large datasets and associating them with a unique type. Now to fill in some important details: to measure the closeness of a data point to a center, we will use the vector norm of their difference. To run the algorithm, it is usually necessary to specify:
How many centers, $K$, you want to use
The fraction of the data set you want to use for training, reserving the remainder for later testing to evaluate the quality of the predictions the trained model provides
A stopping criterion for the training
Thus, our learning function that accomplishes the training will take these three parameters as inputs; to keep things simple, we'll assume the stopping criterion is simply an integer specifying how many training cycles should be used.
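As a small illustration of the training-fraction parameter, the helper below converts a fraction into a count of training points. It is only a sketch: the name `trainingCount` is hypothetical, and it assumes the training set is the leading block of the data with the remainder held back for testing, which the assignment text does not specify.

```cpp
#include <cstddef>

// Number of data points to train on, given the fraction reserved for training.
// Assumption: points [0, count) are used for training, the rest for testing.
std::size_t trainingCount(std::size_t numdata, double fraction) {
    if (fraction <= 0.0) return 0;
    if (fraction >= 1.0) return numdata;
    // Truncates toward zero, so the training set never exceeds the request.
    return static_cast<std::size_t>(fraction * static_cast<double>(numdata));
}
```

For example, `trainingCount(8, 0.75)` yields 6, leaving 2 points for testing.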
Typically, the data is represented as (potentially high dimensional) vectors, with one vector for each data point. This representation is computationally convenient since then the training steps above correspond to simple vector arithmetic. Fortunately, we have already written C++ code that lets us do vector arithmetic the way we can in MATLAB using the Vector structure code (or my_array), and now the new extension of this to a class. This Vector class will form the foundation for our program, giving us the mathematical code needed to organize the data and implement the training algorithm.
But we will also need a class to organize all the steps, and data, of the training process! To that end, I've provided in the template kmc.cpp file a KMC class and a main function that uses this class. I leave the details of the code for the KMC class up to you, but some general thoughts follow.
General Thoughts
The constructor must allocate space for the data array, making sure that everything is properly initialized. The data will be pulled from a file specified as the input to the constructor. The first line of this file will contain a header with 2 numbers: the first specifies the number of datapoints in the file, and the second specifies the dimension of each datapoint. The file will thus have a number of rows specified by the first number in the header, and a number of columns specified by the second. Once the data array has been initialized consistent with the header information in the file, the KMC constructor can pull all the data from the file into this array, and store the number and dimension in the numdata and dim fields.
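The file layout described above could be read along these lines. This is a hedged sketch, not the template's actual constructor: `Dataset` and `readDataset` are hypothetical names, the values are assumed to be whitespace-separated, and the function takes a `std::istream` so it can be tried without a file on disk.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical container mirroring the numdata and dim fields.
struct Dataset {
    std::size_t numdata = 0;                 // number of data points (rows)
    std::size_t dim = 0;                     // dimension of each point (columns)
    std::vector<std::vector<double>> data;   // numdata x dim values
};

// Reads the two-number header, allocates the array, then fills it row by row.
// Assumption: all numbers are separated by whitespace.
Dataset readDataset(std::istream& in) {
    Dataset ds;
    if (!(in >> ds.numdata >> ds.dim)) {
        throw std::runtime_error("missing or malformed header");
    }
    ds.data.assign(ds.numdata, std::vector<double>(ds.dim, 0.0));
    for (std::size_t i = 0; i < ds.numdata; ++i) {
        for (std::size_t j = 0; j < ds.dim; ++j) {
            if (!(in >> ds.data[i][j])) {
                throw std::runtime_error("fewer values than the header promises");
            }
        }
    }
    return ds;
}
```

In a constructor, the same loops could read from a `std::ifstream` opened on the file name passed in.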
The train function will receive the three inputs discussed above, specifying how the training is to be accomplished. The first input is the number of training cycles to use, and the third is the number of centers to use in the clustering/classification analysis. Note that train will have to allocate and properly initialize the c and nc fields in the KMC class consistent with the specified number of centers and the dimension of the datapoints. For the starting values of the centers, use the first nc points
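Seeding the centers from the first nc data points might look like the sketch below. The function name `initialCenters` is hypothetical, `std::vector<double>` stands in for the course's Vector class, and the check that nc does not exceed the number of points is an added assumption rather than something the assignment states.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Point = std::vector<double>;  // stand-in for the course's Vector class

// Returns copies of the first nc data points to serve as starting centers.
std::vector<Point> initialCenters(const std::vector<Point>& data, std::size_t nc) {
    if (nc > data.size()) {
        throw std::invalid_argument("more centers requested than data points");
    }
    return std::vector<Point>(data.begin(),
                              data.begin() + static_cast<std::ptrdiff_t>(nc));
}
```

Copying the points, rather than referring to them, keeps the original data unchanged when the centers are later updated.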
