Question: For this final project, you will use what we have learned this semester to implement a simple but popular machine learning technique known as K-means clustering (KMC).

KMC is designed to solve problems similar to the one inherent in the Iris dataset. You have a number of measurements that correspond to features of a phenomenon, and you want to detect structure in this data by identifying different clusters within the data. Presumably these different clusters correspond to different types or categories of the phenomenon, in the way that different groups of measurements in the Iris dataset correspond to different types of flower. The K-means clustering algorithm essentially uses repeated averaging of subsets of the data to try to learn the different groupings present, thereby identifying distinct types within the data corresponding to these groupings.

More specifically, KMC works as follows. Start the process by pulling an arbitrary set of K points from the dataset, and assume those correspond to the centers of K different clusters that each correspond to a particular type/category in the data (they don't of course, at least not initially, but that's what the learning is for). Call these points the centers. The learning process updates these initial center locations as follows:

1) For the first center, find all the data points that are closer to it than to any other center, and compute the average of these data points; call this the new location for that center. Repeat this for each of the other centers, thereby calculating a new location for every center.

2) Then replace the old centers with the corresponding computed averages.

This averaging/updating process is repeated either for a specified number of training steps, or else until the difference between the old centers and the newly computed averages is sufficiently small (this latter condition means that the most recent learning cycle produced little to no change to the center locations). After this, the learned center locations can be used to classify previously unseen datapoints by calculating the center they are closest to. And that is it.

Simple, but surprisingly effective at detecting groupings within large datasets and associating them with a unique type. Now to fill in some important details: to measure the closeness of a data point to a center, we will use the vector norm of their difference. To run the algorithm, it is usually necessary to specify:

How many centers you want to use

The fraction of the data set you want to use for training, reserving the remainder for later testing to evaluate the quality of the predictions the trained model provides

A stopping criterion for the training

Thus, our learning function that accomplishes the training will take these three parameters as inputs; to keep things simple, we'll assume the stopping criterion is simply an integer specifying how many training cycles should be used.
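The training procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Point` is a plain `std::vector<double>` standing in for the course's Vector class, "the vector norm" is taken to mean the Euclidean norm, a center that attracts no points is left where it is, and all function names are hypothetical rather than taken from the kmc.cpp template.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // assumption: stand-in for the course's Vector class

// Euclidean norm of the difference between two points of equal dimension.
double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Index of the center closest to p; also usable to classify unseen points.
std::size_t nearestCenter(const Point& p, const std::vector<Point>& centers) {
    assert(!centers.empty());
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < centers.size(); ++k) {
        const double d = euclideanDistance(p, centers[k]);
        if (d < bestDist) {
            bestDist = d;
            best = k;
        }
    }
    return best;
}

// One training cycle: step 1 averages the points nearest each center,
// step 2 replaces the old centers with those averages.
void updateCenters(const std::vector<Point>& data, std::vector<Point>& centers) {
    assert(!centers.empty());
    const std::size_t dim = centers.front().size();
    std::vector<Point> sums(centers.size(), Point(dim, 0.0));
    std::vector<std::size_t> counts(centers.size(), 0);
    for (const Point& p : data) {
        const std::size_t k = nearestCenter(p, centers);
        for (std::size_t i = 0; i < dim; ++i) sums[k][i] += p[i];
        ++counts[k];
    }
    for (std::size_t k = 0; k < centers.size(); ++k) {
        if (counts[k] == 0) continue;  // assumption: an empty cluster keeps its old center
        for (std::size_t i = 0; i < dim; ++i) {
            centers[k][i] = sums[k][i] / static_cast<double>(counts[k]);
        }
    }
}

// Stopping criterion as assumed in the text: a fixed number of training cycles.
void train(const std::vector<Point>& data, std::vector<Point>& centers, int cycles) {
    for (int t = 0; t < cycles; ++t) updateCenters(data, centers);
}
```

Note that every point is assigned using the old centers before any center is overwritten, which matches the two-step description (compute all averages first, then replace).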

Typically, the data is represented as (potentially high dimensional) vectors, with one vector for each data point. This representation is computationally convenient since then the training steps above correspond to simple vector arithmetic. Fortunately, we have already written C++ code that lets us do vector arithmetic the way we can in MATLAB using the Vector structure code (or my_array), and now the new extension of this to a class. This Vector class will form the foundation for our program, giving us the mathematical code needed to organize the data and implement the training algorithm.
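To illustrate why vector arithmetic makes the training steps simple, here is a minimal sketch of the kind of operations involved. The `Vec` type and its operators are hypothetical stand-ins written for this example; they are not the course's Vector class, whose actual interface is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal vector type; NOT the course's Vector class.
struct Vec {
    std::vector<double> x;
};

// Element-wise sum of two vectors of equal dimension.
Vec operator+(const Vec& a, const Vec& b) {
    assert(a.x.size() == b.x.size());
    Vec r = a;
    for (std::size_t i = 0; i < r.x.size(); ++i) r.x[i] += b.x[i];
    return r;
}

// Element-wise difference of two vectors of equal dimension.
Vec operator-(const Vec& a, const Vec& b) {
    assert(a.x.size() == b.x.size());
    Vec r = a;
    for (std::size_t i = 0; i < r.x.size(); ++i) r.x[i] -= b.x[i];
    return r;
}

// Division of every component by a scalar.
Vec operator/(const Vec& a, double s) {
    Vec r = a;
    for (double& v : r.x) v /= s;
    return r;
}

// Euclidean norm; closeness of a point p to a center c is then norm(p - c).
double norm(const Vec& v) {
    double sum = 0.0;
    for (double c : v.x) sum += c * c;
    return std::sqrt(sum);
}

// Average of a non-empty set of points: add them up, then divide by the count.
Vec mean(const std::vector<Vec>& pts) {
    assert(!pts.empty());
    Vec sum = pts.front();
    for (std::size_t i = 1; i < pts.size(); ++i) sum = sum + pts[i];
    return sum / static_cast<double>(pts.size());
}
```

With operators like these, the averaging step reduces to a call such as `mean(cluster)` and the closeness test to `norm(p - c)`.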

But we will also need a class to organize all the steps, and data, of the training process! To that end, I've provided in the template kmc.cpp file a KMC class and a main function that uses this class. The details of the code for the KMC class I leave up to you, but some general thoughts follow.

General Thoughts

The constructor must allocate space for the data array, making sure that everything is properly initialized. The data will be pulled from a file specified as the input to the constructor. The first line of this file will contain a header with two numbers: the first specifies the number of datapoints in the file, and the second specifies the dimension of each datapoint. The file will thus have a number of rows specified by the first number in the header, and a number of columns specified by the second. Once the data array has been initialized consistent with the header information in the file, the KMC constructor can pull all the data from the file into this array, and store the number and dimension in the numdata and dim fields.

The train function will receive the three inputs discussed above, specifying how the training is to be accomplished. The first input is the number of training cycles to use, the second is the fraction of the dataset to use for training, and the third is the number of centers to use in the clustering/classification analysis. Note that train will have to allocate and properly initialize the c and nc fields in the KMC class consistent with the specified number of centers and the dimension of the datapoints. For the starting values of the centers, use the first nc points.
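The setup work that train has to do before the first cycle could be sketched as follows. This is a guess at one reasonable arrangement, not the template's actual code: the `KMCState` struct is a hypothetical slice of the KMC class holding only the fields the text names (`c`, `nc`) plus the data, and rounding the training fraction down is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical slice of the KMC class: only the fields named in the text.
struct KMCState {
    std::vector<std::vector<double>> data;  // numdata points of dimension dim
    std::vector<std::vector<double>> c;     // the centers
    std::size_t nc = 0;                     // number of centers
};

// Number of points used for training; the remainder is held out for testing.
// Assumption: a fractional count is rounded down.
std::size_t trainingCount(std::size_t numData, double fraction) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    return static_cast<std::size_t>(std::floor(fraction * static_cast<double>(numData)));
}

// Allocates nc centers and seeds them with copies of the first nc datapoints,
// so each center automatically has the same dimension as the data.
void initCenters(KMCState& s, std::size_t numCenters) {
    assert(numCenters > 0 && numCenters <= s.data.size());
    s.nc = numCenters;
    s.c.assign(s.data.begin(),
               s.data.begin() + static_cast<std::ptrdiff_t>(numCenters));
}
```

Copying the first nc points gives a deterministic starting configuration, which makes results reproducible from run to run.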
