Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Can someone help with me data clustering and the k means algorithm. However, I'm not able to list all of the data sets but they

Can someone help with me data clustering and the k means algorithm. However, I'm not able to list all of the data sets but they include: ecoli.txt, glass.txt, ionoshpere.txt, iris_bezdek.txt, landsat.txt, letter_recognition.txt, segmentation.txt vehicle.txt, wine.txt and yeast.txt.

image text in transcribedimage text in transcribedimage text in transcribedimage text in transcribedimage text in transcribed

Input: Your program should be non-interactive (that is, the program should not interact with the user by asking him/her explicit questions) and take the following command-line arguments: , where F: name of the data file K: number of clusters (positive integer greater than one) I: maximum number of iterations (positive integer) T: convergence threshold (non-negative real) R: number of runs (positive integer) . . . Warning: Do not hard-code any of these parameters (e.g., using the statement "int R -100;") into your program. The values of these parameters should be given by the user at run-time through the command prompt. See below for an example A run' is an execution of k-means until it converges. The first line of F contains the number of points (N) and the dimensionality of each point (D). Each of the subsequent lines contains one data point in blank separated format. In your program, you should represent the attributes of each point using a double-precision floating point data type (e.g., "double" data type in C/C++/Java). In other words, you should not use an integral data type (byte, short, char, int, long, etc.) to represent an attribute The initial cluster centers should be selected randomly from the data set. A run should be terminated when the number of iterations reaches I or the relative improvement in SSE (Sum-of-Squared Error) between two consecutive iterations drops below T, that is, whenever (SSE(t 1)- SSE(t) SSE(t-1) , where F: name of the data file K: number of clusters (positive integer greater than one) I: maximum number of iterations (positive integer) T: convergence threshold (non-negative real) R: number of runs (positive integer) . . . Warning: Do not hard-code any of these parameters (e.g., using the statement "int R -100;") into your program. The values of these parameters should be given by the user at run-time through the command prompt. See below for an example A run' is an execution of k-means until it converges. The first line of F contains the number of points (N) and the dimensionality of each point (D). Each of the subsequent lines contains one data point in blank separated format. In your program, you should represent the attributes of each point using a double-precision floating point data type (e.g., "double" data type in C/C++/Java). In other words, you should not use an integral data type (byte, short, char, int, long, etc.) to represent an attribute The initial cluster centers should be selected randomly from the data set. A run should be terminated when the number of iterations reaches I or the relative improvement in SSE (Sum-of-Squared Error) between two consecutive iterations drops below T, that is, whenever (SSE(t 1)- SSE(t) SSE(t-1)

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Students also viewed these Databases questions

Question

outline some of the current issues facing HR managers

Answered: 1 week ago