Question

1 Approved Answer

Posted on Sep 21, 2024

Can someone help with me data clustering and the k means algorithm. However, I'm not able to list all of the data sets but they

Can someone help with me data clustering and the k means algorithm. However, I'm not able to list all of the data sets but they include: ecoli.txt, glass.txt, ionoshpere.txt, iris_bezdek.txt, landsat.txt, letter_recognition.txt, segmentation.txt vehicle.txt, wine.txt and yeast.txt.

image text in transcribed

Input: Your program should be non-interactive (that is, the program should not interact with the user by asking him/her explicit questions) and take the following command-line arguments: , where F: name of the data file K: number of clusters (positive integer greater than one) I: maximum number of iterations (positive integer) T: convergence threshold (non-negative real) R: number of runs (positive integer) . . . Warning: Do not hard-code any of these parameters (e.g., using the statement "int R -100;") into your program. The values of these parameters should be given by the user at run-time through the command prompt. See below for an example A run' is an execution of k-means until it converges. The first line of F contains the number of points (N) and the dimensionality of each point (D). Each of the subsequent lines contains one data point in blank separated format. In your program, you should represent the attributes of each point using a double-precision floating point data type (e.g., "double" data type in C/C++/Java). In other words, you should not use an integral data type (byte, short, char, int, long, etc.) to represent an attribute The initial cluster centers should be selected randomly from the data set. A run should be terminated when the number of iterations reaches I or the relative improvement in SSE (Sum-of-Squared Error) between two consecutive iterations drops below T, that is, whenever (SSE(t 1)- SSE(t) SSE(t-1) , where F: name of the data file K: number of clusters (positive integer greater than one) I: maximum number of iterations (positive integer) T: convergence threshold (non-negative real) R: number of runs (positive integer) . . . Warning: Do not hard-code any of these parameters (e.g., using the statement "int R -100;") into your program. The values of these parameters should be given by the user at run-time through the command prompt. See below for an example A run' is an execution of k-means until it converges. The first line of F contains the number of points (N) and the dimensionality of each point (D). Each of the subsequent lines contains one data point in blank separated format. In your program, you should represent the attributes of each point using a double-precision floating point data type (e.g., "double" data type in C/C++/Java). In other words, you should not use an integral data type (byte, short, char, int, long, etc.) to represent an attribute The initial cluster centers should be selected randomly from the data set. A run should be terminated when the number of iterations reaches I or the relative improvement in SSE (Sum-of-Squared Error) between two consecutive iterations drops below T, that is, whenever (SSE(t 1)- SSE(t) SSE(t-1)