Question
In this assignment, you are asked to implement the k-means algorithm based on the following pseudocode. STEP0: Randomly create initial centers While (total distance is
In this assignment, you are asked to implement the k-means algorithm based on the following pseudocode.
STEP0: Randomly create initial centers
While (total distance is updated){
Step1: Assign each data point to the closest center (cluster)
Step2: Update cluster centers
Step2a: clean up the arrays for the current centers
Step2b: calculate the new centers Step3: calculate current clusters total distance
Step4: determine if the total distance is updated
}
Two java source codes are provided with the assignment.
(1) kmeans.java: the core java code for k-means algorithm. This code is incomplete and you need to fill out two parts: Step1 and Step2b. All other steps are complete and you dont need to modify or update.
(2) KmeansRun.java: it reads data from text file and executes the k-means algorithm. You dont need to change this code. Also, you dont need to submit this code.
kmeans.java:
import java.util.Arrays;
public class kmeans {
double[][] centers; // center of the clusters, centers[c][0],
centers[c][1], ..., centers[c][numColumns] are the centers of Cluster c
int numClusters; // number of clusters
double[][] data;
int numRows; // number of rows
int numColumns; // number of columns
int[] clustersInfo; // cluster information for each data point. if
clusterInfo[i] == c, then data point i belongs to cluster c
int kmeans_num_iter; // count the current iteration number of k-
means algorithm
final double Epsilon = 0.0001; // tolerance for comparing two double
numbers
public void kmeans_algorithm(int number_of_clusters, double[][]
data) {
kmeans_num_iter = 0;
numClusters = number_of_clusters;
this.data = data;
numColumns = data[0].length;
numRows = data.length;
centers = new double[numClusters][numColumns];
clustersInfo = new int[numRows];
//Step0: Initial Centers are set: the first numClusters
data points
for(int c=0; c centers[c] = Arrays.copyOf(data[c], data[c].length); } double Current_totalDist = Double.MAX_VALUE; double Prev_totalDist = Double.MAX_VALUE; boolean continueAlgorithm = true; while(continueAlgorithm) { kmeans_num_iter++; // Step1: Assign to the closest centers. Use calculateDistance method to calculate the distance. .... // Step2: Update cluster centers // Step2a: first initializing the current array for centers for(int c=0; c Arrays.fill(centers[c], 0); // Delete the current values } // Step2b: Write code to calculate the new centers .... // Step3: Update total distance Prev_totalDist = Current_totalDist; // before updating the total distance, save it as prev_totalDist Current_totalDist = calculateTotalDistance(); System.out.println("Iteration " + kmeans_num_iter + ": total distance = " + Current_totalDist); // Step4: terminate algorithm if the improvement in total distance between two consecutive iterations is less than Epsilon if(Prev_totalDist - Current_totalDist < Epsilon) { continueAlgorithm = false; } } System.out.println(); } // Calculating the distance between two data points. You can assume array a is the data point and array b is the center. public double calculateDistance(double[] a, double[] b) { double dist = 0; for(int j=0; j dist += Math.pow(a[j]-b[j], 2); } return dist; } // Returns the total distance of your current clusters solution. public double calculateTotalDistance() { double totalDist = 0; for(int i=0; i totalDist += calculateDistance(data[i],centers[clustersInfo[i]]); } return totalDist; } } KmeanRun.java: import java.io.*; import java.util.*; /** * Code for Extra Credit Assignment * @author */ public class KmeansRun { public static void main(String[] args) throws FileNotFoundException { // TODO Auto-generated method stub String filename = "customers_std.txt"; double[][] data = new double[100][4]; int cnt = 0; Scanner inFile = new Scanner(new File(filename)); while (inFile.hasNextLine()) { String str = inFile.nextLine(); Scanner lineScanner = new Scanner(str); double age = Double.parseDouble(lineScanner.next()); double salary = Double.parseDouble(lineScanner.next()); double married = Double.parseDouble(lineScanner.next()); double numVisit = Double.parseDouble(lineScanner.next()); data[cnt][0] = age; data[cnt][1] = salary; data[cnt][2] = married; data[cnt][3] = numVisit; cnt++; lineScanner.close(); } inFile.close(); int numClusters = 4; kmeans km = new kmeans(); km.kmeans_algorithm(numClusters, data); System.out.println("Market Segmentation Analysis by Kmeans: Your Output"); for(int c = 0; c System.out.print("Cluster " + c + ":"); System.out.print("age = " + Math.round(km.centers[c][0]*14.95 + 45.74) + ", "); System.out.print("salary = " + Math.round(km.centers[c][1]*65032 + 173750) + ", "); System.out.print("married = " + Math.round(km.centers[c][2]*0.423 + 0.77) + ", "); System.out.println("numVisit = " + Math.round(km.centers[c][3]*14.95 + 45.74)); } System.out.println(); System.out.println("Expected Output"); System.out.println("Cluster 0:age = 62, salary = 204600, married = 1, numVisit = 65"); System.out.println("Cluster 1:age = 39, salary = 167492, married = 1, numVisit = 36"); System.out.println("Cluster 2:age = 60, salary = 213420, married = 0, numVisit = 33"); System.out.println("Cluster 3:age = 29, salary = 123603, married = 0, numVisit = 39"); } } To help you test your algorithm, I provided a sample dataset customer_std.txt. This dataset contains customer information about a store. There are total 100 customers, where each row is for a customer. There are four columns. The first column is the age of the customer, the second column is the salary of the customer, the third column indicates if the customer is married (1 = married, 0 = single), and the four column is the number of visits to the store. Because the columns were standardized to have zero mean and one standard deviation, you wont be able to interpret directly looking at the data. To obtain the full points credits, your code should be able to generate correct output for a new data file. You can assume the format will remain the same: four columns in the same order; each row is for a customer. However, the data matrix and the number of customers (rows) can be different for the new data file. customer_std.txt: 0.82 1.42 0.54 0.88 - 1.59 -0.41 0.54 0.02 1.29 1.39 -1.82 -0.59 - 1.05 0.77 -1.82 -0.07 1.62 -0.46 0.54 0.88 0.82 1.79 -1.82 -1.11 - 1.05 0.87 0.54 0.36 0.89 -0.27 0.54 1.05 0.75 0.96 0.54 1.48 - 0.38 -0.07 0.54 -1.11 - 0.45 0.55 0.54 -0.68 0.02 0.79 0.54 -1.11 - 0.25 1.26 0.54 -1.11 - 1.12 -0.97 -1.82 0.19 1.22 -0.30 0.54 1.23 - 0.52 -0.68 0.54 -0.68 - 0.25 -0.60 0.54 -0.68 0.62 0.10 0.54 1.23 - 1.25 -0.30 0.54 0.10 0.95 -1.05 0.54 1.48 0.55 -1.16 0.54 -0.85 1.15 2.27 0.54 1.92 - 1.05 -1.27 0.54 -0.85 - 0.45 -0.09 -1.82 -0.50 - 1.25 -1.77 -1.82 -0.50 0.02 -1.48 0.54 -0.94 - 0.05 0.82 0.54 -1.19 - 1.59 -0.82 0.54 -0.42 1.62 -1.01 -1.82 -1.19 1.49 -0.10 0.54 1.40 0.75 0.62 0.54 1.40 0.28 -0.18 0.54 -1.11 - 0.65 -0.73 0.54 -1.11 0.15 -1.29 0.54 -0.94 1.49 2.37 0.54 2.52 0.82 0.62 0.54 1.40 1.49 1.46 0.54 0.97 - 0.25 1.09 0.54 -1.02 - 0.72 -1.39 -1.82 -1.02 0.55 0.37 0.54 1.57 0.62 -0.89 0.54 1.40 0.02 -1.20 0.54 -0.94 0.95 -1.22 0.54 -0.94 - 0.12 -1.51 0.54 -0.94 - 1.25 -0.82 0.54 -0.33 - 1.12 -1.25 -1.82 -1.02 0.82 -0.27 0.54 0.88 - 0.25 0.01 0.54 -0.94 - 1.39 -1.27 -1.82 -1.11 (there are a lot more numbers, but here is a sneak peak what the file looks like)
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started