Question


Posted on Sep 23, 2024


Part II. Clustering Analysis in R

Download the "utility.csv" data file and the R file from Google Drive.

Link to the files: https://drive.google.com/open?id=1dZSlyTUIGlWsKoCENDIrCdovcRFs7xx8

# Import it into R
# Write your code below:
utility <- read.csv()
#---------------------------------------------------------------------------------#
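A minimal sketch of the import step. It assumes the downloaded file is a plain comma-separated file with a header row. So that the sketch runs on its own, it writes a tiny placeholder file to a temporary folder first; the placeholder values and the `csv_path` name are hypothetical and are not taken from the real dataset.

```r
# Placeholder file so the sketch is runnable; with the real data, point
# csv_path at the downloaded "utility.csv" instead.
csv_path <- file.path(tempdir(), "utility.csv")
placeholder <- data.frame(
  name = c("Utility_A", "Utility_B"),  # hypothetical values
  id   = 1:2,
  x1   = c(1.1, 1.2),
  x2   = c(9.2, 10.3)
)
write.csv(placeholder, csv_path, row.names = FALSE)

# Read the CSV into a data frame and inspect its structure
utility <- read.csv(csv_path, stringsAsFactors = FALSE)
str(utility)
```

With the real file in the working directory, the call would likely reduce to `read.csv("utility.csv")`.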

# Important README: the utility dataset contains information on 22 utilities. The first
# two columns are the name and ID of each utility, and the other columns (x1 - x8) are numeric
# characteristics of each utility.

# 1. The goal is to cluster the 22 utilities in this dataset. Given this goal,
# should we include all the variables in clustering analysis, or are some of them
# unnecessary? Why?
# Answer the above question. And if you think some variables are unnecessary,
# write code below to select only the necessary columns. If you think all variables
# are necessary for clustering, you don't need to write any code.
utility_selected <-
#---------------------------------------------------------------------------------#
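One possible reading of question 1: the name and ID columns are identifiers rather than characteristics, so they carry no information about similarity and could be dropped. The sketch below assumes, as the README states, that they are the first two columns. The `utility` data frame here is random placeholder data with the same layout, not the real file.

```r
# Placeholder data with the documented layout: name, ID, then x1..x8
set.seed(1)
utility <- data.frame(
  name = paste0("Utility_", 1:22),
  id   = 1:22,
  matrix(rnorm(22 * 8), nrow = 22, dimnames = list(NULL, paste0("x", 1:8)))
)

# Drop the first two (identifier) columns, keep the numeric characteristics
utility_selected <- utility[, -(1:2)]
names(utility_selected)
```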

# 2. Is there any need to normalize the data? Why or why not?
# If you think normalization is needed, write your code below to do normalization on
# the selected columns. Feel free to re-use the code from the in-class exercise.
# If you think normalization is not needed, you don't need to write any code.

utility_normalized <-
#---------------------------------------------------------------------------------#
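Distance-based clustering is sensitive to units: a variable measured in thousands would dominate one measured in fractions. A hedged sketch of one common remedy, z-score standardization with base R's `scale()`, follows. The in-class exercise may use a different method (for example min-max scaling), and the two columns below are made-up placeholders chosen only to show very different scales.

```r
# Placeholder columns on very different scales (values are made up)
set.seed(1)
utility_selected <- data.frame(
  x1 = rnorm(22, mean = 1.1, sd = 0.2),
  x6 = rnorm(22, mean = 9000, sd = 3500)
)

# Center each column to mean 0 and rescale to standard deviation 1
utility_normalized <- as.data.frame(scale(utility_selected))

round(colMeans(utility_normalized), 10)  # approximately 0 for every column
apply(utility_normalized, 2, sd)         # 1 for every column
```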

# 3. Get distance matrix. Use Manhattan distance
# Write your code below:
distance_matrix <-
#---------------------------------------------------------------------------------#
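Manhattan distance is the sum of absolute coordinate differences, which base R computes with `dist(..., method = "manhattan")`. The three-row data frame below is a hand-checkable placeholder, not the utility data: the rows are (0, 0), (1, 2) and (3, 1).

```r
# Tiny placeholder input so the distances can be verified by hand
utility_normalized <- data.frame(x1 = c(0, 1, 3), x2 = c(0, 2, 1))

distance_matrix <- dist(utility_normalized, method = "manhattan")

# Rows 1-2: |0-1| + |0-2| = 3; rows 1-3: |0-3| + |0-1| = 4; rows 2-3: |1-3| + |2-1| = 3
as.matrix(distance_matrix)
```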

# 4. Apply Hierarchical Clustering
# Use Ward method to measure distance between clusters
# Write your code below:
hierarchical <-
#---------------------------------------------------------------------------------#
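A sketch of the hierarchical step using `hclust()` on random placeholder data. Base R offers two Ward variants, `"ward.D"` and `"ward.D2"`; which one the course expects is an assumption here, and `"ward.D2"` is used only for illustration. Ward's criterion is defined for Euclidean distances, so combining it with Manhattan distance is best read as a heuristic.

```r
# Placeholder data: 22 rows, 8 numeric columns (random, not the real file)
set.seed(1)
toy <- scale(matrix(rnorm(22 * 8), nrow = 22))
distance_matrix <- dist(toy, method = "manhattan")

# Agglomerative clustering with a Ward linkage
hierarchical <- hclust(distance_matrix, method = "ward.D2")
hierarchical
```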

# 4.a. Plot the dendrogram. No need to specify the "labels" parameter
# How many clusters do you think is appropriate?
# Write your code below:

#---------------------------------------------------------------------------------#
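For 4.a, a sketch of plotting the dendrogram and inspecting the merge heights on random placeholder data. A large jump between consecutive merge heights is one informal cue for where to cut the tree; it is a heuristic and not the only valid criterion. The Ward variant `"ward.D2"` is an assumption.

```r
# Placeholder data and clustering (random, not the real file)
set.seed(1)
toy <- scale(matrix(rnorm(22 * 8), nrow = 22))
hierarchical <- hclust(dist(toy, method = "manhattan"), method = "ward.D2")

# Dendrogram; tall vertical gaps indicate well-separated groups
plot(hierarchical)

# The five largest merge heights, from the final merge backwards
rev(tail(hierarchical$height, 5))
```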

# 4.b. Based on your answer to the last question, mark the cluster solution on the dendrogram
# That is, if you think there are X clusters, then mark the X-cluster solution
# Write your code below:

#---------------------------------------------------------------------------------#
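For 4.b, `rect.hclust()` draws boxes around the clusters of a chosen solution on an already plotted dendrogram, and `cutree()` returns the matching memberships. In the sketch below the data are random placeholders and `k <- 3` is a hypothetical choice, not the answer for the utility data.

```r
# Placeholder data and clustering (random, not the real file)
set.seed(1)
toy <- scale(matrix(rnorm(22 * 8), nrow = 22))
hierarchical <- hclust(dist(toy, method = "manhattan"), method = "ward.D2")

k <- 3  # hypothetical number of clusters; replace with your own answer

# rect.hclust() needs the dendrogram to be plotted first
plot(hierarchical)
rect.hclust(hierarchical, k = k, border = "red")

# Cluster membership of each observation for the same k
clusters <- cutree(hierarchical, k = k)
table(clusters)
```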

# 5. Now, apply K-Means clustering

# 5.a. What is the most natural number of clusters in this data?
# To answer this question, plot the WSSE curve as we did in the in-class exercise,
# then explain how you find the most natural number of clusters.
# Feel free to re-use the code from the in-class exercise.
# Write your code below:

#---------------------------------------------------------------------------------#
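The WSSE (within-cluster sum of squared errors) curve can be sketched by running `kmeans()` for a range of k and recording `tot.withinss`. The "elbow", where adding another cluster stops reducing WSSE by much, is the usual informal pick. The data below are random placeholders, and `max_k` and `nstart = 25` are illustrative choices; the in-class code may differ.

```r
# Placeholder data: 22 rows, 8 standardized columns (random, not the real file)
set.seed(1)
toy <- scale(matrix(rnorm(22 * 8), nrow = 22))

max_k <- 8                 # illustrative upper bound on k
wsse <- numeric(max_k)
for (k in 1:max_k) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best fit
  wsse[k] <- kmeans(toy, centers = k, nstart = 25)$tot.withinss
}

plot(1:max_k, wsse, type = "b",
     xlab = "Number of clusters k", ylab = "WSSE")
```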

# 5.b. Use the cluster number that you used in the last question
# Write your code below:
kcluster <-
#---------------------------------------------------------------------------------#
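A sketch of fitting the final k-means model. `centers = 3` is a hypothetical value standing in for whatever the WSSE curve suggests, and the data are random placeholders. Setting a seed makes the random initialization reproducible.

```r
# Placeholder data: 22 rows, 8 standardized columns (random, not the real file)
set.seed(123)
toy <- scale(matrix(rnorm(22 * 8), nrow = 22))

# centers = 3 is hypothetical; use the k chosen from the WSSE curve
kcluster <- kmeans(toy, centers = 3, nstart = 25)

kcluster$cluster  # cluster assignment of each row
kcluster$size     # number of rows in each cluster
```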

# 5.c. Report cluster centroids, and interpret each cluster in your own words
# Note that you do not have to differentiate the clusters on every single variable.
# Rather, try to describe each cluster by its most distinguishable characteristics.
# It is useful to know the meaning of each variable; here they are:
# x1: Fixed-charge covering ratio (income/debt)
# x2: Rate of return on capital
# x3: Cost per KW capacity in place
# x4: Annual Load Factor
# x5: Peak KWH demand growth from 1974 to 1975
# x6: Sales (KWH use per year)
# x7: Percent Nuclear
# x8: Total fuel costs (cents per KWH)
# Write your code below:

#---------------------------------------------------------------------------------#
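A sketch of reporting centroids. `kcluster$centers` gives centroids on the scale the model was fitted on; if the data were normalized, averaging the original columns per cluster with `aggregate()` can make interpretation easier because the values are back in real units. The data are random placeholders, `centers = 3` is hypothetical, and `centroids_original` is an illustrative name.

```r
# Placeholder data with columns x1..x8 (random, not the real file)
set.seed(1)
utility_selected <- as.data.frame(
  matrix(rnorm(22 * 8), nrow = 22, dimnames = list(NULL, paste0("x", 1:8)))
)
utility_normalized <- scale(utility_selected)
kcluster <- kmeans(utility_normalized, centers = 3, nstart = 25)

# Centroids on the normalized scale (one row per cluster)
kcluster$centers

# Per-cluster means of the original, un-normalized variables
centroids_original <- aggregate(utility_selected,
                                by = list(cluster = kcluster$cluster),
                                FUN = mean)
centroids_original
```

When interpreting, it usually suffices to pick, for each cluster, the few variables whose centroid values sit furthest from the overall average.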