Question
For this question you will need the data set below, and you will need to install the package ape in R. The data come from a four-component Gaussian mixture on the plane. We want to see how well we can recover the four clusters using hierarchical clustering methods as well as K-means. The basic commands to plot the dendrogram using single or complete linkage are:
library(ape)
d = dist(data from file)  # or dist(scale(data from file)) if the data are first scaled
clust = hclust(d, "single")  # or hclust(d, "complete")
plot(clust, main = "put title here", hang = -1, cex = .8, xlab = "", ylab = "", sub = "", axes = FALSE)
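The commands above can be sketched as a runnable script that draws all four scaling and linkage combinations on one page. This is only an illustration under stated assumptions: `dat` holds a subset of the data set (two points per group) to keep the example short, the names `inputs`, `linkages`, and `clusters` are hypothetical, and `cutree` is used as one possible way to read four clusters off each tree. The sketch calls only `dist`, `hclust`, and `cutree` from base R's stats package, so it does not itself load ape.

```r
# Illustrative subset of the data set: two points per group (assumption,
# chosen only to keep the sketch short).
dat <- rbind(
  A1 = c(0.1374, -0.9271),  A2 = c(0.3006, -0.5703),
  B1 = c(1.2501, 3.5413),   B2 = c(3.9105, 3.3893),
  C1 = c(10.1454, -1.1645), C2 = c(10.0565, 0.4510),
  D1 = c(-10.0000, 5.0000), D2 = c(-9.200, 4.567)
)
colnames(dat) <- c("X", "Y")

# One dendrogram per (scaling, linkage) combination.
inputs <- list(scaled = scale(dat), unscaled = dat)
linkages <- c("single", "complete")
clusters <- list()

op <- par(mfrow = c(2, 2))  # 2 x 2 grid: four plots on one page
for (s in names(inputs)) {
  for (m in linkages) {
    clust <- hclust(dist(inputs[[s]]), method = m)
    plot(clust, main = paste(m, "linkage,", s, "data"), hang = -1, cex = 0.8,
         xlab = "", ylab = "", sub = "", axes = FALSE)
    # Cut the tree into four groups to list the cluster memberships.
    clusters[[paste(s, m)]] <- cutree(clust, k = 4)
  }
}
par(op)  # restore the previous graphics settings

clusters[["unscaled single"]]
```

Each element of `clusters` is a named integer vector giving the cluster index of every observation, which can be compared against the group letter in the row names.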
(a) Make a scatter plot of the data and identify the four clusters.
(b) On one page, put the four dendrograms corresponding to: single linkage with scaled data, complete linkage with scaled data, single linkage with unscaled data, and complete linkage with unscaled data. Use the title of each dendrogram to say which plot it is. What four clusters do you get from each dendrogram?
(c) Perturb the data matrix by adding zero-mean Gaussian noise to each of the columns. To column j, add noise with variance 0.1*(sample variance of column j). On one page, put the four dendrograms corresponding to: single linkage with scaled data without noise, complete linkage with scaled data without noise, single linkage with scaled noisy data, and complete linkage with scaled noisy data. Use the title of each dendrogram to say which plot it is. What four clusters do you get from each dendrogram?

To use K-means, the basic command is:
cl = kmeans(data, centers = number of clusters)
All the information you need is in the object cl.
(a) Run K-means with four clusters three times and identify the clusters you get each time. You may get different answers; why?
(b) Repeat (a) using scaled data.
(c) Plot the ratio of the between-cluster sum of squares to the total sum of squares (that is, cl$betweenss/cl$totss) as a function of the number of clusters. Interpret the results.
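Two of the steps above, the column-wise noise perturbation and the betweenss/totss curve, can be sketched as small helper functions. This is a minimal sketch, not a reference solution: the function names `add_noise` and `bss_ratio`, the argument `frac`, the fixed seed, and the range 1 to 6 for the number of clusters are all assumptions made for illustration, and `dat` is again a two-points-per-group subset of the data set.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Add zero-mean Gaussian noise to each column, with variance
# frac * (sample variance of that column).
add_noise <- function(x, frac = 0.1) {
  x <- as.matrix(x)
  for (j in seq_len(ncol(x))) {
    x[, j] <- x[, j] + rnorm(nrow(x), mean = 0, sd = sqrt(frac * var(x[, j])))
  }
  x
}

# Ratio betweenss / totss from kmeans() for each number of clusters in ks.
bss_ratio <- function(x, ks, nstart = 1) {
  sapply(ks, function(k) {
    cl <- kmeans(x, centers = k, nstart = nstart)
    cl$betweenss / cl$totss
  })
}

# Illustrative subset of the data set: two points per group (assumption).
dat <- rbind(
  A1 = c(0.1374, -0.9271),  A2 = c(0.3006, -0.5703),
  B1 = c(1.2501, 3.5413),   B2 = c(3.9105, 3.3893),
  C1 = c(10.1454, -1.1645), C2 = c(10.0565, 0.4510),
  D1 = c(-10.0000, 5.0000), D2 = c(-9.200, 4.567)
)

noisy <- add_noise(dat)      # perturbed copy of the data matrix
ks <- 1:6                    # candidate numbers of clusters
ratio <- bss_ratio(dat, ks)  # one ratio per value in ks

plot(ks, ratio, type = "b",
     xlab = "Number of clusters", ylab = "betweenss / totss")
```

The perturbed matrix `noisy` can be passed through `scale`, `dist`, and `hclust` in the same way as the original data, and `bss_ratio(scale(dat), ks)` gives the corresponding curve for scaled data.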
dataset:
G     X          Y
A1    0.1374    -0.9271
A2    0.3006    -0.5703
A3    0.0462    -0.5467
A4    0.8649    -0.2168
A5   -0.3043    -0.0842
A6   -0.3685    -0.1093
B1    1.2501     3.5413
B2    3.9105     3.3893
B3    3.8671     3.7512
B4    2.9201     4.7783
B5    3.8985     4.2231
B6    3.1837     1.7167
C1   10.1454    -1.1645
C2   10.0565     0.4510
C3   10.2200    -0.9178
C4   10.0508     0.0334
C5   11.3937     0.0177
C6    9.4167     1.1136
D1  -10.0000     5.0000
D2   -9.200      4.567
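One way to get the table above into R and draw the scatter plot for part (a) is sketched below. It assumes the rows are pasted inline through the `text` argument of `read.table`; in practice the data would more likely be read from a file, whose name is not given here. The variable names `dat` and `group` are illustrative.

```r
# Data set exactly as listed above, read from inline text (assumption:
# a real analysis would read it from a file with header = TRUE instead).
dat <- read.table(header = TRUE, row.names = 1, text = "
G X Y
A1 0.1374 -0.9271
A2 0.3006 -0.5703
A3 0.0462 -0.5467
A4 0.8649 -0.2168
A5 -0.3043 -0.0842
A6 -0.3685 -0.1093
B1 1.2501 3.5413
B2 3.9105 3.3893
B3 3.8671 3.7512
B4 2.9201 4.7783
B5 3.8985 4.2231
B6 3.1837 1.7167
C1 10.1454 -1.1645
C2 10.0565 0.4510
C3 10.2200 -0.9178
C4 10.0508 0.0334
C5 11.3937 0.0177
C6 9.4167 1.1136
D1 -10.0000 5.0000
D2 -9.200 4.567
")

# Scatter plot with each point drawn as its row label (A1, ..., D2).
plot(dat$X, dat$Y, type = "n", xlab = "X", ylab = "Y",
     main = "Scatter plot of the data")
text(dat$X, dat$Y, labels = rownames(dat), cex = 0.8)

# Group letter (A, B, C, D) taken from the first character of each label.
group <- substr(rownames(dat), 1, 1)
table(group)
```

As listed, groups A, B, and C have six rows each and group D has two, so `dat` has 20 rows and the columns `X` and `Y`.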