Question

1 Approved Answer

Posted on Jun 24, 2024

4) Since there is no natural choice for the number of clusters here, let's choose k=3. Run kmeans on KROGER.SCALED with centers=3, iter.max=30, and nstart=25, left-arrowing the results in KMEANS.

a. Print to the screen the contents of round(KMEANS$centers,digits=2) to see the locations of the cluster centers and table(KMEANS$cluster) to get a frequency table of how many individuals are in each cluster. Note: the labels 1/2/3 of the cluster identities may change every time you run kmeans, but you should always end up with the same values for the cluster centers.

KMEANS <- kmeans(KROGER.SCALED, centers = 3, iter.max = 30, nstart = 25)

round(KMEANS$centers, digits = 2)

##   ALCOHOL  BABY COOKING DRINKS FRUITVEG GRAIN HEALTH HOUSEHOLD  MEAT OTHER   PET PREPARED SNACKS
## 1    0.45  0.43    0.66   0.69     0.59  0.63   0.55      0.62  0.58  0.60  0.47     0.65   0.67
## 2   -1.04 -0.57   -1.97  -1.74    -1.86 -1.89  -1.55     -1.81 -1.81 -1.73 -0.85    -1.86  -1.77
## 3   -0.26 -0.34   -0.28  -0.35    -0.22 -0.26  -0.25     -0.26 -0.23 -0.27 -0.32    -0.28  -0.33

table(KMEANS$cluster)

##
##   1   2   3
## 938 194 868
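The note in part (a) about cluster labels changing between runs can be checked on made-up data. The following is a minimal sketch, not part of the assignment: TOY, RUN1, RUN2, and sort_centers are hypothetical names, and the data are three artificial, well-separated blobs rather than the Kroger data.

```r
# Toy data (an assumption, not the Kroger data): three well-separated 2-D blobs.
set.seed(1)
TOY <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 10), rnorm(50, 20)),
  y = c(rnorm(50, 0), rnorm(50, 10), rnorm(50, 20))
)

# Two independent runs, each keeping the best of 25 random starts.
RUN1 <- kmeans(TOY, centers = 3, iter.max = 30, nstart = 25)
RUN2 <- kmeans(TOY, centers = 3, iter.max = 30, nstart = 25)

# Cluster numbers can come out in a different order, so sort the centers
# by their x coordinate before comparing the two runs.
sort_centers <- function(fit) {
  centers <- fit$centers[order(fit$centers[, "x"]), , drop = FALSE]
  unname(centers)
}
all.equal(sort_centers(RUN1), sort_centers(RUN2))
table(RUN1$cluster)
```

For data this cleanly separated, the sorted centers should agree across runs even when the 1/2/3 labels do not. On real data the agreement depends on nstart being large enough to find the same best solution each time.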

b. Kroger is interested in using the cluster identities to aid in identifying segments where customized offers could be designed (e.g., people who cook, people with pets, people with babies, etc.). While the clustering scheme you just found is valid from a technical, algorithmic point of view, the end result is not very interesting, and definitely not useful for Kroger's application. Determine how the clusters differ from one another, then explain why Kroger would not find this clustering scheme useful. This is by far the most important question on this homework.
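One possible way to start on part (b), offered as a sketch rather than the graded response, is to check whether each cluster sits on the same side of average in every category. The centers below are copied from the output in part (a); the names categories, centers, and same_sign are introduced here for illustration only.

```r
# Cluster centers copied from the output in part (a).
categories <- c("ALCOHOL", "BABY", "COOKING", "DRINKS", "FRUITVEG", "GRAIN",
                "HEALTH", "HOUSEHOLD", "MEAT", "OTHER", "PET", "PREPARED", "SNACKS")
centers <- rbind(
  c( 0.45,  0.43,  0.66,  0.69,  0.59,  0.63,  0.55,  0.62,  0.58,  0.60,  0.47,  0.65,  0.67),
  c(-1.04, -0.57, -1.97, -1.74, -1.86, -1.89, -1.55, -1.81, -1.81, -1.73, -0.85, -1.86, -1.77),
  c(-0.26, -0.34, -0.28, -0.35, -0.22, -0.26, -0.25, -0.26, -0.23, -0.27, -0.32, -0.28, -0.33)
)
dimnames(centers) <- list(1:3, categories)

# Does every category share the same sign within a cluster?
same_sign <- apply(centers, 1, function(row) length(unique(sign(row))) == 1)
same_sign

# Smallest and largest center value in each cluster across the 13 categories.
t(apply(centers, 1, range))
```

If every category has the same sign within a cluster, the clusters mostly separate households by overall spending level rather than by what they buy, which is worth weighing against Kroger's goal of category-specific offers.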

Response:

Grading: (3 pts total)

5) For targeted advertising, it probably makes more sense to cluster on the fraction of the total money spent by the customer on each of the categories (instead of the raw amount). If we find a segment that spends a much larger fraction of their shopping budget on baby items, we can target them with baby-specific promotions, etc.

Copy KROGER (whose contents shouldn't have been modified since the data was read in) into a data frame called FRACTION. Then, code a for loop that defines the values in row i of FRACTION to be the fractional amounts of the values in the ith row of KROGER. For example, if x is a vector of the 13 dollar amounts, then x/sum(x) would be a vector giving these 13 fractional amounts.
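The row-by-row conversion described above might look like the following sketch. KROGER.TOY and FRACTION.TOY are made-up stand-ins with 3 households and 4 categories, since the real KROGER data frame is not reproduced here.

```r
# Toy stand-in for KROGER (an assumption): 3 households, 4 spending categories.
KROGER.TOY <- data.frame(
  ALCOHOL = c(10, 0, 5),
  BABY    = c(0, 40, 5),
  COOKING = c(30, 40, 5),
  OTHER   = c(60, 20, 5)
)

FRACTION.TOY <- KROGER.TOY
for (i in seq_len(nrow(KROGER.TOY))) {
  x <- KROGER.TOY[i, ]               # dollar amounts for household i (a one-row data frame)
  FRACTION.TOY[i, ] <- x / sum(x)    # each amount divided by that household's total
}
FRACTION.TOY
summary(apply(FRACTION.TOY, 1, sum)) # every row total should be 1
```

The same pattern should carry over to the full data by replacing KROGER.TOY with KROGER and FRACTION.TOY with FRACTION.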

Verify that the sum of each row of FRACTION is 1 (i.e., print to the screen the result of running summary(apply(FRACTION,1,sum)), which translated into English means "summarize the row totals of each row of the FRACTION dataframe"), then NULL out the OTHER column from FRACTION (one of the 13 columns is now redundant since the values in a row add to 1, so might as well get rid of the least interesting one), then left-arrow scale_dataframe( log10(FRACTION+0.01) ) into FRACTION.SCALED and provide a summary of FRACTION.SCALED$COOKING.

# Sanity check for number of columns of FRACTION.SCALED to make sure OTHER is nulled out

# dim(FRACTION.SCALED)

# 2000 12

# Sanity check for summary of COOKING column of FRACTION.SCALED

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -4.4567 -0.4174  0.1612  0.0000  0.6063  3.3074
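The remaining steps of problem 5 (dropping OTHER, taking logs, and scaling) can be sketched on toy fractions. This is an illustration under assumptions: FRACTION.TOY is made up, and base R scale() is swapped in for the course's scale_dataframe(), on the assumption that scale_dataframe() standardizes each numeric column (which would match the Mean of 0 in the sanity check).

```r
# Toy fractions (an assumption): each row sums to 1.
FRACTION.TOY <- data.frame(
  ALCOHOL = c(0.10, 0.00, 0.25, 0.05),
  BABY    = c(0.00, 0.40, 0.25, 0.15),
  COOKING = c(0.30, 0.40, 0.25, 0.50),
  OTHER   = c(0.60, 0.20, 0.25, 0.30)
)
FRACTION.TOY$OTHER <- NULL           # drop the redundant column

# log10(0) is -Inf, which is why 0.01 is added before taking logs.
LOGGED <- log10(FRACTION.TOY + 0.01)

# Stand-in for scale_dataframe(): base R scale() centers each column at 0
# and rescales it to standard deviation 1.
FRACTION.TOY.SCALED <- as.data.frame(scale(LOGGED))

dim(FRACTION.TOY.SCALED)
summary(FRACTION.TOY.SCALED$COOKING)
```

With the real data, dim() should report 2000 rows and 12 columns as in the sanity check.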

Grading: 2 pts; if this one is wrong, the following problems will end up being wrong as well.

6) Instead of running kmeans to get clusters, let's try hierarchical clustering this time. Run hclust with arguments dist(FRACTION.SCALED) and method="ward.D2" (my favorite way of measuring distance between clusters). Provide a plot of the dendrogram.

hierarchy <- hclust(dist(FRACTION.SCALED), method = "ward.D2")

plot(hierarchy)

Grading: 2 points

7) There's no extremely long uninterrupted set of vertical lines, so there's no obvious, natural choice for the number of clusters that reflects underlying structure. When this is the case, we choose a small value for the number of clusters and interpret them. Then, we add a cluster and see if the new cluster adds to our understanding of the problem, etc.
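The "long uninterrupted set of vertical lines" can also be read numerically from the merge heights stored in the hclust result. The sketch below uses made-up data with two obvious groups (TOY and TOY.HC are hypothetical names), so the top gap is deliberately large; with the Kroger fractions the gaps would be expected to look much more even.

```r
# Toy data (an assumption): 30 points in two clearly separated groups.
set.seed(2)
TOY <- data.frame(
  x = c(rnorm(15, 0), rnorm(15, 8)),
  y = c(rnorm(15, 0), rnorm(15, 8))
)
TOY.HC <- hclust(dist(TOY), method = "ward.D2")

# Heights of the last five merges, from the top of the tree down.
top_heights <- rev(tail(TOY.HC$height, 5))
round(top_heights, 2)

# Gap between consecutive merge heights; a dominant first gap corresponds to
# a long uninterrupted stretch of vertical lines above the 2-cluster cut.
gaps <- -diff(top_heights)
round(gaps, 2)
```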

Left-arrow FRACTION.SCALED into FRACTION.SCALED.WITH.ID and add columns k3, k4, and k5 to it which contain the cluster identities (found from cutree) when 3, 4, or 5 clusters are found.

Using aggregate, find the average value of each column in FRACTION.SCALED.WITH.ID broken down by k3, again but broken down by k4, and again broken down by k5 (e.g., aggregate(.~k3,data=FRACTION.SCALED.WITH.ID,FUN=mean), etc.). Put the aggregate command inside a round() function and print these averages to the screen to 2 digits.

# Sanity check: one row you'll see when breaking it down by k3

#   k3 ALCOHOL  BABY COOKING DRINKS FRUITVEG GRAIN HEALTH HOUSEHOLD  MEAT   PET PREPARED SNACKS   k4   k5
# 1  1    0.08 -0.35   -0.14   0.04    -0.10 -0.01   0.27      0.23 -0.18  0.17    -0.04   0.12 1.52 1.55
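The cutree and aggregate steps of problem 7 can be sketched as follows. TOY.SCALED is a made-up stand-in for FRACTION.SCALED with only three columns, and TOY.HC, TOY.WITH.ID, and BY.K3 are hypothetical names.

```r
# Toy stand-in for FRACTION.SCALED (an assumption): 40 rows, 3 columns.
set.seed(3)
TOY.SCALED <- as.data.frame(scale(data.frame(
  BABY    = rnorm(40),
  COOKING = rnorm(40),
  PET     = rnorm(40)
)))
TOY.HC <- hclust(dist(TOY.SCALED), method = "ward.D2")

# Copy the data, then append the cluster identities for 3, 4, and 5 clusters.
TOY.WITH.ID <- TOY.SCALED
TOY.WITH.ID$k3 <- cutree(TOY.HC, k = 3)
TOY.WITH.ID$k4 <- cutree(TOY.HC, k = 4)
TOY.WITH.ID$k5 <- cutree(TOY.HC, k = 5)

# Column means by k3. The k4 and k5 columns are averaged too, which is why
# they show up with non-integer values in the sanity check.
BY.K3 <- round(aggregate(. ~ k3, data = TOY.WITH.ID, FUN = mean), 2)
BY.K3
```

Swapping k3 for k4 or k5 in the formula gives the other two breakdowns.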

Grading: 2 points

8) Looking at the centers for the three-cluster scheme, the only obvious cluster-defining characteristic is BABY (cluster 3 has a value of 2.02, wow!), though it's clear clusters 1 and 2 are distinct groups since the signs of almost all columns are opposite. Look at the centers for the four-cluster scheme. Identify which cluster (1, 2, or 3) has been "split in two", and comment on whether our understanding of the households increases when going from 3 to 4 clusters. At this point, we'd try out 5 clusters, characterize the new clusters that emerged, determine if they are useful, etc.
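Because cutree cuts the same tree at different heights, the four-cluster scheme is the three-cluster scheme with one cluster split in two. Besides comparing centers, a cross-tabulation of the two label vectors is one way to see which cluster was split. The sketch below uses made-up data; TOY, TOY.HC, and SPLIT are hypothetical names.

```r
# Toy data (an assumption): 40 rows, 3 standardized columns.
set.seed(4)
TOY <- as.data.frame(scale(matrix(rnorm(120), ncol = 3)))
TOY.HC <- hclust(dist(TOY), method = "ward.D2")

k3 <- cutree(TOY.HC, k = 3)
k4 <- cutree(TOY.HC, k = 4)

# Rows are 3-cluster labels, columns are 4-cluster labels. The row with two
# nonzero counts is the cluster that was split.
SPLIT <- table(k3, k4)
SPLIT
rowSums(SPLIT > 0)
```

With the homework data, the same table could be built from the k3 and k4 columns of FRACTION.SCALED.WITH.ID.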

Response:

Grading: 2 pts

9) Kroger finds the five-cluster scheme to be most interesting and useful. Characterize each of the 5 clusters with a short, meaningful description (e.g., fast-food junkies who spend most of their money on snacks and prepared food). Clusters 1 and 2 don't have any obvious single-variable defining characteristics (nothing less than -1 or greater than 1). Cluster 1 looks like your "average" shopper (about average in every category). For Cluster 2, look for patterns across variables for what set of variables are "a bit above average" and "a bit below average" and tell a story. In tandem with interpreting the output from (7), also left-arrow KROGER into KROGER.HC, then add the cluster identities in the 5-cluster scheme in a column named k5, then use aggregate to look at the median value for each category, rounded to the nearest dollar.

Cluster 1 - "average shopper" (no real cluster-defining characteristics)

Cluster 2 -

Cluster 3 -

Cluster 4 -

Cluster 5 -
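The median-dollar summary requested in problem 9 might be sketched like this. KROGER.TOY and k5.toy are made-up stand-ins: in the homework, the dollar amounts come from KROGER and the labels from cutree on the hierarchy with k = 5.

```r
# Toy stand-ins (assumptions): dollar amounts and a made-up 5-cluster labeling.
set.seed(5)
KROGER.TOY <- data.frame(
  BABY    = round(runif(50, 0, 200), 2),
  COOKING = round(runif(50, 0, 500), 2),
  PET     = round(runif(50, 0, 100), 2)
)
k5.toy <- rep(1:5, each = 10)        # in the homework this comes from cutree(..., k = 5)

KROGER.TOY.HC <- KROGER.TOY
KROGER.TOY.HC$k5 <- k5.toy

# Median dollars per category within each cluster, rounded to the nearest dollar.
MEDIANS <- round(aggregate(. ~ k5, data = KROGER.TOY.HC, FUN = median), 0)
MEDIANS
```

Medians in raw dollars can be easier to turn into a plain-language description than averages of scaled log fractions, which is presumably why the problem asks for both views.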

Grading: 2 pts