Question

1 Approved Answer

Posted on Oct 16, 2024

Exercise 8 (last exercise!): Genre features (4 points) You now have the building blocks you need to inspect some simple features of each cluster, to

Exercise 8 (last exercise!): Genre features (4 points)

You now have the building blocks you need to inspect some simple features of each cluster, to see how "distinct" clusters are from one another. Let's do that by analyzing the top genres represented in each cluster.

First, recall that you previously identified the largest clusters:

In[]:

top_k_ebook_labels

{0, 15, 25, 26, 29}

Next, recall that you assigned e-books to clusters:

In[]:

labeled_metadata_df.head(5)

ebookgenrelabel

0 B000FA64PK Literature & Fiction 20

1 B000FA64QO Science Fiction & Fantasy 20

2 B000FBFMVG Literature & Fiction 20

3 B000JMKNQ0 Literature & Fiction 20

4 B000JMKX4W Literature & Fiction 15

Your task.Complete the function,calc_top_genres(labeled_metadata, top_labels), below. It takes as input two objects:

labeled_metadata: A pandas dataframe, formatted likelabeled_metadata_dfabove.
top_labels: A Python set of labels, liketop_k_ebook_labelsabove.

It should then do the following:

For each label intop_labels, it should determine thetwomost frequently occurring genres among the e-books with that label.
It should then return a single dataframe with two columns,'label'and'genre'. Each row should correspond to one (label, genre) pair. And per the preceding bullet, you expect to see two rows per label.

Regarding the number of rows per label, there are two exceptions. First, if a given label only has e-books from one genre, then there will only be one row. Second, if there are ties, then you should retain all pairs, in the same way you would have done in Exercise 6.

Note 0:Your function mustnotmodify the input arguments. The test cell will check for that and may fail with strange errors if you do so.

Note 1:The order of rows does not matter, as the test cell will use tibble comparison functions.

Example.A correct implementation will produce, for the callcalc_top_genres(labeled_metadata_df, top_k_ebook_labels)on the input dataset, the following result:

labelgenre0Literature & Fiction0Romance15Children's eBooks15Literature & Fiction25Health, Fitness & Dieting25Literature & Fiction26Science Fiction & Fantasy26Literature & Fiction29Literature & Fiction29Romance

Though not definitive, this result does suggest that the clustering captures distinct groups of books, here, for instance, with a cluster having'Romance'novels (label 0) being distinct from a cluster with'Children's eBooks'(label 15) and from another with'Health, Fitness & Dieting'(label 25), for instance.

In[]:

def calc_top_genres(labeled_metadata, top_labels): ### ### YOUR CODE HERE ###

Exercise 8 (last exercise!): Genre features (4 points) You now have the building blocks you need to inspect some simple features of each cluster, to see how "distinct" clusters are from one another. Let's do that by analyzing the top genres represented in each cluster. First, recall that you previously identified the largest clusters: In [ ]: top_k_ebook_labels Next, recall that you assigned e-books to clusters: In [ ]: labeled_metadata_df . head(5) Your task. Complete the function, calc_top_genres (labeled_metadata, top_labels), below. It takes as input two objects: labeled_metadata: A pandas dataframe, formatted like labeled_metadata_of above. top_labels: A Python set of labels, like top_k_ebook_labels above. It should then do the following: . For each label in top_labels, it should determine the two most frequently occurring genres among the e-books with that label. . It should then return a single dataframe with two columns, "label' and 'genre'. Each row should correspond to one (label, genre) pair. And per the preceding bullet, you expect to see two rows per label. Regarding the number of rows per label, there are two exceptions. First, if a given label only has e-books from one genre, then there will only be one row. Second, if there are ties, then you should retain all pairs, in the same way you would have done in Exercise Note 0. Your function must not modify the input arguments. The test cell will check for that and may fail with strange errors if you do so. Note 1: The order of rows does not matter, as the test cell will use tibble comparison functions. Example. A correct implementation will produce, for the call calc_top_genres (labeled_metadata_df, top_k_ebook_labels) on the input dataset, the following result: label genre 0 Literature & Fiction 0 Romance 5 Children's eBooks 15 Literature & Fiction Health, Fitness & Dieting 5 Literature & Fiction 26 Science Fiction & Fantasy 26 rature & Fiction 29 Literature & Fiction 29 Romance Though not definitive, this result does suggest that the clustering captures distinct groups of books, here, for instance, with a cluster having ' Romance' novels (label 0) being distinct from a cluster with 'children's eBooks' (label 15) and from another with 'Health, Fitness & Dieting' (label 25), for instance. In [ ]: def calc_top_genres (labeled_metadata, top_labels) : ### ### YOUR CODE HERE