Question
The data are log-transformed Mel spectrograms derived from the GTZAN dataset. The original GTZAN dataset contains 30-second audio files of songs associated with 10 different genres, 100 songs per genre. We have reduced the original data to four genres and transformed the data to obtain, for each song, log-transformed Mel spectrograms. Each Mel spectrogram is an image file which describes the time, frequency, and intensity of a song segment. In particular, the x-axis represents time, the y-axis is a transformation of the frequency to log scale and then to the so-called mel scale, and the colour of a point represents the decibels of that frequency at that time, with darker colours indicating lower decibels. Here you can see an example of a Mel spectrogram (the x and y ticks identify the pixels making up the picture):
[Figure: example Mel spectrogram]
The training data represent approximately … of the total number of data points, the validation set … and the test set ….
The labels of the classes are such that:
the first class corresponds to classical music
the second to disco music
the third to metal music
the fourth to rock music
CNN
For this exercise, you must use the CPU runtime.
The goal is to train a CNN-based classifier on the Mel spectrograms to predict the corresponding music genres. Implement the following CNN architecture:
- a 2D convolutional layer with … channels of square filters of size …, padding …, default stride, ReLU activation function, and default weight and bias initialisations;
- a 2D max-pooling layer with pool size … and stride …;
- a 2D convolutional layer with … channels of square filters of size …, padding …, default stride, ReLU activation function, and default weight and bias initialisations;
- a 2D max-pooling layer with pool size … and stride …;
- a 2D convolutional layer with … channels of square filters of size …, padding …, default stride, ReLU activation function, and default weight and bias initialisations;
- a 2D max-pooling layer with pool size … and stride …;
- a layer transforming the output feature maps into a 1D vector;
- a dense layer made of … neurons, with ReLU activation and L… regularisation with a penalty of …;
- an output layer with the required number of neurons and activation function.
Compile the model using an appropriate evaluation metric and loss function. To optimise, use the mini-batch stochastic gradient descent algorithm with batch size …. Train the model for … epochs.
IMPORTANT: For reproducibility of the results, before training your model you must run the following lines of code to fix the seeds:
tf.keras.utils.set_random_seed(seed)  # the numeric seed value is missing from the source
tf.config.experimental.enable_op_determinism()
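Since the numeric hyperparameters were lost from the text above, the following is only a structural sketch of the requested pipeline. The channel counts (32/64/128), kernel size 3, pool size 2, 128 dense units, and L2 penalty 1e-4 are placeholder assumptions, not the assignment's values:

```python
import tensorflow as tf

def build_cnn(input_shape, n_classes=4):
    """Sketch of the three-block CNN described above.

    All numeric hyperparameters are placeholders: the assignment's
    actual channel counts, kernel sizes, pool sizes, dense width and
    regularisation penalty were stripped from the source text.
    """
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # Conv block 1: 2D convolution + ReLU, then 2D max pooling.
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        # Conv block 2.
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        # Conv block 3.
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        # Flatten the feature maps to a 1D vector.
        tf.keras.layers.Flatten(),
        # Dense layer with ReLU and L2 weight regularisation (penalty assumed).
        tf.keras.layers.Dense(
            128, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
        # One output neuron per genre; softmax gives class probabilities.
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    # Mini-batch SGD with a loss/metric suited to integer class labels.
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

`model.fit(x_train, y_train_num, batch_size=…, epochs=…, validation_data=(x_val, y_val_num))` would then train it, and `model.count_params()` answers the parameter-count question below.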
Answer the following questions:
How many parameters does the model train? Before performing the training, do you expect this model to overfit? Which aspects would influence whether or not this model overfits?
Plot the loss function and the accuracy per epoch for the train and validation sets.
What accuracy do you obtain on the test set?
Using the function plot_confusion_matrix, plot the confusion matrices of the classification task on the train set and test set. What do you observe from this metric? Which classes display more correct predictions? And which more wrong ones?
Using the function ind_correct_uncorrect, extract the indexes of the training data that were predicted correctly and incorrectly for each class. For each music genre, perform the following steps:
Using the function plot_spectrograms, plot the Mel spectrograms of the first … data points which were predicted correctly and the first … which were predicted wrongly. Do you observe some differences among music genres?
Using the function print_wrong_prediction, print the predicted classes of the first … data points which were predicted wrongly.
Using the Grad-CAM method, implemented in the function plot_gradcam_spectrogram, print the heatmaps of the last pooling layer for the same extracts (correct and wrong). Comment on the heatmaps obtained. Do you observe differences among the heatmaps of different music genres? Can you understand why the model got some predictions wrong?
Comment on the previous question: what are your thoughts about the applicability of the Grad-CAM tool on these data?
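The Grad-CAM computation itself is hidden inside the provided plot_gradcam_spectrogram helper, whose code is not shown here. As a point of reference, a minimal generic sketch of what Grad-CAM does for a Keras model follows; the model and layer name are placeholders, not the assignment's:

```python
import numpy as np
import tensorflow as tf

def gradcam_heatmap(model, image, layer_name, class_index=None):
    """Minimal Grad-CAM sketch: weight the chosen layer's feature maps
    by the gradient of the class score with respect to them, then ReLU.
    `layer_name` is assumed to name a conv/pooling layer in `model`."""
    # Auxiliary model returning both the target layer's activations
    # and the final predictions.
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(layer_name).output, model.output])
    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        feature_maps, preds = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))  # predicted class
        score = preds[:, class_index]
    # Gradient of the class score w.r.t. the feature maps.
    grads = tape.gradient(score, feature_maps)
    # Global-average-pool the gradients: one importance weight per channel.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted sum of feature maps; keep only positive influence.
    cam = tf.nn.relu(tf.reduce_sum(feature_maps[0] * weights, axis=-1))
    cam = cam / (tf.reduce_max(cam) + 1e-8)  # normalise to [0, 1]
    return cam.numpy()
```

On spectrograms the resulting heatmap highlights which time/frequency regions drove the prediction, which is what the assignment asks you to interpret per genre.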
Disentangling time and frequency
The images we are using in this assignment are different from a usual picture: the x and y axes carry different meanings. With the tools we are exploring during lectures and seminars, can you propose a CNN architecture that treats the time and frequency components of the spectrograms differently?
Present and describe the architecture you have chosen and justify the rationale behind it. Plot training and validation loss and accuracy over epochs (this time you can use the GPU runtime if the model is slow to train). Print the accuracy on the test set and the confusion matrices on the training and test sets.
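One common way to treat the two axes differently is to use rectangular kernels: one branch convolves mainly along time (1×k kernels) and another mainly along frequency (k×1 kernels), and their outputs are merged. The sketch below is only one possible answer, with illustrative placeholder sizes:

```python
import tensorflow as tf

def build_time_freq_cnn(input_shape, n_classes=4):
    """Two-branch CNN sketch: one branch uses kernels elongated along
    the time (x) axis, the other along the frequency (y) axis, so the
    network can learn axis-specific patterns. All layer sizes are
    illustrative placeholders, not required values."""
    inputs = tf.keras.Input(shape=input_shape)
    # Time branch: kernels long in the time direction (1 x 9).
    t = tf.keras.layers.Conv2D(16, (1, 9), padding="same",
                               activation="relu")(inputs)
    t = tf.keras.layers.MaxPooling2D((2, 2))(t)
    # Frequency branch: kernels long in the frequency direction (9 x 1).
    f = tf.keras.layers.Conv2D(16, (9, 1), padding="same",
                               activation="relu")(inputs)
    f = tf.keras.layers.MaxPooling2D((2, 2))(f)
    # Merge the two views channel-wise and classify.
    x = tf.keras.layers.Concatenate()([t, f])
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The rationale: square kernels implicitly assume the two axes are interchangeable, whereas rhythmic structure lives along time and timbral/harmonic structure along frequency, so axis-specific receptive fields can capture each separately before fusion.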
Data:
x_test
x_train
x_val
y_test
y_test_num
y_train
y_train_num
y_val
y_val_num