Question
Need help with a small music recommendation system project in Python?
Music Recommendation System
Milestone 1
Problem Definition
The context: Why is this problem important to solve?
The objectives: What is the intended goal?
The key questions: What are the key questions that need to be answered?
The problem formulation: What are we trying to solve using data science?
Data Dictionary
The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.
song_data
song_id - A unique id given to every song
title - Title of the song
release - Name of the released album
artist_name - Name of the artist
year - Year of release
count_data
user_id - A unique id given to the user
song_id - A unique id given to the song
play_count - Number of times the song was played
Data Source
http://millionsongdataset.com/
Important Notes
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on what steps need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem, and we encourage you to explore them as an 'optional' exercise.
In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
Importing Libraries and the Dataset
In [ ]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# Used to ignore the warnings given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary that does not raise a key error for missing keys
from collections import defaultdict

# A performance metric in sklearn
from sklearn.metrics import mean_squared_error
Load the dataset
In [ ]:
# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/ADSP/capstone/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/ADSP/capstone/song_data.csv')
Understanding the data by viewing a few observations
In [ ]:
# See top 10 records of count_df data
In [ ]:
# See top 10 records of song_df data
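The two preview cells above are left empty in the notebook. A minimal sketch of how they might be filled in, using small made-up frames in place of the real CSV files (all values below are illustrative assumptions, not data from the Taste Profile Subset):

```python
import pandas as pd

# Toy stand-ins for count_df and song_df; the values are invented for illustration
count_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "play_count": [1, 3, 2],
})
song_df = pd.DataFrame({
    "song_id": ["s1", "s2"],
    "title": ["Song A", "Song B"],
    "release": ["Album A", "Album B"],
    "artist_name": ["Artist X", "Artist Y"],
    "year": [2007, 2009],
})

# head(n) returns the first n rows, or all rows if the frame has fewer than n
count_preview = count_df.head(10)
song_preview = song_df.head(10)

print(count_preview)
print(song_preview)
```

With the real files loaded, the same `head(10)` calls apply unchanged.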
Let us check the data types and missing values of each column
In [ ]:
# See the info of the count_df data
In [ ]:
# See the info of the song_df data
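One way to inspect data types and missing values is sketched below on a made-up frame (the rows are assumptions for illustration). `info()` prints dtypes and non-null counts, while `isnull().sum()` gives an explicit count of missing values per column.

```python
import pandas as pd

# Toy frame with one missing title; values are invented for illustration
song_df = pd.DataFrame({
    "song_id": ["s1", "s2", "s3"],
    "title": ["Song A", None, "Song C"],
    "year": [2007, 2008, 2009],
})

# Prints column names, non-null counts, and dtypes
song_df.info()

# Number of missing values in each column
missing_per_column = song_df.isnull().sum()
print(missing_per_column)
```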
Observations and Insights:_
In [ ]:
# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
# Drop the column 'Unnamed: 0'
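A possible sketch of the merge step described in the comment above, run on toy data. It assumes the `Unnamed: 0` column lives in `count_df` (a leftover index column from the CSV); the notebook does not say which frame carries it, so treat that as a guess.

```python
import pandas as pd

# Toy frames; "Unnamed: 0" in count_df is an assumption about the real file
count_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "play_count": [1, 3, 2],
})
song_df = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],  # s1 is listed twice on purpose
    "title": ["Song A", "Song A", "Song B"],
    "artist_name": ["Artist X", "Artist X", "Artist Y"],
})

# Keep one row per song_id so the left merge cannot multiply interaction rows
unique_songs = song_df.drop_duplicates(subset=["song_id"])

# A left merge keeps every row of count_df and attaches song details where available
df = pd.merge(count_df, unique_songs, on="song_id", how="left")

# Drop the leftover index column
df = df.drop(columns=["Unnamed: 0"])
print(df)
```

Deduplicating `song_df` before merging matters: without it, each interaction with a duplicated song would appear more than once in `df`.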
Think About It: As the user_id and song_id are encrypted, can they be encoded as numeric features?
In [ ]:
# Apply label encoding for "user_id" and "song_id"
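A minimal sketch of label encoding, using `pandas.factorize` with `sort=True` rather than scikit-learn's `LabelEncoder` (both map each distinct string to an integer; the choice here is only to keep the example dependency-light). The id strings are made up.

```python
import pandas as pd

# Toy interactions with string ids; values are invented for illustration
df = pd.DataFrame({
    "user_id": ["b80344d0", "b80344d0", "a1f3c2e9"],
    "song_id": ["SOAKIMP12", "SOBBMDR12", "SOAKIMP12"],
    "play_count": [1, 2, 1],
})

for column in ["user_id", "song_id"]:
    # factorize returns (integer codes, unique values); sort=True assigns
    # codes in sorted order of the unique values
    codes, uniques = pd.factorize(df[column], sort=True)
    df[column] = codes

print(df)
```

Keeping `uniques` around would allow mapping the integer codes back to the original ids later.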
Think About It: As the data also contains users who have listened to very few songs and vice versa, is it required to filter the data so that it contains users who have listened to a good count of songs and vice versa?
In [ ]:
# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of songs
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1
In [ ]:
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[~df.user_id.isin(remove_users)]
In [ ]:
# Get the column containing the songs
songs = df.song_id

# Create a dictionary from songs to their number of users
ratings_count = dict()

for song in songs:
    # If we already have the song, just add 1 to its rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[song] = 1
In [ ]:
# We want our songs to be listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120

# Create a list of songs that need to be removed
remove_songs = []

for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final = df.loc[~df.song_id.isin(remove_songs)]
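The four counting and filtering cells above can also be expressed with `value_counts`, which avoids the explicit Python loops. The sketch below uses toy data and much smaller cutoffs (2 instead of 90 and 120) so the effect is visible; the names `USER_CUTOFF` and `SONG_CUTOFF` are illustrative, not from the notebook.

```python
import pandas as pd

# Toy interactions; values are invented for illustration
df = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 5, 1, 3],
})

USER_CUTOFF = 2  # the notebook uses 90
SONG_CUTOFF = 2  # the notebook uses 120

# Step 1: keep users with at least USER_CUTOFF interaction rows
user_counts = df["user_id"].value_counts()
keep_users = user_counts[user_counts >= USER_CUTOFF].index
df = df[df["user_id"].isin(keep_users)]

# Step 2: count songs on the already user-filtered frame, as the notebook does
song_counts = df["song_id"].value_counts()
keep_songs = song_counts[song_counts >= SONG_CUTOFF].index
df_final = df[df["song_id"].isin(keep_songs)]

print(df_final)
```

Note that the order matters: song counts are computed after the user filter, so a song can fall below its cutoff because some of its listeners were removed in the first step.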
In [ ]:
# Drop records with play_count more than(>) 5
df_final = __________
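One possible way to fill in the blank above is a boolean mask that keeps rows with `play_count` of 5 or less. This is a sketch on toy data, not necessarily the intended answer.

```python
import pandas as pd

# Toy interactions; values are invented for illustration
df_final = pd.DataFrame({
    "user_id": [0, 0, 1, 1],
    "song_id": [10, 11, 10, 12],
    "play_count": [1, 7, 5, 6],
})

# Keeping rows with play_count <= 5 is the same as dropping rows with play_count > 5
df_final = df_final[df_final["play_count"] <= 5]

print(df_final.shape)
```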
In [ ]:
# Check the shape of the data
Exploratory Data Analysis
Let's check the total number of unique users, songs, artists in the data
Total number of unique user id
In [ ]:
# Display total number of unique user_id
Total number of unique song id
In [ ]:
# Display total number of unique song_id
Total number of unique artists
In [ ]:
# Display total number of unique artists
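The three counts above can each be obtained with `nunique()`. The sketch below uses toy data and also computes how full the user-song matrix is, assuming each row is a distinct user-song pair; the `density` variable is an illustrative addition, not something the notebook asks for.

```python
import pandas as pd

# Toy interactions; values are invented for illustration
df_final = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "song_id": [10, 11, 10, 12],
    "artist_name": ["Artist X", "Artist Y", "Artist X", "Artist X"],
})

n_users = df_final["user_id"].nunique()
n_songs = df_final["song_id"].nunique()
n_artists = df_final["artist_name"].nunique()

# Share of all possible user-song pairs that actually occur in the data
density = len(df_final) / (n_users * n_songs)

print(n_users, n_songs, n_artists, round(density, 3))
```

A low density is a useful observation for the Observations and Insights cell, since it indicates how sparse the interaction data is.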
Observations and Insights:__
Let's find out about the most interacted songs and interacted users
Most interacted songs
In [ ]:
Most interacted users
In [ ]:
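A hedged sketch for the two empty cells on most interacted songs and users. It assumes "interaction" means total `play_count`; counting rows per song or user instead (for example with `value_counts()`) is an equally reasonable reading.

```python
import pandas as pd

# Toy interactions; values are invented for illustration
df_final = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 5, 1, 3],
})

# Total plays per song, highest first
top_songs = (
    df_final.groupby("song_id")["play_count"].sum().sort_values(ascending=False)
)

# Total plays per user, highest first
top_users = (
    df_final.groupby("user_id")["play_count"].sum().sort_values(ascending=False)
)

print(top_songs.head())
print(top_users.head())
```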
Observations and Insights:___
Songs played in a year
In [ ]:
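# Count the records for each release year
count_songs = df_final.groupby('year').count()['title']
count = pd.DataFrame(count_songs)

# Drop the first row, i.e. the smallest value of 'year' in the index
count.drop(count.index[0], inplace = True)

# Show the last five years
count.tail()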
Out[ ]:
year | title
---|---
2006 | 7592
2007 | 13750
2008 | 14031
2009 | 16351
2010 | 4087
In [ ]:
# Create the plot

# Set the figure size
plt.figure(figsize = (30, 10))

sns.barplot(x = count.index, y = 'title', data = count, estimator = np.median)

# Set the y label of the plot
plt.ylabel('number of titles played')

# Show the plot
plt.show()
Observations and Insights:__
Think About It: What other insights can be drawn using exploratory data analysis?
Proposed approach
Potential techniques: What different techniques should be explored?
Overall solution design: What is the potential solution design?
Measures of success: What are the key measures of success to compare different potential techniques?