
Question

Need help with a small music recommendation system project in Python.

Music Recommendation System

Milestone 1

Problem Definition

The context: Why is this problem important to solve?

The objectives: What is the intended goal?

The key questions: What are the key questions that need to be answered?

The problem formulation: What are we trying to solve using data science?

Data Dictionary

The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.

song_data

song_id - A unique id given to every song

title - Title of the song

release - Name of the released album

artist_name - Name of the artist

year - Year of release

count_data

user_id - A unique id given to the user

song_id - A unique id given to the song

play_count - Number of times the song was played

Data Source

http://millionsongdataset.com/

Important Notes

This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem, and we encourage you to feel free and explore them as an 'optional' exercise.

In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

All the outputs in the notebook are just for reference and can be different if you follow a different approach.

There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.

Importing Libraries and the Dataset

In [ ]:

# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

In [ ]:

# Used to ignore the warnings given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary that does not raise a key error for missing keys
from collections import defaultdict

# A performance metric from sklearn
from sklearn.metrics import mean_squared_error

Load the dataset

In [ ]:

# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/ADSP/capstone/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/ADSP/capstone/song_data.csv')

Understanding the data by viewing a few observations

In [ ]:

# See top 10 records of count_df data 

In [ ]:

# See top 10 records of song_df data 

Let us check the data types and missing values of each column

In [ ]:

# See the info of the count_df data 

In [ ]:

# See the info of the song_df data 
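The inspection cells above are left blank for the learner. As a minimal sketch of what they might contain, the following uses a small invented frame (`toy_count_df` is a hypothetical stand-in, not the real `count_df`) to show `head`, `info`, and an explicit per-column count of missing values.

```python
import pandas as pd

# Toy stand-in for count_df; the values are invented for illustration
toy_count_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "song_id": ["s1", "s2", "s1", "s3"],
    "play_count": [1, 3, 2, 1],
})

# First rows (head(10) returns at most the top 10 records)
top_rows = toy_count_df.head(10)
print(top_rows)

# Column dtypes and non-null counts
toy_count_df.info()

# Missing values per column, as an explicit count
missing_per_column = toy_count_df.isna().sum()
print(missing_per_column)
```

On the real data, the same calls would be applied to `count_df` and `song_df` directly.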

Observations and Insights:_

In [ ]:

# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
# Drop the column 'Unnamed: 0'
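One possible way to carry out the merge described in the cell above is sketched below on invented data. The frames `toy_count_df` and `toy_song_df` are hypothetical, and de-duplicating on `song_id` alone (rather than on whole rows) is an assumption; the notebook only says to drop duplicates from the song data.

```python
import pandas as pd

# Toy stand-ins for count_df and song_df; the values are invented
toy_count_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "play_count": [1, 3, 2],
})
toy_song_df = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],
    "title": ["A", "A", "B"],
    "release": ["R1", "R1", "R2"],
    "artist_name": ["X", "X", "Y"],
    "year": [2008, 2008, 0],
})

# Left merge keeps every play-count row; de-duplicating the song table
# first prevents a repeated song_id from multiplying those rows
toy_df = pd.merge(
    toy_count_df,
    toy_song_df.drop_duplicates(subset="song_id"),
    on="song_id",
    how="left",
)

# Drop the leftover index column from the CSV export
toy_df = toy_df.drop(columns=["Unnamed: 0"])
print(toy_df)
```

Without the `drop_duplicates` step, the repeated `s1` row in the song table would produce two merged rows for every play-count record of `s1`.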

Think About It: Since the user_id and song_id values are encrypted strings, can they be encoded as numeric features?

In [ ]:

# Apply label encoding for "user_id" and "song_id" 
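A minimal sketch of label encoding follows, on invented ids. It uses pandas categorical codes rather than scikit-learn's `LabelEncoder`; both assign consecutive integers to the distinct values in sorted order. The frame `toy_df` and its id strings are hypothetical.

```python
import pandas as pd

# Toy frame with string ids; the values are invented for illustration
toy_df = pd.DataFrame({
    "user_id": ["b80344", "a1f2c9", "b80344"],
    "song_id": ["SOZZ", "SOAA", "SOAA"],
    "play_count": [1, 2, 1],
})

# Replace each distinct string with an integer code (0, 1, 2, ...)
for column in ["user_id", "song_id"]:
    toy_df[column] = toy_df[column].astype("category").cat.codes

print(toy_df)
```

The codes carry no ordering meaning; they are only compact identifiers, which is all that the later counting and filtering steps need.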

Think About It: The data contains users who have listened to very few songs, and songs that have been played by very few users. Should the data be filtered so that it keeps only users who have listened to a good number of songs and songs that have a good number of listeners?

In [ ]:

# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of songs
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

In [ ]:

# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[~df.user_id.isin(remove_users)]

In [ ]:

# Get the column containing the songs
songs = df.song_id

# Create a dictionary from songs to their number of users
ratings_count = dict()

for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1

In [ ]:

# We want each song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120

# Create a list of songs that need to be removed
remove_songs = []

for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final = df.loc[~df.song_id.isin(remove_songs)]
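The dictionary loops above count rows per user and per song, then drop the ids that fall below a cutoff. A more compact equivalent can be sketched with `value_counts`. The function name `filter_by_activity`, its parameters, and the toy data are hypothetical, and the toy cutoffs are set to 2 only so that the small example shows an effect.

```python
import pandas as pd

def filter_by_activity(df, min_songs_per_user, min_users_per_song):
    """Keep users with enough rows, then songs with enough rows."""
    # Rows per user, broadcast back onto each row
    user_counts = df["user_id"].map(df["user_id"].value_counts())
    df = df.loc[user_counts >= min_songs_per_user]

    # Song counts are computed after the user filter, as in the cells above
    song_counts = df["song_id"].map(df["song_id"].value_counts())
    return df.loc[song_counts >= min_users_per_song]

# Toy interactions; the values are invented for illustration
toy_df = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 5, 1, 3],
})

toy_final = filter_by_activity(toy_df, min_songs_per_user=2, min_users_per_song=2)
print(toy_final)
```

With the notebook's thresholds, the call would use `min_songs_per_user=90` and `min_users_per_song=120`. Note that the song filter can push some users back below the user cutoff; the cells above do not re-check this, and neither does this sketch.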

In [ ]:

# Drop records with play_count more than(>) 5
df_final = __________

In [ ]:

# Check the shape of the data 
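As a hedged illustration of the two cells above, the following sketch applies a boolean mask to an invented frame and then checks its shape. The names `toy_final` and `toy_filtered` are hypothetical; the blank in the notebook is left for the learner to fill in.

```python
import pandas as pd

# Toy interactions; the values are invented for illustration
toy_final = pd.DataFrame({
    "user_id": [0, 0, 1, 1],
    "song_id": [10, 11, 10, 11],
    "play_count": [1, 7, 5, 6],
})

# Keep rows with play_count of 5 or less (drops records with play_count > 5)
toy_filtered = toy_final[toy_final["play_count"] <= 5]

# (number of rows, number of columns) after filtering
print(toy_filtered.shape)
```

Capping the play count this way keeps the values in a narrow 1 to 5 range, similar to a rating scale.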

Exploratory Data Analysis

Let's check the total number of unique users, songs, artists in the data

Total number of unique user id

In [ ]:

# Display total number of unique user_id 

Total number of unique song id

In [ ]:

# Display total number of unique song_id 

Total number of unique artists

In [ ]:

# Display total number of unique artists 
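The three counts above can be obtained with `nunique`. The sketch below runs on an invented frame (`toy_final` is hypothetical) and also computes the share of possible user-song pairs that are actually observed, which is one way to see how sparse the interaction data is.

```python
import pandas as pd

# Toy interactions; the values are invented for illustration
toy_final = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "song_id": [10, 11, 10, 10],
    "artist_name": ["X", "Y", "X", "X"],
    "play_count": [1, 2, 1, 3],
})

n_users = toy_final["user_id"].nunique()
n_songs = toy_final["song_id"].nunique()
n_artists = toy_final["artist_name"].nunique()
print(n_users, n_songs, n_artists)

# Share of possible user-song pairs that appear in the data
# (assumes each user-song pair occurs in at most one row)
observed_share = len(toy_final) / (n_users * n_songs)
print(round(observed_share, 3))
```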

Observations and Insights:__

Let's find the songs and the users with the most interactions

Most interacted songs

In [ ]:

 

Most interacted users

In [ ]:

 

Observations and Insights:___
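The two "most interacted" cells above are empty in the notebook. One reading, sketched here on invented data, is to count rows per song and per user with `value_counts`. Treating one row as one interaction is an assumption; summing `play_count` instead would be an equally defensible reading. The frame `toy_final` is hypothetical.

```python
import pandas as pd

# Toy interactions; the values are invented for illustration
toy_final = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 5, 1, 3],
})

# Songs ranked by the number of rows (interactions) they appear in
top_songs = toy_final["song_id"].value_counts().head(10)
print(top_songs)

# Users ranked by the number of rows (interactions) they appear in
top_users = toy_final["user_id"].value_counts().head(10)
print(top_users)
```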

Songs played in a year

In [ ]:

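# Count the number of records for each release year
count_songs = df_final.groupby('year').count()['title']
count = pd.DataFrame(count_songs)

# Drop the first (smallest) year group, presumably a placeholder
# value for songs with no recorded release year
count.drop(count.index[0], inplace = True)

count.tail()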

Out[ ]:

title
year
2006 7592
2007 13750
2008 14031
2009 16351
2010 4087

In [ ]:

# Create the plot

# Set the figure size
plt.figure(figsize = (30, 10))

sns.barplot(x = count.index, y = 'title', data = count, estimator = np.median)

# Set the y label of the plot
plt.ylabel('number of titles played')

# Show the plot
plt.show()

Observations and Insights:__

Think About It: What other insights can be drawn using exploratory data analysis?

Proposed approach

Potential techniques: What different techniques should be explored?

Overall solution design: What is the potential solution design?

Measures of success: What are the key measures of success to compare different potential techniques?
