Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

CAN YOU PLEASE ANSWER ALL THE SIX QUESTIONS # Import all the required libraries import numpy as np import pandas as pd Download the dataset

CAN YOU PLEASE ANSWER ALL THE SIX QUESTIONS

# Import all the required libraries import numpy as np import pandas as pd

Download the dataset from here -

https://drive.google.com/drive/folders/10AcH3MvYo0aGK6uTB3FbYOqkwMBh18K8?usp=sharing

or you can also download it from:- https://grouplens.org/datasets/movielens/1m/

Reading the Data

column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"] ratings_data = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python') column_list_movies = ["MovieID","Title","Genres"] movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1') column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"] user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

Data analysis

data=pd.merge(pd.merge(ratings_data,user_data),movies_data) data

mean_ratings=data.pivot_table('Ratings','Title',aggfunc='mean') mean_ratings

mean_ratings=data.pivot_table('Ratings',index=["Title"],aggfunc='mean') top_15_mean_ratings = mean_ratings.sort_values(by = 'Ratings',ascending = False).head(15) top_15_mean_ratings

mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean') mean_ratings

data=pd.merge(pd.merge(ratings_data,user_data),movies_data) mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean') top_female_ratings = mean_ratings.sort_values(by='F', ascending=False) print(top_female_ratings.head(15)) top_male_ratings = mean_ratings.sort_values(by='M', ascending=False) print(top_male_ratings.head(15))

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F'] sorted_by_diff = mean_ratings.sort_values(by='diff') sorted_by_diff[:10]

ratings_by_title=data.groupby('Title').size() ratings_by_title.sort_values(ascending=False).head(10)

Question 1

Create ratings matrix using Numpy. This matrix allows us to see the ratings for a given movie and user ID. The element at location [,] is a rating given by user for movie . Print the shape of the matrix produced.

Additionally, choose 3 users that have rated the movie with MovieID "1377" (Batman Returns). Print these ratings, they will be used later for comparison.

Notes:

Do not use pivot_table.

A ratings matrix is not the same as ratings_data from above.

The ratings of movie with MovieID are stored in the (-1)th column (index starts from 0)

Not every user has rated every movie. Missing entries should be set to 0 for now.

If you're stuck, you might want to look into np.zeros and how to use it to create a matrix of the desired shape.

Every review lies between 1 and 5, and thus fits within a uint8 datatype, which you can specify to numpy.

# Create the matrix

# Print the shape

# Store and print ratings for Batman Returns

Question 2

Normalize the ratings matrix (created in Question 1) using Z-score normalization. While we can't use sklearn's StandardScaler for this step, we can do the statistical calculations ourselves to normalize the data.

Before you start:

Your first step should be to get the average of every column of the ratings matrix (we want an average by title, not by user!).

Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and NOT (sum of 10 ratings)/(total number of users)

All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.

In our matrix, 0 represents a missing rating.

Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every column. It may be very close but not exactly zero because of the limited precision floats allow.

Lastly, divide this by the standard deviation of the column.

Not every MovieID is used, leading to zero columns. This will cause a divide by zero error when normalizing the matrix. Simply replace any NaN values in your normalized matrix with 0.

Question 3

We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question. Perform the process using numpy, and along the way print the shapes of the , , and matrices you calculated.

# Compute the SVD of the normalised matrix

# Print the shapes

Question 4

Reconstruct four rank-k rating matrix , where = for k = [100, 1000, 2000, 3000]. Using each of make predictions for the 3 users selected in Question 1, for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.

Question 5

Cosine Similarity

Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is high if the angle between two vectors is 0, and the output value ranges within (,)[0,1][0,1].

0 means there is no similarity (perpendicular), where 11 (parallel) means that both the items are 100% similar.

(,)= / ||||||||

Based on the reconstruction rank-1000 rating matrix 1000 and the cosine similarity, sort the movies which are most similar. You will have a function top_movie_similarity which sorts data by its similarity to a movie with ID movie_id and returns the top items, and a second function print_similar_movies which prints the titles of said similar movies. Return the top 5 movies for the movie with ID 1377 (Batman Returns)

Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of zero resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie.

# Sort the movies based on cosine similarity def top_movie_similarity(data, movie_id, top_n=5): # Movie id starts from 1 #Use the calculation formula above pass def print_similar_movies(movie_titles, top_indices): print('Most Similar movies: ') # Print the top 5 movies for Batman Returns movie_id = 1377

Question 6

Movie Recommendations

Using the same process from Question 5, write top_user_similarity which sorts data by its similarity to a user with ID user_id and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that user_id has not yet seen. Find at least 5 movie recommendations for the user with ID 5954 and print their titles.

Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.

#Sort users based on cosine similarity def top_user_similarity(data, user_id): pass

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

More Books

Students also viewed these Databases questions

Question

Why does it take three lines to display amino acid information

Answered: 1 week ago