Download the dataset from here https drive google com drive folders 10AcH3MvYo0aGK6uTB3FbYOqkwMBh18K8 usp sharing or you can also download it from https grouplens org datasets movielens 1m column list ratings UserID , MovieID , Ratings , Timestamp ratings data pd read csv('ratings dat',sep ' ',names column list ratings, engine 'python') column list movies MovieID , Title , Genres movies data pd read csv('movies dat',sep ' ',names column list movies, engine 'python', encoding 'latin 1') column list users UserID , Gender , Age , Occupation , Zixp code user data pd read csv( users dat ,sep ,names column list users, engine 'python') Question 1 make ratings matrix using Numpy This matrix allows us to see the ratings for a given movie and user ID The element at location , is a rating given by user for movie Print the shape of the matrix produced Additionally, choose 3 users that have rated the movie with MovieID 1377 (Batman Returns) Print these ratings, they will be used later for comparison Notes Do not use pivot table A ratings matrix is not the same as ratings data from above The ratings of movie with MovieID are stored in the ( 1)th column (index starts from 0) Not every user has rated every movie Missing entries should be set to 0 for now If you're stuck, you might want to look into np zeros and how to use it to create a matrix of the desired shape Every review lies between 1 and 5, and thus fits within a uint8 datatype, which you can specify to numpy Create the matrix Print the shape Store and print ratings for Batman Returns Question 2 Normalize the ratings matrix (created in Question 1 ) using Z score normalization While we can't use sklearn's StandardScaler for this step, we can do the statistical calculations ourselves to normalize the data Before you start Your first step should be to get the average of every column of the ratings matrix (we want an average by title, not by user ) Make sure that the mean is calculated considering only non zero elements If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings) 10 and NOT (sum of 10 ratings) (total number of users) All of the missing values in the dataset should be replaced with the average rating for the given movie This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an expected value which is useful for computing correlations and recommendations in later steps In our matrix, 0 represents a missing rating Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every column It may be very close but not exactly zero because of the limited precision floats allow Lastly, divide this by the standard deviation of the column Not every MovieID is used, leading to zero columns This will cause a divide by zero error when normalizing the matrix Simply replace any NaN values in your normalized matrix with 0 Question 3 We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question Perform the process using numpy, and along the way print the shapes of the , , and matrices you calculated Compute the SVD of the normalised matrix Print the shapes Question 4 Reconstruct four rank k rating matrix , where for k 100, 1000, 2000, 3000 Using each of make predictions for the 3 users selected in Question 1, for the movie with ID 1377 (Batman Returns) Compare the original ratings with the predicted ratings Question 5 Cosine Similarity Cosine similarity is a metric used to measure how similar two vectors are Mathematically, it measures the cosine of the angle between two vectors projected in a multi dimensional space Cosine similarity is high if the angle between two vectors is 0, and the output value ranges within (,) 0,1 0,1 0 means there is no similarity (perpendicular), where 11 (parallel) means that both the items are 100 similar (,) Based on the reconstruction rank 1000 rating matrix 1000 and the cosine similarity, sort the movies which are most similar You will have a function top movie similarity which sorts data by its similarity to a movie with ID movie id and returns the top items, and a second function print similar movies which prints the titles of said similar movies Return the top 5 movies for the movie with ID 1377 ( Batman Returns ) Note While finding the cosine similarity, there are a few empty columns which will have a magnitude of zero resulting in NaN values These should be replaced by 0, otherwise these columns will show most similarity with the given movie Sort the movies based on cosine similarity def top movie similarity(data, movie id, top n 5) Movie id starts from 1 Use the calculation formula above pass def print similar movies(movie titles, top indices) print('Most Similar movies ') Print the top 5 movies for Batman Returns movie id 1377 Question 6 Movie Recommendations Using the same process from Question 5, write top user similarity which sorts data by its similarity to a user with ID user id and returns the top result Then find the MovieIDs of the movies that this similar user has rated most highly, but that user id has not yet seen Find at least 5 movie recommendations for the user with ID 5954 and print their titles Hint To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies

The Answer is in the image, click to view ...

Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Mar 01, 2024

Download the dataset from here - https://drive.google.com/drive/folders/10AcH3MvYo0aGK6uTB3FbYOqkwMBh18K8?usp=sharing or you can also download it from:- https://grouplens.org/datasets/movielens/1m/ column_list_ratings = [UserID, MovieID, Ratings,Timestamp] ratings_data = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings,

Download the dataset from here -

https://drive.google.com/drive/folders/10AcH3MvYo0aGK6uTB3FbYOqkwMBh18K8?usp=sharing

or you can also download it from:- https://grouplens.org/datasets/movielens/1m/

column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]    ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')    column_list_movies = ["MovieID","Title","Genres"]    movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1')    column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"]    user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

Question 1

make ratings matrix using Numpy. This matrix allows us to see the ratings for a given movie and user ID. The element at location [,] is a rating given by user for movie . Print the shape of the matrix produced.

Additionally, choose 3 users that have rated the movie with MovieID "1377" (Batman Returns). Print these ratings, they will be used later for comparison.

Notes:

Do not use pivot_table.
A ratings matrix is not the same as ratings_data from above.
The ratings of movie with MovieID are stored in the (-1)th column (index starts from 0)
Not every user has rated every movie. Missing entries should be set to 0 for now.
If you're stuck, you might want to look into np.zeros and how to use it to create a matrix of the desired shape.
Every review lies between 1 and 5, and thus fits within a uint8 datatype, which you can specify to numpy.

# Create the matrix

# Print the shape

# Store and print ratings for Batman Returns

Question 2

Normalize the ratings matrix (created in Question 1) using Z-score normalization. While we can't use sklearn's StandardScaler for this step, we can do the statistical calculations ourselves to normalize the data.

Before you start:

Your first step should be to get the average of every column of the ratings matrix (we want an average by title, not by user!).
Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and NOT (sum of 10 ratings)/(total number of users)
All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.
In our matrix, 0 represents a missing rating.
Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every column. It may be very close but not exactly zero because of the limited precision floats allow.
Lastly, divide this by the standard deviation of the column.
Not every MovieID is used, leading to zero columns. This will cause a divide by zero error when normalizing the matrix. Simply replace any NaN values in your normalized matrix with 0.

Question 3

We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question. Perform the process using numpy, and along the way print the shapes of the , , and matrices you calculated.

# Compute the SVD of the normalised matrix

# Print the shapes

Question 4

Reconstruct four rank-k rating matrix , where = for k = [100, 1000, 2000, 3000]. Using each of make predictions for the 3 users selected in Question 1, for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.

Question 5

Cosine Similarity

Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is high if the angle between two vectors is 0, and the output value ranges within (,)[0,1][0,1].

0 means there is no similarity (perpendicular), where 11 (parallel) means that both the items are 100% similar.

(,)= / ||||||||

Based on the reconstruction rank-1000 rating matrix ₁₀₀₀ and the cosine similarity, sort the movies which are most similar. You will have a function top_movie_similarity which sorts data by its similarity to a movie with ID movie_id and returns the top items, and a second function print_similar_movies which prints the titles of said similar movies. Return the top 5 movies for the movie with ID 1377 (Batman Returns)

Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of zero resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie.

# Sort the movies based on cosine similarity    def top_movie_similarity(data, movie_id, top_n=5):       # Movie id starts from 1       #Use the calculation formula above       pass    def print_similar_movies(movie_titles, top_indices):       print('Most Similar movies: ')    # Print the top 5 movies for Batman Returns    movie_id = 1377

Question 6

Movie Recommendations

Using the same process from Question 5, write top_user_similarity which sorts data by its similarity to a user with ID user_id and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that user_id has not yet seen. Find at least 5 movie recommendations for the user with ID 5954 and print their titles.

Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.