Question
Download the dataset from here:
https://drive.google.com/drive/folders/10AcH3MvYo0aGK6uTB3FbYOqkwMBh18K8?usp=sharing
Alternatively, download it from https://grouplens.org/datasets/movielens/1m/
import pandas as pd

# Ratings: one row per (user, movie) rating
column_list_ratings = ["UserID", "MovieID", "Ratings", "Timestamp"]
ratings_data = pd.read_csv('ratings.dat', sep='::', names=column_list_ratings, engine='python')

# Movies: latin-1 encoding is needed for some titles
column_list_movies = ["MovieID", "Title", "Genres"]
movies_data = pd.read_csv('movies.dat', sep='::', names=column_list_movies, engine='python', encoding='latin-1')

# Users
column_list_users = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]
user_data = pd.read_csv("users.dat", sep="::", names=column_list_users, engine='python')
Question 1
Make a ratings matrix using NumPy. This matrix allows us to see the rating for a given movie and user ID. The element at location $[i, j]$ is the rating given by user $i$ for movie $j$. Print the shape of the matrix produced.
Additionally, choose 3 users that have rated the movie with MovieID "1377" (Batman Returns). Print these ratings; they will be used later for comparison.
Notes:
- Do not use pivot_table.
- A ratings matrix is not the same as ratings_data from above.
- The ratings of the movie with MovieID $j$ are stored in the $(j-1)$th column (index starts from 0)
- Not every user has rated every movie. Missing entries should be set to 0 for now.
- If you're stuck, you might want to look into np.zeros and how to use it to create a matrix of the desired shape.
- Every rating lies between 1 and 5, and thus fits within a uint8 datatype, which you can specify to NumPy through the dtype argument.
# Create the matrix
# Print the shape
# Store and print ratings for Batman Returns
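The notes above can be sketched on a toy table. This is only an illustration under stated assumptions: `toy_ratings` and `build_ratings_matrix` are hypothetical names, the values are made up, and sizing the matrix by the maximum UserID and MovieID is one possible reading of the task.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings_data (made-up values, not the MovieLens file).
toy_ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 3, 3],
    "MovieID": [1, 3, 3, 2, 4],
    "Ratings": [5, 3, 4, 2, 1],
})

def build_ratings_matrix(ratings):
    """Return a (max UserID) x (max MovieID) uint8 matrix; 0 means 'not rated'."""
    n_users = int(ratings["UserID"].max())
    n_movies = int(ratings["MovieID"].max())
    matrix = np.zeros((n_users, n_movies), dtype=np.uint8)
    # IDs start at 1, so shift by one to get zero-based row/column indices.
    rows = ratings["UserID"].to_numpy() - 1
    cols = ratings["MovieID"].to_numpy() - 1
    matrix[rows, cols] = ratings["Ratings"].to_numpy()
    return matrix

toy_matrix = build_ratings_matrix(toy_ratings)
print(toy_matrix.shape)

# Up to 3 users who rated MovieID 3 (column index 2), with their ratings.
movie_col = 3 - 1
raters = np.nonzero(toy_matrix[:, movie_col])[0][:3]
print(raters + 1, toy_matrix[raters, movie_col])
```

On the real data the same function would take `ratings_data`, and the column for Batman Returns would be `1377 - 1`.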
Question 2
Normalize the ratings matrix (created in Question 1) using Z-score normalization. While we can't use sklearn's StandardScaler for this step, we can do the statistical calculations ourselves to normalize the data.
Before you start:
- Your first step should be to get the average of every column of the ratings matrix (we want an average by title, not by user!).
- Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and NOT (sum of 10 ratings)/(total number of users)
- All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.
- In our matrix, 0 represents a missing rating.
- Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every column. It may be very close but not exactly zero because of the limited precision floats allow.
- Lastly, divide this by the standard deviation of the column.
- Not every MovieID is used, leading to all-zero columns. These columns have a standard deviation of 0, so normalizing them divides by zero and produces NaN values. Simply replace any NaN values in your normalized matrix with 0.
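The steps listed above could look roughly like the following on a small matrix. `zscore_normalize` and `toy` are hypothetical names introduced for illustration, and the standard deviation is taken over the mean-filled column, which is an assumption about how the steps combine.

```python
import numpy as np

def zscore_normalize(ratings_matrix):
    """Column-wise z-score with missing (0) entries imputed by the column mean."""
    r = ratings_matrix.astype(np.float64)
    rated = r > 0                       # True where a rating exists
    counts = rated.sum(axis=0)          # number of ratings per movie
    with np.errstate(divide="ignore", invalid="ignore"):
        col_mean = r.sum(axis=0) / counts        # mean over rated entries only
        filled = np.where(rated, r, col_mean)    # impute missing ratings
        centered = filled - col_mean
        normalized = centered / filled.std(axis=0)
    # Unrated movies (and constant columns) give 0/0 = NaN; map those to 0.
    return np.nan_to_num(normalized, nan=0.0)

toy = np.array([[5, 0, 0],
                [3, 4, 0],
                [0, 2, 0],
                [1, 0, 0]], dtype=np.uint8)
toy_normalized = zscore_normalize(toy)
print(toy_normalized)
print(toy_normalized.mean(axis=0))  # each column mean should be (close to) 0
```

Casting to float first matters here: subtracting a mean from a uint8 array would otherwise wrap around or truncate.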
Question 3
We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question. Perform the process using NumPy, and along the way print the shapes of the $U$, $S$, and $V^T$ matrices you calculated.
# Compute the SVD of the normalised matrix
# Print the shapes
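As a minimal sketch, `np.linalg.svd` can be applied to a small random matrix standing in for the normalized ratings. The name `toy_normalized` and the choice of `full_matrices=False` are assumptions made for this example, not requirements stated in the question.

```python
import numpy as np

# Random stand-in for the normalized ratings matrix (6 users, 4 movies).
rng = np.random.default_rng(0)
toy_normalized = rng.standard_normal((6, 4))

# full_matrices=False returns the reduced ("economy") decomposition.
U, S, Vt = np.linalg.svd(toy_normalized, full_matrices=False)

# S is returned as a 1-D array of singular values, not a diagonal matrix.
print(U.shape, S.shape, Vt.shape)

# Sanity check: the three factors multiply back to the input.
print(np.allclose(U @ np.diag(S) @ Vt, toy_normalized))
```

The reduced form keeps `U` no wider than the number of columns of the input, which saves memory when there are more users than movies.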
Question 4
Reconstruct four rank-$k$ rating matrices $R_k$, where $R_k = U_k S_k V_k^T$ for $k$ = [100, 1000, 2000, 3000]. Using each $R_k$, make predictions for the 3 users selected in Question 1 for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.
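A rank-$k$ reconstruction keeps only the first $k$ singular values and the matching columns of $U$ and rows of $V^T$. The sketch below assumes a hypothetical helper `reconstruct_rank_k` and a small random matrix, so the values of `k` are scaled down from those in the question.

```python
import numpy as np

def reconstruct_rank_k(U, S, Vt, k):
    """Return U_k S_k V_k^T using the k largest singular values."""
    # Multiplying the columns of U_k by S_k equals U_k @ diag(S_k).
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
toy = rng.standard_normal((6, 4))
U, S, Vt = np.linalg.svd(toy, full_matrices=False)

errors = []
for k in [1, 2, 4]:
    R_k = reconstruct_rank_k(U, S, Vt, k)
    errors.append(np.linalg.norm(toy - R_k))
    # Prediction for user row 0, movie column 2 at this rank.
    print(k, R_k[0, 2], errors[-1])
```

The reconstructed values are on the z-score scale of the matrix that was decomposed. Comparing them with raw 1 to 5 ratings may require mapping back with each movie's mean and standard deviation; the question does not say which comparison is expected.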
Question 5
Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. Cosine similarity is highest when the angle between the two vectors is 0, and the output value satisfies $\text{cosine}(x, y) \in [0, 1]$ for vectors with non-negative entries (in general it lies in $[-1, 1]$).
0 means there is no similarity (perpendicular), whereas 1 (parallel) means that both items are 100% similar.
$\text{cosine}(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$
Based on the rank-1000 reconstructed rating matrix $R_{1000}$ and the cosine similarity, sort the movies by how similar they are to a given movie. You will have a function top_movie_similarity, which sorts data by its similarity to a movie with ID movie_id and returns the top items, and a second function print_similar_movies, which prints the titles of said similar movies. Return the top 5 movies for the movie with ID 1377 (Batman Returns).
Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of zero resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie.
# Sort the movies based on cosine similarity
def top_movie_similarity(data, movie_id, top_n=5):
    # Movie id starts from 1
    # Use the calculation formula above
    pass

def print_similar_movies(movie_titles, top_indices):
    print('Most Similar movies: ')

# Print the top 5 movies for Batman Returns
movie_id = 1377
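One possible way to fill in `top_movie_similarity` is sketched below on a tiny matrix. Excluding the query movie from its own results is an assumption, since the question does not say whether it should appear, and `toy` is made-up data.

```python
import numpy as np

def top_movie_similarity(data, movie_id, top_n=5):
    """Return zero-based column indices of the top_n most similar movies."""
    target = data[:, movie_id - 1]            # MovieID j lives in column j-1
    norms = np.linalg.norm(data, axis=0) * np.linalg.norm(target)
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = (data.T @ target) / norms      # cosine against every column
    sims = np.nan_to_num(sims, nan=0.0)       # empty columns: NaN -> 0
    order = np.argsort(-sims)                 # most similar first
    order = order[order != movie_id - 1]      # assumption: skip the movie itself
    return order[:top_n]

toy = np.array([[1.0, 2.0, 0.0, -1.0],
                [2.0, 4.0, 0.0,  1.0],
                [3.0, 6.0, 0.0,  0.0]])
top = top_movie_similarity(toy, movie_id=1, top_n=2)
print(top + 1)  # convert column indices back to MovieIDs
```

Here column 1 is parallel to column 0 and ranks first, while the all-zero column 2 would produce NaN without the `nan_to_num` step.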
Question 6
Movie Recommendations
Using the same process from Question 5, write top_user_similarity which sorts data by its similarity to a user with ID user_id and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that user_id has not yet seen. Find at least 5 movie recommendations for the user with ID 5954 and print their titles.
Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.
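The same idea applied to rows gives a sketch of the recommendation step. `recommend_movies` is a hypothetical helper, the toy ratings are made up, and using the raw toy ratings for similarity (rather than a reconstructed matrix) is a simplification for this example.

```python
import numpy as np

def top_user_similarity(data, user_id):
    """Return the zero-based row index of the user most similar to user_id."""
    target = data[user_id - 1]
    norms = np.linalg.norm(data, axis=1) * np.linalg.norm(target)
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = (data @ target) / norms
    sims = np.nan_to_num(sims, nan=0.0)
    sims[user_id - 1] = -np.inf               # never return the user themselves
    return int(np.argmax(sims))

def recommend_movies(ratings_matrix, user_id, similar_idx, top_n=5):
    """MovieIDs the similar user rated highest that user_id has not rated."""
    seen = ratings_matrix[user_id - 1] > 0
    candidate = ratings_matrix[similar_idx].astype(float)
    candidate[seen] = 0                       # drop movies already rated
    order = np.argsort(-candidate, kind="stable")
    order = order[candidate[order] > 0]       # keep only movies actually rated
    return order[:top_n] + 1                  # column index -> MovieID

toy_raw = np.array([[5, 4, 0, 0, 1],
                    [5, 5, 4, 3, 1],
                    [1, 0, 5, 0, 5]], dtype=np.uint8)
similar = top_user_similarity(toy_raw.astype(float), user_id=1)
recs = recommend_movies(toy_raw, user_id=1, similar_idx=similar)
print(similar + 1, recs)
```

On the real data the returned MovieIDs could then be looked up in `movies_data` to print titles and compare genres, as the hint suggests.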