Question
(Please run the code and take a screenshot.)
The given code is a combination of two Pig Latin scripts. The first script finds the oldest 5-star movies, and the second script finds bad movies with an average rating below 2.0. The schema and content of the metadata relation are also shown.
To run these scripts, save each one to its own file with a .pig extension and execute it with the Pig interpreter. The input data files must also exist at the specified locations.
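Before running either script, it can help to confirm that the metadata file loads with the expected schema. The following is a minimal sketch that uses the same path, delimiter, and schema as the scripts in this question. The alias firstRows and the file name inspect_metadata.pig are hypothetical.

```pig
-- Run with, for example:  pig inspect_metadata.pig   (file name is hypothetical)

-- Load the metadata with the same delimiter and schema used by both scripts
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
        videoRelease:chararray, imdbLink:chararray);

-- Print the schema of the relation
DESCRIBE metadata;

-- Print only a few rows rather than the whole file (firstRows is a hypothetical alias)
firstRows = LIMIT metadata 5;
DUMP firstRows;
```

LIMIT keeps the output short enough to fit in a single screenshot.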
-> The first script is used to find the oldest 5-star movies:
-- Load the ratings data with a given schema
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
-- Load the metadata with a specified delimiter and schema
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
-- Extract the movie ID, title, and release time from the metadata relation
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
-- Group the ratings by movie ID
ratingsByMovie = GROUP ratings BY movieID;
-- Calculate the average rating for each movie
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
-- Filter the movies with an average rating greater than 4.0
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
-- Join the five-star movies with the name lookup relation to get the movie title and release time
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
-- Order the movies by release time to get the oldest 5-star movies
oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;
-- Output the results
DUMP oldestFiveStarMovies;
Explanation for step 1
This script loads the movie ratings data and metadata from two files in HDFS, using LOAD with a given schema and USING PigStorage('|') with a specified delimiter and schema, respectively. The metadata relation is transformed to extract the movie ID, title, and release time using FOREACH and ToDate/ToUnixTime built-in functions. The ratings are grouped by movie ID, and the average rating is calculated for each movie using AVG. The movies with an average rating greater than 4.0 are filtered using FILTER, and the result is joined with the name lookup relation using JOIN. Finally, the movies are ordered by release time using ORDER, and the results are displayed using DUMP.
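To check the release date conversion in isolation, the same ToDate/ToUnixTime expression can be applied to a handful of rows and inspected by eye. This is only a sketch; the aliases releaseTimes and firstTimes are hypothetical and do not appear in the original script.

```pig
-- Load the metadata with the same delimiter and schema as the first script
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
        videoRelease:chararray, imdbLink:chararray);

-- Keep the raw date string next to the converted value so the two can be compared
releaseTimes = FOREACH metadata GENERATE
    movieTitle,
    releaseDate,
    ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

-- Show a few rows only (firstTimes is a hypothetical alias)
firstTimes = LIMIT releaseTimes 5;
DUMP firstTimes;
```

Because ORDER in the first script sorts on releaseTime in ascending order by default, smaller values here correspond to older movies.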
-> The second script is used to find bad movies with an average rating below 2.0:
-- Load the ratings data with a given schema
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
-- Load the metadata with a specified delimiter and schema
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
-- Extract the movie ID and title from the metadata relation
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;
-- Group the ratings by movie ID
groupedRatings = GROUP ratings BY movieID;
-- Calculate the average rating and number of ratings for each movie
averageRatings = FOREACH groupedRatings GENERATE group AS movieID, AVG(ratings.rating) AS avgRating, COUNT(ratings.rating) AS numRatings;
-- Filter the movies with an average rating less than 2.0
badMovies = FILTER averageRatings BY avgRating < 2.0;
-- Join the bad movies with the name lookup relation to get
Explanation for step 2
This script loads the same ratings data and metadata from two files in HDFS, using LOAD with a given schema and USING PigStorage('|') with a specified delimiter and schema, respectively. The metadata relation is reduced to the movie ID and title using FOREACH; no release date conversion is needed here. The ratings are grouped by movie ID, and both the average rating and the number of ratings are calculated for each movie using AVG and COUNT. The movies with an average rating less than 2.0 are kept using FILTER, and the result is then joined with the name lookup relation using JOIN. The listing above stops at the comment introducing that join, so the JOIN statement and any later sorting or output steps are not shown.
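The listing for the second script breaks off at its last comment, so its remaining statements are unknown. As a minimal sketch of one way it could finish, assume the goal is to print each bad movie's title with its average rating and rating count, most-rated first. The aliases badMoviesWithData, finalResults, and finalResultsSorted, the output field name movieName, and the sort order are all assumptions and not part of the original.

```pig
-- Steps repeated from the second script so this sketch runs on its own
ratings = LOAD '/user/maria_dev/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
        videoRelease:chararray, imdbLink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;
groupedRatings = GROUP ratings BY movieID;
averageRatings = FOREACH groupedRatings GENERATE group AS movieID,
    AVG(ratings.rating) AS avgRating, COUNT(ratings.rating) AS numRatings;
badMovies = FILTER averageRatings BY avgRating < 2.0;

-- ASSUMED continuation: attach titles to the bad movies
badMoviesWithData = JOIN badMovies BY movieID, nameLookup BY movieID;

-- ASSUMED: keep only the title, average rating, and rating count
finalResults = FOREACH badMoviesWithData GENERATE
    nameLookup::movieTitle AS movieName,
    badMovies::avgRating AS avgRating,
    badMovies::numRatings AS numRatings;

-- ASSUMED: list the most-rated bad movies first
finalResultsSorted = ORDER finalResults BY numRatings DESC;

DUMP finalResultsSorted;
```

After the join, fields are disambiguated with the relation::field syntax, the same form the first script uses in nameLookup::releaseTime.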