Question
In your capacity as a data scientist newly appointed to a distinguished baseball organisation, you have been tasked with a pivotal mission: to delve deep into a specific dataset that revolves around the implementation of distributed processing. The dataset at the centre of this analysis consists of intricate play-by-play baseball statistics meticulously curated by Retrosheet.
Retrosheet is a venerable institution celebrated for its unwavering commitment to amassing detailed play-by-play accounts from every Major League Baseball game. Their process extends far beyond mere data collection, as they employ rigorous validation and adhere to a standardised scoring protocol, which facilitates the transformation of this wealth of information into a structured computerised format.
Retrosheet's contribution to the world of baseball analytics is truly remarkable: it has transcribed play-by-play accounts for a very large number of games spanning many seasons. However, the dataset provided is not free of imperfections, which stem from human error in the manual encoding of records. These inaccuracies require careful attention and robust strategies to correct them and improve data quality within our analysis.
You have been entrusted with a comprehensive dataset spanning a two-year period, encompassing various file types. These files are organised as follows:
Team Files: These files contain a comprehensive listing of teams participating in each year. Each team listing is uniquely identified by a letter code that serves as a reference for that team across all other files. These files are consistently named with the "TEAM" prefix.
Roster Files: Roster files contain detailed information about the players associated with each team. The naming convention for these files involves using the letter code of the team followed by the respective year. Roster files are characterised by the .ROS file extension.
Event Files: Event files document the home games of individual teams during a given year. The filenames begin with the year and are followed by a letter code for the home team. Event files come in various extensions: .EVA extensions correspond to American League teams, .EVN extensions denote National League teams, and .EVE extensions signify postseason games. It's essential to note that the data within each of these files is organised with commas as delimiters, and the records are newline-terminated.
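The naming rules above can be turned into a small file classifier, which is handy before loading anything into HDFS. This is only a sketch: the function name `classify_file` and the sample filenames are hypothetical, and it assumes that the extension and the "TEAM" prefix alone are enough to tell the file types apart.

```python
import os


def classify_file(path):
    """Guess the file type from its name, following the naming rules above."""
    name = os.path.basename(path).upper()
    # Extensions are checked first because they are the most specific signal.
    if name.endswith(".ROS"):
        return "roster"
    if name.endswith(".EVA"):
        return "event_american_league"
    if name.endswith(".EVN"):
        return "event_national_league"
    if name.endswith(".EVE"):
        return "event_postseason"
    # Team files are only described as carrying the "TEAM" prefix.
    if name.startswith("TEAM"):
        return "team"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical filenames, used purely for illustration.
    for sample in ["TEAM2010", "XXX2010.ROS", "2010XXX.EVA", "notes.txt"]:
        print(sample, "->", classify_file(sample))
```

Files that come back as "unknown" are worth inspecting by hand before they are included in any count.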
In this context, your responsibility, using the Hadoop framework, is to analyse and summarise the dataset by addressing the following tasks:
What is the total number of games represented?
What is the total number of records in the dataset?
What is the relationship between player IDs and player names?
Remember that you'll need to design MapReduce jobs to process and aggregate the data for each of these tasks. These jobs should read, parse, and summarise the data according to the questions you've posed. Once the Hadoop jobs are complete, you can generate reports or output files with the calculated summary information.
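The map and reduce logic for the three tasks can be prototyped locally before writing the actual Hadoop jobs. The sketch below simulates the two phases in memory and rests on two assumptions that are not stated in the text above: that each game in an event file opens with a record whose first field is `id`, and that roster records begin with player ID, last name, first name. All function names and sample lines are hypothetical.

```python
import csv
from collections import defaultdict


def map_event_line(line):
    """Map phase for event files: emit (key, 1) pairs for one record."""
    line = line.strip()
    if not line:
        return []
    pairs = [("records", 1)]
    # Assumption: a record whose first field is "id" starts a new game.
    if line.split(",", 1)[0] == "id":
        pairs.append(("games", 1))
    return pairs


def map_roster_line(line):
    """Map phase for roster files: emit (player_id, full_name)."""
    fields = next(csv.reader([line.strip()]), [])
    # Assumption: fields are player ID, last name, first name, ...
    if len(fields) < 3 or not fields[0]:
        return []
    return [(fields[0], f"{fields[2]} {fields[1]}")]


def reduce_counts(pairs):
    """Reduce phase: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def reduce_names(pairs):
    """Reduce phase: collect the distinct names seen for each player ID."""
    names = defaultdict(set)
    for player_id, name in pairs:
        names[player_id].add(name)
    return {pid: sorted(found) for pid, found in names.items()}


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the real dataset.
    events = ["id,XXX201001010", "version,2", "play,1,0,abcde001,00,,S7",
              "id,XXX201001020"]
    rosters = ["abcde001,Doe,John,R,R,XXX,OF", "abcde001,Doe,John,R,R,YYY,OF"]
    print(reduce_counts(p for l in events for p in map_event_line(l)))
    print(reduce_names(p for l in rosters for p in map_roster_line(l)))
```

A player ID that ends up with more than one name in the `reduce_names` output is a candidate encoding error. Note that this sketch counts event records only; if "records in the dataset" is meant to include team and roster lines, the same counting mapper would need to run over those files too.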
Additionally, you might need to address data quality and handle any potential data anomalies or inconsistencies during the parsing and processing of the dataset.
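One modest way to approach the data-quality point is to flag suspicious lines rather than silently drop them. The following sketch is an assumption-laden illustration: the set `KNOWN_TYPES` lists record types commonly seen in Retrosheet event files and is not exhaustive, and the function name `check_event_line` is hypothetical.

```python
import csv

# Assumption: common event-file record types; this list is not exhaustive.
KNOWN_TYPES = {"id", "version", "info", "start", "sub", "play", "com", "data"}


def check_event_line(line):
    """Return None if the line looks well formed, else a short reason."""
    stripped = line.strip()
    if not stripped:
        return "empty line"
    # csv handles quoted fields such as player names containing commas.
    fields = next(csv.reader([stripped]))
    record_type = fields[0].strip().lower()
    if record_type not in KNOWN_TYPES:
        return f"unknown record type: {fields[0]!r}"
    if len(fields) < 2 or not fields[1].strip():
        return "missing fields"
    return None


if __name__ == "__main__":
    # Hypothetical sample lines, including deliberate errors.
    for sample in ["id,XXX201001010", "plya,1,0,abcde001", "id,", ""]:
        print(repr(sample), "->", check_event_line(sample))
```

Counting the reasons returned across all files gives a quick picture of how much manual-encoding noise the dataset contains before any Pig or Hive step runs.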
To attain the objectives outlined above, kindly provide the following:
Submit straightforward Pig scripts for the purpose of filtering out relevant records and subsequently performing a count.
Store the complete dataset on the Hadoop Distributed File System (HDFS). Please furnish the requisite scripts for this task, accompanied by a screenshot.
Utilise Hive for the purpose of structuring and tabulating the data. Kindly supply both the script used for this operation and a screenshot to illustrate the process.
Provide the script used to identify consistencies across teams.