Question

Posted on Sep 30, 2024

In your capacity as a data scientist newly appointed to a distinguished baseball organisation, you have been tasked with a pivotal mission: to delve deep into a specific dataset that revolves around the implementation of distributed processing. The dataset at the centre of this analysis consists of intricate play-by-play baseball statistics meticulously curated by Retrosheet.

Retrosheet is a venerable institution celebrated for its unwavering commitment to amassing detailed play-by-play accounts from every Major League Baseball game. Their process extends far beyond mere data collection, as they employ rigorous validation and adhere to a standardised scoring protocol, which facilitates the transformation of this wealth of information into a structured computerised format.

Retrosheet's contribution to the world of baseball analytics is truly remarkable, having successfully transcribed play-by-play accounts for an astonishing 170,000 games spanning all the way back to the year 1901.

However, it is essential to acknowledge that the dataset provided is not devoid of imperfections. This imperfection is attributed to the presence of human error stemming from the manual encoding of records. Such inaccuracies necessitate our thorough attention and demand the development of robust strategies to rectify and enhance the data quality within our analysis.

You have been entrusted with a comprehensive dataset spanning a two-year period, encompassing various file types. These files are organised as follows:

Team Files: These files contain a comprehensive listing of teams participating in each year. Each team listing is uniquely identified by a 3-letter code that serves as a reference for that team across all other files. These files are consistently named with the "TEAM" prefix.

Roster Files: Roster files contain detailed information about the players associated with each team. The naming convention for these files involves using the 3-letter code of the team followed by the respective year. Roster files are characterised by the .ROS file extension.

Event Files: Event files document the home games of individual teams during a given year. The filenames begin with the year and are followed by a 3-letter code for the home team. Event files come in various extensions: .EVA extensions correspond to American League teams, .EVN extensions denote National League teams, and .EVE extensions signify post-season games. It's essential to note that the data within each of these files is organised with commas as delimiters, and the records are newline-terminated.
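The naming conventions above can be turned into a small filename classifier, which is useful for routing each file to the right parsing step before it reaches a Hadoop job. The following is a minimal Python sketch, not part of the assignment: the function name `classify`, the returned dictionary layout, and the decision to accept digits inside the team code are all assumptions made for illustration.

```python
import re

# Patterns follow the naming conventions described in the text.
# Assumption: team codes are three characters and may contain digits.
ROSTER_RE = re.compile(r"^([A-Z0-9]{3})(\d{4})\.ROS$")
EVENT_RE = re.compile(r"^(\d{4})([A-Z0-9]{3})\.(EVA|EVN|EVE)$")

# Meaning of each event-file extension, as stated in the text.
EVENT_KINDS = {
    "EVA": "American League",
    "EVN": "National League",
    "EVE": "post-season",
}


def classify(filename):
    """Return a dict describing which kind of file `filename` is (hypothetical helper)."""
    name = filename.upper()

    match = EVENT_RE.match(name)
    if match:
        return {
            "type": "event",
            "year": int(match.group(1)),
            "team": match.group(2),
            "league": EVENT_KINDS[match.group(3)],
        }

    match = ROSTER_RE.match(name)
    if match:
        return {"type": "roster", "team": match.group(1), "year": int(match.group(2))}

    # Team files are only described as carrying the "TEAM" prefix.
    if name.startswith("TEAM"):
        return {"type": "team"}

    return {"type": "unknown"}


if __name__ == "__main__":
    # Hypothetical filenames, used only to exercise the classifier.
    for example in ["2023BOS.EVA", "BOS2023.ROS", "TEAM2023", "notes.txt"]:
        print(example, classify(example))
```

Files classified as `unknown` are worth logging rather than silently dropping, since the dataset is known to contain manual encoding errors.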

In this context, your responsibility, using the Hadoop framework, is to analyse and summarise the dataset by addressing the following tasks:

1.1. What is the total number of represented games?

1.2. What is the total number of records in the dataset?

1.3. What is the relationship between player IDs and player names?

Remember that you'll need to design MapReduce jobs to process and aggregate the data for each of these tasks. These jobs should read, parse, and summarise the data according to the questions posed above. Once the Hadoop jobs are complete, you can generate reports or output files with the calculated summary information.
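The map and reduce steps for the two counting questions can be prototyped locally before being ported to Hadoop (for example as Hadoop Streaming scripts or as a Pig script). The following Python sketch simulates one such job in memory. It rests on an assumption that is not stated in the assignment: that each game in an event file starts with a record whose first comma-separated field is `id`. The function names and the sample lines are hypothetical and serve only to show the counting logic.

```python
from collections import defaultdict


def mapper(line):
    """Emit (key, 1) pairs for one raw input line."""
    record = line.strip()
    if not record:
        return  # ignore blank lines so they are not counted as records
    yield ("records", 1)
    # Assumption: a record whose first field is "id" marks the start of a game.
    if record.split(",", 1)[0].lower() == "id":
        yield ("games", 1)


def reducer(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run the mapper over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in mapper(line))
    return reducer(pairs)


# Made-up sample lines in a comma-delimited, newline-terminated layout.
sample_lines = [
    "id,BOS202304010\n",
    "version,2\n",
    "info,visteam,NYA\n",
    "\n",
    "play,1,0,doej001,00,,S7\n",
    "id,BOS202304020\n",
]

totals = run_job(sample_lines)

if __name__ == "__main__":
    print(totals)
```

Whether the total record count should cover only event files or every file in the dataset is a choice the assignment leaves open; the same mapper works either way, since it only depends on which files are fed to it.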

Additionally, you might need to address data quality and handle any potential data anomalies or inconsistencies during the parsing and processing of the dataset.
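One way to examine the relationship between player IDs and player names, while also surfacing encoding errors, is to group names by ID and IDs by name and then report any group with more than one member. The Python sketch below illustrates this under an assumed roster layout in which the first three comma-separated fields are player ID, last name, and first name. That layout, the function names, and the sample rows are assumptions made for illustration and should be checked against the actual files.

```python
from collections import defaultdict


def parse_roster_line(line):
    """Return (player_id, full_name), or None when the line is malformed."""
    fields = [field.strip() for field in line.strip().split(",")]
    # Assumption: fields are player ID, last name, first name, then other columns.
    if len(fields) < 3 or not fields[0]:
        return None
    player_id, last_name, first_name = fields[0], fields[1], fields[2]
    return player_id.lower(), f"{first_name} {last_name}"


def summarise(lines):
    """Group names by ID and IDs by name, counting malformed lines separately."""
    names_by_id = defaultdict(set)
    ids_by_name = defaultdict(set)
    malformed = 0
    for line in lines:
        if not line.strip():
            continue  # blank lines are neither records nor errors
        parsed = parse_roster_line(line)
        if parsed is None:
            malformed += 1
            continue
        player_id, name = parsed
        names_by_id[player_id].add(name)
        ids_by_name[name].add(player_id)
    return {
        "ids_with_multiple_names": {
            pid: sorted(names) for pid, names in names_by_id.items() if len(names) > 1
        },
        "names_with_multiple_ids": {
            name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
        },
        "malformed": malformed,
    }


# Made-up roster rows: one ID spelled two ways, one name shared by two IDs.
sample_rows = [
    "doej001,Doe,John,R,R,BOS,P",
    "doej001,Doe,Jon,R,R,NYA,P",
    "doej002,Doe,John,L,L,CHN,OF",
    "broken",
]

report = summarise(sample_rows)

if __name__ == "__main__":
    print(report)
```

An ID that maps to several spellings is a likely manual-encoding error, whereas a name that maps to several IDs may simply be two different players, so the two groups call for different handling.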

To attain the objectives outlined above, kindly provide the following:

Submit straightforward Pig scripts for the purpose of filtering out relevant records and subsequently performing a count.

Store the complete dataset on the Hadoop Distributed File System (HDFS). Please furnish the requisite scripts for this task, accompanied by a screenshot.

Utilise Hive for the purpose of structuring and tabulating the data. Kindly supply both the script used for this operation and a screenshot to illustrate the process.

Provide the script used to identify consistencies across teams.