Question

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
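To make the column layout concrete, here is a minimal sketch of reading such records into named fields. It assumes the raw files are comma-separated with the header row shown above and that missing values are written as NA; the function name `parse_rows` is hypothetical and only for illustration.

```python
import csv
import io

# Header row of the flight data file, as listed in the question.
HEADER = (
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay"
)

def parse_rows(text):
    """Yield one dict per record, mapping column name to value (None for NA)."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {name: (None if value == "NA" else value) for name, value in row.items()}

# First record from the sample in the question.
sample = HEADER + "\n" + (
    "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,"
    "SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA\n"
)
for record in parse_rows(sample):
    print(record["UniqueCarrier"], record["Origin"], record["Dest"], record["ArrDelay"])
```

Note that in these 1987 records TaxiIn, TaxiOut, and CancellationCode are all NA, so any job that uses those columns has to skip missing values.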
This is the file content. Give me MapReduce job code that works with this data.
In this project, you will develop an Oozie workflow to process and analyze a large volume of flight data.
Instructions:
1. Students will be automatically placed in groups of 2-3 for this project.
2. Install Hadoop/Oozie on your AWS VMs.
3. Download the Airline On-time Performance data set (flight data set) for the period from October 1987 to April 2008 from the following website: Data Expo 2009: Airline on-time data.
4. Design, implement, and run an Oozie workflow to find out the following:
o the 3 airlines with the highest and the 3 airlines with the lowest probability of being on schedule;
o the 3 airports with the longest and the 3 airports with the shortest average taxi time per flight (both in and out); and
o the most common reason for flight cancellations.
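The three analyses above each fit the same pattern: the map step emits a key with a (value, count) pair, and the reduce step sums the pairs per key. The following is a minimal sketch in plain Python that simulates that pattern locally; it is not Hadoop code, and a real job would typically be written in Java or run through Hadoop Streaming. The function names and the 10-minute on-schedule threshold are assumptions for illustration, since the assignment does not define "on schedule". The sketch also assumes TaxiOut belongs to the origin airport and TaxiIn to the destination airport.

```python
from collections import defaultdict

# Zero-based column positions, taken from the header row of the data file.
UNIQUE_CARRIER, ARR_DELAY = 8, 14
ORIGIN, DEST = 16, 17
TAXI_IN, TAXI_OUT = 19, 20
CANCELLED, CANCELLATION_CODE = 21, 22

# Assumption: a flight is "on schedule" if it arrives at most this many minutes late.
ON_TIME_THRESHOLD_MIN = 10

def map_on_time(fields):
    """Emit (carrier, (1 or 0, 1)) for flights with a known arrival delay."""
    delay = fields[ARR_DELAY]
    if delay != "NA":
        yield fields[UNIQUE_CARRIER], (int(int(delay) <= ON_TIME_THRESHOLD_MIN), 1)

def map_taxi(fields):
    """Emit (airport, (taxi minutes, 1)); taxi-out for origin, taxi-in for destination."""
    if fields[TAXI_OUT] != "NA":
        yield fields[ORIGIN], (int(fields[TAXI_OUT]), 1)
    if fields[TAXI_IN] != "NA":
        yield fields[DEST], (int(fields[TAXI_IN]), 1)

def map_cancellation(fields):
    """Emit (cancellation code, (1, 1)) for cancelled flights with a recorded code."""
    if fields[CANCELLED] == "1" and fields[CANCELLATION_CODE] not in ("NA", ""):
        yield fields[CANCELLATION_CODE], (1, 1)

def run_job(lines, mapper):
    """Simulate shuffle and reduce: sum the (value, count) pairs for each key."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        if line.startswith("Year"):  # skip the header row
            continue
        for key, (value, count) in mapper(line.rstrip("\n").split(",")):
            totals[key][0] += value
            totals[key][1] += count
    return {key: (value, count) for key, (value, count) in totals.items()}

def ranked_ratios(totals):
    """Return (value / count, key) pairs sorted from highest to lowest ratio."""
    return sorted(((v / c, k) for k, (v, c) in totals.items()), reverse=True)

# The six sample records from the question.
SAMPLE = [
    "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
    "1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
    "1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
    "1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
    "1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
    "1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
]

on_time = ranked_ratios(run_job(SAMPLE, map_on_time))
print("highest on-time probability:", on_time[:3])
print("lowest on-time probability:", on_time[-3:])
```

The same `run_job` and `ranked_ratios` calls apply to `map_taxi` (take the first and last three entries) and to `map_cancellation` (take the key with the largest count). In this data set the cancellation codes are A for carrier, B for weather, C for NAS, and D for security. One reasonable layout for the workflow is one MapReduce job per question, which satisfies the requirement of at least three jobs.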
Requirements:
1. Your workflow must contain at least three MapReduce jobs that run in fully distributed mode.
2. Run your workflow to analyze the entire data set (total 22 years from 1987 to 2008) at one time on two VMs first and then gradually increase the system scale to the maximum allowed number of VMs for at least 5 increment steps and measure each corresponding workflow execution time.
3. Run your workflow to analyze the data in a progressive manner with an increment of 1 year, i.e., the first year (1987), the first 2 years (1987-1988), the first 3 years (1987-1989),..., and the total 22 years (1987-2008), on the maximum allowed number of VMs, and measure each corresponding workflow execution time.
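The progressive runs in requirement 3 can be driven by a small script. The sketch below builds the 22 cumulative year ranges and times one run per range. The `run_workflow` argument is a hypothetical placeholder for whatever actually submits the Oozie workflow for the given years and waits for it to finish; it is not part of any real API.

```python
import time

def cumulative_year_ranges(first=1987, last=2008):
    """Return [[1987], [1987, 1988], ..., [1987, ..., 2008]]."""
    return [list(range(first, end + 1)) for end in range(first, last + 1)]

def time_runs(run_workflow, year_ranges):
    """Call run_workflow(years) for each range and record (number of years, seconds)."""
    results = []
    for years in year_ranges:
        start = time.perf_counter()
        run_workflow(years)  # hypothetical: submit the workflow and block until done
        results.append((len(years), time.perf_counter() - start))
    return results

# Example with a stand-in that does nothing, just to show the shape of the output.
for n_years, seconds in time_runs(lambda years: None, cumulative_year_ranges()[:3]):
    print(n_years, "year(s):", round(seconds, 3), "s")
```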
Milestone 1 Submission:
1. A project report in PDF that includes:
A diagram that shows the structure of your Oozie workflow
A detailed description of the algorithm you designed to solve each of the problems
Give me a diagram of the structure of the Oozie workflow and a detailed description of how each MapReduce job solves its problem.
