
Question


DESCRIPTION

Use Spark features for data analysis and to surface valuable insights.

Problem Statement: You are working as a Big Data engineer in an insurance company. Your job is to analyze road traffic, accident, and weather data and derive valuable insights.

Domain: BFSI

Analysis to be done: Exploratory analysis, to determine actionable insights of each city.


Content:

The dataset contains five years of hourly weather information for multiple cities: 30 US and Canadian cities and 6 Israeli cities. For each city, a few details (country, latitude, and longitude) are provided in a separate file.

The dataset contains the following CSV files (a sketch for loading them with Spark appears after the data link below):

  • city_attributes.csv == 4 columns

    • City

    • Country

    • Latitude

    • Longitude

  • humidity.csv == 37 columns

    • Datetime

    • Humidity of 36 cities

  • pressure.csv == 37 columns

    • Datetime

    • Pressure of 36 cities

  • temperature.csv == 37 columns

    • Datetime

    • Temperature of 36 cities

  • weather_description.csv == 37 columns

    • Datetime

    • Weather Description of 36 cities

  • wind_direction.csv == 37 columns

    • Datetime

    • Wind Direction of 36 cities

  • wind_speed.csv == 37 columns

    • Datetime

    • Wind Speed of 36 cities

Data Link:

  • https://www.kaggle.com/selfishgene/historical-hourly-weather-data
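Each per-metric file is wide: one timestamp column plus one column per city. A minimal Scala/Spark sketch of loading the files and reshaping them into a long (datetime, city, value) form is shown below; the file paths, the local master setting, and the assumption that the first column is the timestamp are mine, not part of the dataset description.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object LoadWeatherCsv {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-weather-csv")
          .master("local[*]")                    // assumption: local exploratory run
          .getOrCreate()

        // City metadata: City, Country, Latitude, Longitude
        val cities = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/city_attributes.csv")       // assumed local path

        // Wide metric file: one timestamp column plus one column per city (36 cities)
        val humidityWide = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/humidity.csv")              // assumed local path

        // Reshape wide -> long: (datetime, city, humidity) using stack()
        val dtCol     = humidityWide.columns.head            // assumption: first column is the timestamp
        val cityCols  = humidityWide.columns.tail
        val stackExpr = cityCols.map(c => s"'$c', `$c`").mkString(", ")
        val humidityLong: DataFrame = humidityWide.selectExpr(
          s"`$dtCol` as datetime",
          s"stack(${cityCols.length}, $stackExpr) as (city, humidity)")

        // Attach country/latitude/longitude so insights can be reported per city
        val enriched = humidityLong.join(cities, humidityLong("city") === cities("City"), "left")
        enriched.show(5, truncate = false)
      }
    }

The same reshape applies to pressure, temperature, weather description, wind direction, and wind speed, since all six metric files share the same wide layout.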

Insights on Historical Data

  1. Create the segments below

    1. 00:00 to 04:00

    2. 04:00 to 08:00

    3. 08:00 to 12:00

    4. 12:00 to 16:00

    5. 16:00 to 20:00

    6. 20:00 to 24:00

    7. Daily

    8. Monthly

  2. In each segment, calculate the values below (a Spark sketch of this aggregation follows this list)

    1. For Numerical Columns

      1. Minimum

      2. Maximum

      3. Average

      4. Total records

    2. For weather description

      1. Description and percentage

      2. Total Samples

  3. Once the Spark jobs are done, the aggregated data files should be written to S3.
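A minimal sketch of the segment aggregation in Spark/Scala, assuming a long-format input like the one built earlier (datetime, city, temperature / weather_description); the segment labels, column names, and function names are assumptions.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object SegmentAggregates {
      // Map an hour of day (0-23) to one of the six 4-hour segment labels, e.g. "08:00-12:00"
      val hourToSegment = udf { hour: Int =>
        f"${(hour / 4) * 4}%02d:00-${(hour / 4) * 4 + 4}%02d:00"
      }

      // Min / max / average / record count per city, day, and 4-hour segment for one numeric metric
      def fourHourStats(temps: DataFrame): DataFrame =
        temps
          .withColumn("ts", to_timestamp(col("datetime")))
          .withColumn("date", to_date(col("ts")))
          .withColumn("segment", hourToSegment(hour(col("ts"))))
          .groupBy("city", "date", "segment")
          .agg(
            min("temperature").as("min"),
            max("temperature").as("max"),
            avg("temperature").as("avg"),
            count("temperature").as("total_records"))

      // Weather description: share of each description plus total samples per segment
      def descriptionShare(desc: DataFrame): DataFrame = {
        val withSeg = desc
          .withColumn("ts", to_timestamp(col("datetime")))
          .withColumn("date", to_date(col("ts")))
          .withColumn("segment", hourToSegment(hour(col("ts"))))
        val totals = withSeg.groupBy("city", "date", "segment").agg(count("*").as("total_samples"))
        withSeg.groupBy("city", "date", "segment", "weather_description")
          .agg(count("*").as("samples"))
          .join(totals, Seq("city", "date", "segment"))
          .withColumn("percentage", round(col("samples") / col("total_samples") * 100, 2))
      }
    }

The daily and monthly variants would group by (city, date) and (city, month) instead of the 4-hour segment, and each resulting DataFrame can then be written to S3, e.g. fourHourStats(temps).write.parquet("s3a://<your-bucket>/agg/segments/").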

Weather Dashboard API

Build a weather dashboard that uses APIs to show data to the end user.

  1. Details of each weather attribute by city

    1. Daily

    2. Monthly

  2. Details of cities by Country

    1. Returns all cities data

      1. Daily

      2. Monthly

You must store the data in Amazon DocumentDB (example queries follow the list below).

  1. Collection to store 4-hour aggregate data

    1. DB Queries

      1. Get segments data by city or date range

  2. Collection to store daily aggregate data

    1. DB Queries

      1. Get daily data by city or date range

  3. Collection to store monthly aggregate data

    1. DB Queries

      1. Get monthly data by city or month range

  4. Country Collection

    1. For a country, find all cities' data on a given date
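A sketch of these queries in the mongo shell (Amazon DocumentDB is MongoDB-compatible); the schema below, including the collection names segment_4h, daily, monthly, and country_daily and the field names city, date, and month, is an assumption, not something given in the problem.

    // 1. Get 4-hour segment data by city or by date range (assumed collection: segment_4h)
    db.segment_4h.find({ city: "Boston" })
    db.segment_4h.find({ date: { $gte: "2015-01-01", $lte: "2015-01-31" } })

    // 2. Get daily data by city or by date range (assumed collection: daily)
    db.daily.find({ city: "Boston", date: { $gte: "2015-01-01", $lte: "2015-01-31" } })

    // 3. Get monthly data by city or by month range (assumed collection: monthly)
    db.monthly.find({ city: "Boston", month: { $gte: "2015-01", $lte: "2015-12" } })

    // 4. Country collection: all cities' data for a given country on a given date (assumed collection: country_daily)
    db.country_daily.find({ country: "Israel", date: "2015-06-15" })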

Week 1: Approach Overview and Basic Configurations

  1. Install Maven (3.6.2).

  2. Set the Maven environment variables.

    1. Check that Maven is set up properly using mvn -version.

  3. Install Java 1.8 and Scala 2.11.7.

  4. Use IntelliJ to validate or modify the source code.

  5. Run mvn clean install to build the jar file.

  6. Use README.md for detailed instructions and helper commands.

  7. Create a Kafka producer, providing the Kafka configurations (a sketch follows this list).

  8. Read the CSV files and send the data to Kafka.

  9. Run the Kafka producer 7 times, once for each file and topic.
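A minimal sketch of such a producer, assuming Scala with the standard kafka-clients API; the broker address is a placeholder, and the file path and topic are taken from the command line so the same program can be run once per file/topic pair.

    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    object CsvKafkaProducer {
      def main(args: Array[String]): Unit = {
        // args(0) = CSV file path, args(1) = Kafka topic (run once per file/topic)
        val Array(csvPath, topic) = args

        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        val source = Source.fromFile(csvPath)
        try {
          // Send every CSV line as one Kafka record
          source.getLines().foreach { line =>
            producer.send(new ProducerRecord[String, String](topic, line))
          }
        } finally {
          source.close()
          producer.close()
        }
      }
    }

Each run pairs one file with one topic, e.g. humidity.csv with a humidity topic, which is what step 9 above refers to.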

Week 2: ETL with Flume

  1. Download Flume.

  2. Upload the Flume tar file to HDFS, then download and extract it on the Hadoop node.

  3. Specify the ZooKeeper connection and the topic in the Flume configuration file used to consume the data (the folder above contains a separate Flume configuration for each topic, since multiple topics are used; a sample configuration follows this list).

  4. Run the Flume agent.
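A sketch of one such Flume agent configuration, assuming the ZooKeeper-based Kafka source shipped with older Flume releases (which matches the "zookeeper and topic" wording above); the ZooKeeper host, topic name, and HDFS path are placeholders, and one file like this would exist per topic.

    # Kafka source -> memory channel -> HDFS sink (one agent per topic)
    agent.sources  = kafka-src
    agent.channels = mem-ch
    agent.sinks    = hdfs-sink

    # Kafka source consuming via ZooKeeper; host and topic are placeholders
    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.zookeeperConnect = zk-host:2181
    agent.sources.kafka-src.topic = humidity
    agent.sources.kafka-src.groupId = flume-weather
    agent.sources.kafka-src.channels = mem-ch

    agent.channels.mem-ch.type = memory
    agent.channels.mem-ch.capacity = 10000
    agent.channels.mem-ch.transactionCapacity = 1000

    # Land the raw events in HDFS for the Spark jobs; path is a placeholder
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs:///user/weather/humidity
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.hdfs.writeFormat = Text
    agent.sinks.hdfs-sink.hdfs.rollInterval = 300
    agent.sinks.hdfs-sink.channel = mem-ch

The agent would then be started with something like: flume-ng agent --conf conf --conf-file humidity.conf --name agent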

Week 3: Data Streaming

  1. Create a sample Maven Scala project.

  2. Add the necessary Spark dependencies.

  3. Create schemas for the CSV files.

  4. Create a Spark session (see the sketch after this list).

    1. Add the S3 details.

    2. Add all variables to your environment, as they contain sensitive data.

  5. Read each CSV file and convert it into a Dataset.

  6. Create a map of city to country.

  7. Convert the date to hour, month, year, day, and day bucket using a UDF.

  8. Iterate through all metrics for each column.

  9. For each type of segment, calculate stats for the different cities: max, min, average, and total records.
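A minimal sketch of steps 3-4 and 6 (an explicit schema plus a Spark session wired for S3 from environment variables), assuming the hadoop-aws/s3a connector is on the classpath; the bucket name, object keys, and environment-variable names are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object WeatherEtl {
      def main(args: Array[String]): Unit = {
        // S3 credentials come from the environment so they never live in source control
        val awsKey    = sys.env("AWS_ACCESS_KEY_ID")       // assumed variable names
        val awsSecret = sys.env("AWS_SECRET_ACCESS_KEY")

        val spark = SparkSession.builder()
          .appName("weather-etl")
          .config("spark.hadoop.fs.s3a.access.key", awsKey)
          .config("spark.hadoop.fs.s3a.secret.key", awsSecret)
          .getOrCreate()

        // Explicit schema for city_attributes.csv instead of inferSchema
        val citySchema = StructType(Seq(
          StructField("City", StringType, nullable = false),
          StructField("Country", StringType, nullable = false),
          StructField("Latitude", DoubleType, nullable = true),
          StructField("Longitude", DoubleType, nullable = true)))

        val cities = spark.read
          .option("header", "true")
          .schema(citySchema)
          .csv("s3a://my-weather-bucket/raw/city_attributes.csv")   // assumed bucket and key

        // Map of city -> country for later enrichment (step 6)
        val cityToCountry: Map[String, String] =
          cities.collect().map(r => r.getString(0) -> r.getString(1)).toMap

        println(cityToCountry.take(5))
      }
    }

The day-bucket UDF and the per-segment statistics for step 9 follow the same pattern as the SegmentAggregates sketch shown earlier under "Insights on Historical Data".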

Week 4: Data Analysis and Visualization

  1. Save the final dataset into Amazon S3.

  2. For running the Spark jobs, refer to README.md.

  3. Create an Amazon DocumentDB cluster.

  4. Create a RedHat Linux machine on Amazon EC2 and configure the required Linux packages.

  5. Configure and connect using the MongoDB CLI.

  6. Build a cluster with MongoDB.

  7. Connect your application to the cluster.

  8. Connect Spark to the MongoDB-compatible cluster on AWS (a sketch follows this list).

  9. Check that the data lands in the HDFS path configured in the Flume conf.

  10. Finally, you will be able to view the actual reports and data.
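A sketch of steps 1 and 8 (publishing an aggregate DataFrame to S3 and to the DocumentDB cluster), assuming the MongoDB Spark connector 2.x is on the classpath; the DOCDB_URI environment variable, the weather database, and the collection names are placeholders, not part of the assignment.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    object PublishAggregates {
      // Write one aggregate DataFrame both to S3 (step 1) and to DocumentDB (steps 3-8)
      def publish(df: DataFrame, s3Path: String, collection: String): Unit = {
        // Parquet copy on S3 for downstream analysis
        df.write.mode(SaveMode.Overwrite).parquet(s3Path)

        // DocumentDB is MongoDB-compatible, so the MongoDB Spark connector is used here
        df.write
          .format("com.mongodb.spark.sql.DefaultSource")
          .mode(SaveMode.Append)
          .option("uri", sys.env("DOCDB_URI"))   // assumed env var, e.g. mongodb://user:pass@<cluster-endpoint>:27017/?ssl=true
          .option("database", "weather")         // assumed database name
          .option("collection", collection)      // e.g. segment_4h, daily, monthly
          .save()
      }
    }

    // Usage (assumed names):
    //   PublishAggregates.publish(fourHourStats, "s3a://my-weather-bucket/agg/segments/", "segment_4h")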
