
Question


DESCRIPTION

Use Spark features for data analysis and to surface valuable insights.

Problem Statement: You are working as a Big Data engineer in an insurance company. Your job is to analyze road traffic, accident, and weather data and derive valuable insights.

Domain: BFSI

Analysis to be done: Exploratory analysis, to determine actionable insights of each city.


Content:

The dataset contains five years of hourly weather information for multiple cities: 30 US and Canadian cities and 6 Israeli cities. For each city, a few details (country, latitude, and longitude) are provided in a separate file.

The dataset contains the following CSV files (a sketch for loading them with Spark appears after the data link below):

  • city_attributes.csv == 4 columns

    • City

    • Country

    • Latitude

    • Longitude

  • humidity.csv == 37 columns

    • Datetime

    • Humidity of 36 cities

  • pressure.csv == 37 columns

    • Datetime

    • Pressure of 36 cities

  • temperature.csv == 37 columns

    • Datetime

    • Temperature of 36 cities

  • weather_description.csv == 37 columns

    • Datetime

    • Weather Description of 36 cities

  • wind_direction.csv == 37 columns

    • Datetime

    • Wind Direction of 36 cities

  • wind_speed.csv == 37 columns

    • Datetime

    • Wind Speed of 36 cities

Data Link:

  • https://www.kaggle.com/selfishgene/historical-hourly-weather-data
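Each per-metric file is wide: one timestamp column plus one column per city. A minimal Scala/Spark sketch of loading the files and reshaping them into a long (datetime, city, value) form is shown below; the file paths, the local master setting, and the assumption that the first column is the timestamp are mine, not part of the dataset description.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object LoadWeatherCsv {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-weather-csv")
          .master("local[*]")                    // assumption: local exploratory run
          .getOrCreate()

        // City metadata: City, Country, Latitude, Longitude
        val cities = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/city_attributes.csv")       // assumed local path

        // Wide metric file: one timestamp column plus one column per city (36 cities)
        val humidityWide = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/humidity.csv")              // assumed local path

        // Reshape wide -> long: (datetime, city, humidity) using stack()
        val dtCol     = humidityWide.columns.head            // assumption: first column is the timestamp
        val cityCols  = humidityWide.columns.tail
        val stackExpr = cityCols.map(c => s"'$c', `$c`").mkString(", ")
        val humidityLong: DataFrame = humidityWide.selectExpr(
          s"`$dtCol` as datetime",
          s"stack(${cityCols.length}, $stackExpr) as (city, humidity)")

        // Attach country/latitude/longitude so insights can be reported per city
        val enriched = humidityLong.join(cities, humidityLong("city") === cities("City"), "left")
        enriched.show(5, truncate = false)
      }
    }

The same reshape applies to pressure, temperature, weather description, wind direction, and wind speed, since all six metric files share the same wide layout.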

Insights on Historical Data

  1. Create the segments below

    1. 00:00 to 04:00

    2. 04:00 to 08:00

    3. 08:00 to 12:00

    4. 12:00 to 16:00

    5. 16:00 to 20:00

    6. 20:00 to 24:00

    7. Daily

    8. Monthly

  2. In each segment, calculate the values below (a Spark sketch of this aggregation follows this list)

    1. For Numerical Columns

      1. Minimum

      2. Maximum

      3. Average

      4. Total records

    2. For weather description

      1. Description and percentage

      2. Total Samples

  3. Once the Spark jobs are done, the aggregated data files should be written to S3.
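A minimal sketch of the segment aggregation in Spark/Scala, assuming a long-format input like the one built earlier (datetime, city, temperature / weather_description); the segment labels, column names, and function names are assumptions.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object SegmentAggregates {
      // Map an hour of day (0-23) to one of the six 4-hour segment labels, e.g. "08:00-12:00"
      val hourToSegment = udf { hour: Int =>
        f"${(hour / 4) * 4}%02d:00-${(hour / 4) * 4 + 4}%02d:00"
      }

      // Min / max / average / record count per city, day, and 4-hour segment for one numeric metric
      def fourHourStats(temps: DataFrame): DataFrame =
        temps
          .withColumn("ts", to_timestamp(col("datetime")))
          .withColumn("date", to_date(col("ts")))
          .withColumn("segment", hourToSegment(hour(col("ts"))))
          .groupBy("city", "date", "segment")
          .agg(
            min("temperature").as("min"),
            max("temperature").as("max"),
            avg("temperature").as("avg"),
            count("temperature").as("total_records"))

      // Weather description: share of each description plus total samples per segment
      def descriptionShare(desc: DataFrame): DataFrame = {
        val withSeg = desc
          .withColumn("ts", to_timestamp(col("datetime")))
          .withColumn("date", to_date(col("ts")))
          .withColumn("segment", hourToSegment(hour(col("ts"))))
        val totals = withSeg.groupBy("city", "date", "segment").agg(count("*").as("total_samples"))
        withSeg.groupBy("city", "date", "segment", "weather_description")
          .agg(count("*").as("samples"))
          .join(totals, Seq("city", "date", "segment"))
          .withColumn("percentage", round(col("samples") / col("total_samples") * 100, 2))
      }
    }

The daily and monthly variants would group by (city, date) and (city, month) instead of the 4-hour segment, and each resulting DataFrame can then be written to S3, e.g. fourHourStats(temps).write.parquet("s3a://<your-bucket>/agg/segments/").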

Weather Dashboard API

Build a weather dashboard that uses APIs to show data to the end user.

  1. Details of each weather attribute by city

    1. Daily

    2. Monthly

  2. Details of cities by Country

    1. Returns all cities data

      1. Daily

      2. Monthly

You must store the data in Amazon DocumentDB (example queries follow the list below).

  1. Collection to store 4-hour aggregate data

    1. DB Queries

      1. Get segments data by city or date range

  2. Collection to store daily aggregate data

    1. DB Queries

      1. Get daily data by city or date range

  3. Collection to store monthly aggregate data

    1. DB Queries

      1. Get monthly data by city or month range

  4. Country Collection

    1. For a country, find all cities' data on a given date
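A sketch of these queries in the mongo shell (Amazon DocumentDB is MongoDB-compatible); the schema below, including the collection names segment_4h, daily, monthly, and country_daily and the field names city, date, and month, is an assumption, not something given in the problem.

    // 1. Get 4-hour segment data by city or by date range (assumed collection: segment_4h)
    db.segment_4h.find({ city: "Boston" })
    db.segment_4h.find({ date: { $gte: "2015-01-01", $lte: "2015-01-31" } })

    // 2. Get daily data by city or by date range (assumed collection: daily)
    db.daily.find({ city: "Boston", date: { $gte: "2015-01-01", $lte: "2015-01-31" } })

    // 3. Get monthly data by city or by month range (assumed collection: monthly)
    db.monthly.find({ city: "Boston", month: { $gte: "2015-01", $lte: "2015-12" } })

    // 4. Country collection: all cities' data for a given country on a given date (assumed collection: country_daily)
    db.country_daily.find({ country: "Israel", date: "2015-06-15" })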

Week 1: Approach Overview and Basic Configurations

  1. Install Maven (3.6.2).

  2. Set the Maven environment variables.

    1. Check that Maven is set up properly using mvn -version.

  3. Install Java 1.8 and Scala 2.11.7.

  4. Use IntelliJ to validate or modify the source code.

  5. Run mvn clean install to build the jar file.

  6. Use README.md for detailed instructions and helper commands.

  7. Create a Kafka producer, providing the Kafka configurations (a sketch follows this list).

  8. Read the CSV files and send the data to Kafka.

  9. Run the Kafka producer 7 times, once for each file and topic.
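A minimal sketch of such a producer, assuming Scala with the standard kafka-clients API; the broker address is a placeholder, and the file path and topic are taken from the command line so the same program can be run once per file/topic pair.

    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    object CsvKafkaProducer {
      def main(args: Array[String]): Unit = {
        // args(0) = CSV file path, args(1) = Kafka topic (run once per file/topic)
        val Array(csvPath, topic) = args

        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        val source = Source.fromFile(csvPath)
        try {
          // Send every CSV line as one Kafka record
          source.getLines().foreach { line =>
            producer.send(new ProducerRecord[String, String](topic, line))
          }
        } finally {
          source.close()
          producer.close()
        }
      }
    }

Each run pairs one file with one topic, e.g. humidity.csv with a humidity topic, which is what step 9 above refers to.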

Week 2: ETL with Flume

  1. Download Flume.

  2. Upload the Flume tar file to HDFS, then download and extract it on the Hadoop node.

  3. Specify the ZooKeeper connection and the topic in the Flume configuration file used to consume the data (the folder above contains a separate Flume configuration for each topic, since multiple topics are used; a sample configuration follows this list).

  4. Run the Flume agent.
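A sketch of one such Flume agent configuration, assuming the ZooKeeper-based Kafka source shipped with older Flume releases (which matches the "zookeeper and topic" wording above); the ZooKeeper host, topic name, and HDFS path are placeholders, and one file like this would exist per topic.

    # Kafka source -> memory channel -> HDFS sink (one agent per topic)
    agent.sources  = kafka-src
    agent.channels = mem-ch
    agent.sinks    = hdfs-sink

    # Kafka source consuming via ZooKeeper; host and topic are placeholders
    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.zookeeperConnect = zk-host:2181
    agent.sources.kafka-src.topic = humidity
    agent.sources.kafka-src.groupId = flume-weather
    agent.sources.kafka-src.channels = mem-ch

    agent.channels.mem-ch.type = memory
    agent.channels.mem-ch.capacity = 10000
    agent.channels.mem-ch.transactionCapacity = 1000

    # Land the raw events in HDFS for the Spark jobs; path is a placeholder
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs:///user/weather/humidity
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.hdfs.writeFormat = Text
    agent.sinks.hdfs-sink.hdfs.rollInterval = 300
    agent.sinks.hdfs-sink.channel = mem-ch

The agent would then be started with something like: flume-ng agent --conf conf --conf-file humidity.conf --name agent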

Week 3: Data Streaming

  1. Create a sample Maven Scala project.

  2. Add the necessary Spark dependencies.

  3. Create schemas for the CSV files.

  4. Create a Spark session (see the sketch after this list).

    1. Add the S3 details.

    2. Add all variables to your environment, as they contain sensitive data.

  5. Read each CSV file and convert it into a Dataset.

  6. Create a map of city to country.

  7. Convert the date to hour, month, year, day, and day bucket using a UDF.

  8. Iterate through all metrics for each column.

  9. For each type of segment, calculate stats for the different cities: max, min, average, and total records.
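A minimal sketch of steps 3-4 and 6 (an explicit schema plus a Spark session wired for S3 from environment variables), assuming the hadoop-aws/s3a connector is on the classpath; the bucket name, object keys, and environment-variable names are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object WeatherEtl {
      def main(args: Array[String]): Unit = {
        // S3 credentials come from the environment so they never live in source control
        val awsKey    = sys.env("AWS_ACCESS_KEY_ID")       // assumed variable names
        val awsSecret = sys.env("AWS_SECRET_ACCESS_KEY")

        val spark = SparkSession.builder()
          .appName("weather-etl")
          .config("spark.hadoop.fs.s3a.access.key", awsKey)
          .config("spark.hadoop.fs.s3a.secret.key", awsSecret)
          .getOrCreate()

        // Explicit schema for city_attributes.csv instead of inferSchema
        val citySchema = StructType(Seq(
          StructField("City", StringType, nullable = false),
          StructField("Country", StringType, nullable = false),
          StructField("Latitude", DoubleType, nullable = true),
          StructField("Longitude", DoubleType, nullable = true)))

        val cities = spark.read
          .option("header", "true")
          .schema(citySchema)
          .csv("s3a://my-weather-bucket/raw/city_attributes.csv")   // assumed bucket and key

        // Map of city -> country for later enrichment (step 6)
        val cityToCountry: Map[String, String] =
          cities.collect().map(r => r.getString(0) -> r.getString(1)).toMap

        println(cityToCountry.take(5))
      }
    }

The day-bucket UDF and the per-segment statistics for step 9 follow the same pattern as the SegmentAggregates sketch shown earlier under "Insights on Historical Data".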

Week 4: Data Analysis and Visualization

  1. Save the final dataset into Amazon S3.

  2. For running the Spark jobs, refer to README.md.

  3. Create an Amazon DocumentDB cluster.

  4. Create a RedHat Linux machine on Amazon EC2 and configure the required Linux packages.

  5. Configure and connect using the MongoDB CLI.

  6. Build a cluster with MongoDB.

  7. Connect your application to the cluster.

  8. Connect Spark to the MongoDB-compatible cluster on AWS (a sketch follows this list).

  9. Check that the data lands in the HDFS path configured in the Flume conf.

  10. Finally, you will be able to view the actual reports and data.
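A sketch of steps 1 and 8 (publishing an aggregate DataFrame to S3 and to the DocumentDB cluster), assuming the MongoDB Spark connector 2.x is on the classpath; the DOCDB_URI environment variable, the weather database, and the collection names are placeholders, not part of the assignment.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    object PublishAggregates {
      // Write one aggregate DataFrame both to S3 (step 1) and to DocumentDB (steps 3-8)
      def publish(df: DataFrame, s3Path: String, collection: String): Unit = {
        // Parquet copy on S3 for downstream analysis
        df.write.mode(SaveMode.Overwrite).parquet(s3Path)

        // DocumentDB is MongoDB-compatible, so the MongoDB Spark connector is used here
        df.write
          .format("com.mongodb.spark.sql.DefaultSource")
          .mode(SaveMode.Append)
          .option("uri", sys.env("DOCDB_URI"))   // assumed env var, e.g. mongodb://user:pass@<cluster-endpoint>:27017/?ssl=true
          .option("database", "weather")         // assumed database name
          .option("collection", collection)      // e.g. segment_4h, daily, monthly
          .save()
      }
    }

    // Usage (assumed names):
    //   PublishAggregates.publish(fourHourStats, "s3a://my-weather-bucket/agg/segments/", "segment_4h")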
