Question
Use Spark features for data analysis to derive valuable insights.
Problem Statement: You are working as a Big Data consultant for an e-commerce company. Your role is to analyze sales data. The company has multiple stores across the globe, and it wants analytics on its sales transaction data. You need to provide valuable insights into sales across cities and states on a daily and weekly basis, along with various insights regarding product reviews.
Domain: E-Commerce
Analysis to be done: Exploratory analysis to determine actionable insights.
Dataset File: olist_public_dataset.csv
Content:
- Id
- order_status
- order_products_value
- order_freight_value
- order_items_qty
- order_purchase_timestamp
- order_aproved_at
- order_delivered_customer_date
- customer_city
- customer_state
- customer_zip_code_prefix
- product_name_lenght
- product_description_lenght
- product_photos_qty
- review_score
Insights on Historical Data
- Daily Insights
  - SALES
    - Total sales.
    - Total sales in each customer city.
    - Total sales in each customer state.
  - ORDERS
    - Total number of orders sold.
    - City-wise order distribution.
    - State-wise order distribution.
    - Average review score per order.
    - Average freight charges per order.
    - Average time taken to approve the orders (order approved timestamp minus order purchase timestamp).
    - Average order delivery time.
- Weekly Insights
  - SALES
    - Total sales.
    - Total sales in each customer city.
    - Total sales in each customer state.
  - ORDERS
    - Total number of orders sold.
    - City-wise order distribution.
    - State-wise order distribution.
    - Average review score per order.
    - Average freight charges per order.
    - Average time taken to approve the orders (order approved timestamp minus order purchase timestamp).
    - Average order delivery time.
  - Total freight charges.
  - Freight charges distribution in each customer city.
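As a sketch of how the daily metrics above might be computed with Spark SQL (assuming the column names listed under Content; the local master and file path are illustrative, not part of the original project):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyInsights {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("olist-daily-insights")
      .master("local[*]") // illustrative; use your cluster master in practice
      .getOrCreate()

    // Read the dataset; column names follow olist_public_dataset.csv
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("olist_public_dataset.csv")
      .withColumn("order_date", to_date(col("order_purchase_timestamp")))

    // Daily totals and averages per customer city
    val dailyCitySales = orders
      .groupBy(col("order_date"), col("customer_city"))
      .agg(
        sum("order_products_value").as("total_sales"),
        count("*").as("total_orders"),
        avg("review_score").as("avg_review_score"),
        avg("order_freight_value").as("avg_freight")
      )

    dailyCitySales.orderBy(col("order_date")).show(20, truncate = false)
    spark.stop()
  }
}
```

Grouping by `customer_state` instead of `customer_city`, or by a week-truncated date, yields the state-wise and weekly variants of the same metrics.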
Approach
Tasks to perform:
Week 1: Approach Overview and Basic Configurations
- Install Maven (3.6.2).
- Set the Maven environment variables.
  a) Check whether Maven is set up properly using mvn -version.
- Install Java 1.8 and Scala 2.11.7.
- Use IntelliJ to validate or modify the source code.
- Run mvn clean install to build the jar file.
- See README.md for detailed instructions and helper commands.
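On Linux, the environment setup above might look like the following (the installation paths are illustrative; adjust them to wherever you unpacked Maven and the JDK):

```shell
# Illustrative paths; adjust to your installation locations
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export MAVEN_HOME=/opt/apache-maven-3.6.2
export PATH="$MAVEN_HOME/bin:$JAVA_HOME/bin:$PATH"

# Verify the toolchain
mvn -version       # should report Maven 3.6.2 running on Java 1.8
java -version
scala -version     # should report Scala 2.11.7

# Build the project jar
mvn clean install
```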
Week 2: Data Ingestion
- Load the entire dataset from the CSV file into Hive.
- Copy the data from Hive into HDFS.
- Verify the data at the HDFS path.
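A minimal sketch of the ingestion steps, assuming a comma-delimited file with a header row; the database and table names here are hypothetical, and the warehouse path depends on your Hive configuration:

```shell
# Create a Hive table matching the CSV columns and load the file
# (database/table names are illustrative)
hive -e "
CREATE DATABASE IF NOT EXISTS ecommerce;
CREATE TABLE IF NOT EXISTS ecommerce.olist_orders (
  id STRING,
  order_status STRING,
  order_products_value DOUBLE,
  order_freight_value DOUBLE,
  order_items_qty INT,
  order_purchase_timestamp STRING,
  order_aproved_at STRING,
  order_delivered_customer_date STRING,
  customer_city STRING,
  customer_state STRING,
  customer_zip_code_prefix STRING,
  product_name_lenght INT,
  product_description_lenght INT,
  product_photos_qty INT,
  review_score INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');
LOAD DATA LOCAL INPATH 'olist_public_dataset.csv' INTO TABLE ecommerce.olist_orders;
"

# Hive stores managed-table data in HDFS; verify it at the warehouse path
hdfs dfs -ls /user/hive/warehouse/ecommerce.db/olist_orders
```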
Week 3: Data Streaming
- Create a sample Maven Scala project.
- Add the necessary Spark dependencies.
- Create the schema for the CSV file.
- Create a SparkSession.
  a) Add the S3 details.
  b) Supply all sensitive values through environment variables.
- Read the CSV file and convert it into a Dataset.
- Create a map of city and country.
- Convert the date into hour, month, year, daily, and day buckets using a UDF.
- Iterate through all metrics for each column.
- For each type of segment, calculate statistics for the different cities. The statistics include max, min, average, and total record count.
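The Week 3 steps above might be sketched as follows. The S3 bucket name, environment-variable names, and day-bucket boundaries are assumptions for illustration, not part of the original project:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}
import org.apache.spark.sql.types._

object StreamingJob {
  def main(args: Array[String]): Unit = {
    // Explicit schema for the CSV (names follow olist_public_dataset.csv)
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("order_status", StringType),
      StructField("order_products_value", DoubleType),
      StructField("order_freight_value", DoubleType),
      StructField("order_items_qty", IntegerType),
      StructField("order_purchase_timestamp", TimestampType),
      StructField("order_aproved_at", TimestampType),
      StructField("order_delivered_customer_date", TimestampType),
      StructField("customer_city", StringType),
      StructField("customer_state", StringType),
      StructField("customer_zip_code_prefix", StringType),
      StructField("product_name_lenght", IntegerType),
      StructField("product_description_lenght", IntegerType),
      StructField("product_photos_qty", IntegerType),
      StructField("review_score", IntegerType)
    ))

    // SparkSession with S3 credentials read from the environment,
    // so that sensitive values never appear in source code
    val spark = SparkSession.builder()
      .appName("olist-streaming")
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    val orders = spark.read.option("header", "true").schema(schema)
      .csv("s3a://my-bucket/olist_public_dataset.csv") // bucket name is illustrative

    // Day-bucket UDF: map the purchase hour to a coarse time-of-day segment
    val dayBucket = F.udf { (hour: Int) =>
      if (hour < 6) "night"
      else if (hour < 12) "morning"
      else if (hour < 18) "afternoon"
      else "evening"
    }

    val bucketed = orders
      .withColumn("hour",  F.hour(F.col("order_purchase_timestamp")))
      .withColumn("month", F.month(F.col("order_purchase_timestamp")))
      .withColumn("year",  F.year(F.col("order_purchase_timestamp")))
      .withColumn("day_bucket", dayBucket(F.col("hour")))

    // Per-city stats for each segment: max, min, average, total records
    val cityStats = bucketed.groupBy("day_bucket", "customer_city").agg(
      F.max("order_products_value").as("max_sales"),
      F.min("order_products_value").as("min_sales"),
      F.avg("order_products_value").as("avg_sales"),
      F.count("*").as("total_records")
    )
    cityStats.show(20, truncate = false)
    spark.stop()
  }
}
```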
Week 4: Data Analysis and Visualization
- Write the results into HDFS.
- Save the final dataset into Amazon S3.
- Create an Amazon DocumentDB cluster.
- Save the insights in DocumentDB and provide APIs to view the aggregated data.
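The persistence steps above might be sketched like this. The HDFS path and S3 bucket are illustrative, and since DocumentDB is MongoDB-compatible, the sketch assumes the MongoDB Spark connector is on the classpath with the connection URI supplied via an environment variable:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Assuming `cityStats` is the aggregated DataFrame from the previous step
def persistResults(cityStats: DataFrame): Unit = {
  // 1. Write the results into HDFS (path is illustrative)
  cityStats.write.mode(SaveMode.Overwrite)
    .parquet("hdfs:///user/hadoop/olist/insights")

  // 2. Save the final dataset into Amazon S3 (bucket name is illustrative)
  cityStats.write.mode(SaveMode.Overwrite)
    .parquet("s3a://my-bucket/olist/insights")

  // 3. DocumentDB accepts MongoDB clients, so the MongoDB Spark connector
  //    can write the insights; the sensitive URI comes from the environment
  cityStats.write.mode(SaveMode.Overwrite)
    .format("mongo")
    .option("uri", sys.env("DOCDB_URI"))
    .save()
}
```

A thin REST layer (for example, a small Akka HTTP or Spring Boot service querying DocumentDB) can then expose the stored aggregates as the APIs the task calls for.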