Question
Use Spark features for data analysis to derive valuable insights.
Problem Statement: You are working as a Big Data consultant for an e-commerce company. Your role is to analyze sales data. The company has multiple stores across the globe, and it wants analytics on its sales transaction data. You need to provide valuable insights into sales across cities and states on a daily and weekly basis, along with various insights regarding product reviews.
Domain: E-Commerce
Analysis to be done: Exploratory analysis to determine actionable insights.
Dataset File: olist_public_dataset.csv
Content:
- Id
- order_status
- order_products_value
- order_freight_value
- order_items_qty
- order_purchase_timestamp
- order_aproved_at
- order_delivered_customer_date
- customer_city
- customer_state
- customer_zip_code_prefix
- product_name_lenght
- product_description_lenght
- product_photos_qty
- review_score
Insights on Historical Data
- Daily Insights
  - SALES
    - Total sales.
    - Total sales in each customer city.
    - Total sales in each customer state.
  - ORDERS
    - Total number of orders sold.
    - City-wise order distribution.
    - State-wise order distribution.
    - Average review score per order.
    - Average freight charges per order.
    - Average time taken to approve the orders (order approved timestamp minus order purchase timestamp).
    - Average order delivery time.
- Weekly Insights
  - SALES
    - Total sales.
    - Total sales in each customer city.
    - Total sales in each customer state.
  - ORDERS
    - Total number of orders sold.
    - City-wise order distribution.
    - State-wise order distribution.
    - Average review score per order.
    - Average freight charges per order.
    - Average time taken to approve the orders (order approved timestamp minus order purchase timestamp).
    - Average order delivery time.
  - Total freight charges.
  - Freight charges distribution in each customer city.
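As a sketch of how the daily metrics above might be computed with Spark SQL (assuming the column names listed under Content; the local master and file path are illustrative, not part of the original project):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyInsights {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("olist-daily-insights")
      .master("local[*]") // illustrative; use your cluster master in practice
      .getOrCreate()

    // Read the dataset; column names follow olist_public_dataset.csv
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("olist_public_dataset.csv")
      .withColumn("order_date", to_date(col("order_purchase_timestamp")))

    // Daily totals and averages per customer city
    val dailyCitySales = orders
      .groupBy(col("order_date"), col("customer_city"))
      .agg(
        sum("order_products_value").as("total_sales"),
        count("*").as("total_orders"),
        avg("review_score").as("avg_review_score"),
        avg("order_freight_value").as("avg_freight")
      )

    dailyCitySales.orderBy(col("order_date")).show(20, truncate = false)
    spark.stop()
  }
}
```

Grouping by `customer_state` instead of `customer_city`, or by a week-truncated date, yields the state-wise and weekly variants of the same metrics.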
Approach
Tasks to perform:
Week 1: Approach Overview and Basic Configurations
- Install Maven (3.6.2).
- Set the Maven environment variables.
  a) Check whether Maven is set up properly using mvn -version.
- Install Java 1.8 and Scala 2.11.7.
- Use IntelliJ to validate or modify the source code.
- Run mvn clean install to build the jar file.
- See README.md for detailed instructions and helper commands.
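On Linux, the environment setup above might look like the following (the installation paths are illustrative; adjust them to wherever you unpacked Maven and the JDK):

```shell
# Illustrative paths; adjust to your installation locations
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export MAVEN_HOME=/opt/apache-maven-3.6.2
export PATH="$MAVEN_HOME/bin:$JAVA_HOME/bin:$PATH"

# Verify the toolchain
mvn -version       # should report Maven 3.6.2 running on Java 1.8
java -version
scala -version     # should report Scala 2.11.7

# Build the project jar
mvn clean install
```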
Week 2: Data Ingestion
- Load the entire dataset from the CSV file into Hive.
- Copy the data from Hive into HDFS.
- Verify the data at the HDFS path.
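A minimal sketch of the ingestion steps, assuming a comma-delimited file with a header row; the database and table names here are hypothetical, and the warehouse path depends on your Hive configuration:

```shell
# Create a Hive table matching the CSV columns and load the file
# (database/table names are illustrative)
hive -e "
CREATE DATABASE IF NOT EXISTS ecommerce;
CREATE TABLE IF NOT EXISTS ecommerce.olist_orders (
  id STRING,
  order_status STRING,
  order_products_value DOUBLE,
  order_freight_value DOUBLE,
  order_items_qty INT,
  order_purchase_timestamp STRING,
  order_aproved_at STRING,
  order_delivered_customer_date STRING,
  customer_city STRING,
  customer_state STRING,
  customer_zip_code_prefix STRING,
  product_name_lenght INT,
  product_description_lenght INT,
  product_photos_qty INT,
  review_score INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');
LOAD DATA LOCAL INPATH 'olist_public_dataset.csv' INTO TABLE ecommerce.olist_orders;
"

# Hive stores managed-table data in HDFS; verify it at the warehouse path
hdfs dfs -ls /user/hive/warehouse/ecommerce.db/olist_orders
```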
Week 3: Data Streaming
- Create a sample Maven Scala project.
- Add the necessary Spark dependencies.
- Create the schema for the CSV file.
- Create a SparkSession.
  a) Add the S3 details.
  b) Supply all sensitive values through environment variables.
- Read the CSV file and convert it into a Dataset.
- Create a map of city and country.
- Convert the date into hour, month, year, daily, and day buckets using a UDF.
- Iterate through all metrics for each column.
- For each type of segment, calculate statistics for the different cities. The statistics include max, min, average, and total record count.
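The Week 3 steps above might be sketched as follows. The S3 bucket name, environment-variable names, and day-bucket boundaries are assumptions for illustration, not part of the original project:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}
import org.apache.spark.sql.types._

object StreamingJob {
  def main(args: Array[String]): Unit = {
    // Explicit schema for the CSV (names follow olist_public_dataset.csv)
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("order_status", StringType),
      StructField("order_products_value", DoubleType),
      StructField("order_freight_value", DoubleType),
      StructField("order_items_qty", IntegerType),
      StructField("order_purchase_timestamp", TimestampType),
      StructField("order_aproved_at", TimestampType),
      StructField("order_delivered_customer_date", TimestampType),
      StructField("customer_city", StringType),
      StructField("customer_state", StringType),
      StructField("customer_zip_code_prefix", StringType),
      StructField("product_name_lenght", IntegerType),
      StructField("product_description_lenght", IntegerType),
      StructField("product_photos_qty", IntegerType),
      StructField("review_score", IntegerType)
    ))

    // SparkSession with S3 credentials read from the environment,
    // so that sensitive values never appear in source code
    val spark = SparkSession.builder()
      .appName("olist-streaming")
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    val orders = spark.read.option("header", "true").schema(schema)
      .csv("s3a://my-bucket/olist_public_dataset.csv") // bucket name is illustrative

    // Day-bucket UDF: map the purchase hour to a coarse time-of-day segment
    val dayBucket = F.udf { (hour: Int) =>
      if (hour < 6) "night"
      else if (hour < 12) "morning"
      else if (hour < 18) "afternoon"
      else "evening"
    }

    val bucketed = orders
      .withColumn("hour",  F.hour(F.col("order_purchase_timestamp")))
      .withColumn("month", F.month(F.col("order_purchase_timestamp")))
      .withColumn("year",  F.year(F.col("order_purchase_timestamp")))
      .withColumn("day_bucket", dayBucket(F.col("hour")))

    // Per-city stats for each segment: max, min, average, total records
    val cityStats = bucketed.groupBy("day_bucket", "customer_city").agg(
      F.max("order_products_value").as("max_sales"),
      F.min("order_products_value").as("min_sales"),
      F.avg("order_products_value").as("avg_sales"),
      F.count("*").as("total_records")
    )
    cityStats.show(20, truncate = false)
    spark.stop()
  }
}
```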
Week 4: Data Analysis and Visualization
- Write the results into HDFS.
- Save the final dataset into Amazon S3.
- Create an Amazon DocumentDB cluster.
- Save the insights in DocumentDB and provide APIs to view the aggregated data.
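The persistence steps above might be sketched like this. The HDFS path and S3 bucket are illustrative, and since DocumentDB is MongoDB-compatible, the sketch assumes the MongoDB Spark connector is on the classpath with the connection URI supplied via an environment variable:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Assuming `cityStats` is the aggregated DataFrame from the previous step
def persistResults(cityStats: DataFrame): Unit = {
  // 1. Write the results into HDFS (path is illustrative)
  cityStats.write.mode(SaveMode.Overwrite)
    .parquet("hdfs:///user/hadoop/olist/insights")

  // 2. Save the final dataset into Amazon S3 (bucket name is illustrative)
  cityStats.write.mode(SaveMode.Overwrite)
    .parquet("s3a://my-bucket/olist/insights")

  // 3. DocumentDB accepts MongoDB clients, so the MongoDB Spark connector
  //    can write the insights; the sensitive URI comes from the environment
  cityStats.write.mode(SaveMode.Overwrite)
    .format("mongo")
    .option("uri", sys.env("DOCDB_URI"))
    .save()
}
```

A thin REST layer (for example, a small Akka HTTP or Spring Boot service querying DocumentDB) can then expose the stored aggregates as the APIs the task calls for.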