
Question

Use Spark features for data analysis to derive valuable insights.

Problem Statement: You are working as a Big Data consultant for an e-commerce company, and your role is to analyze its sales data. The company has multiple stores across the globe and wants analytics on its sales transaction data. You need to provide insights into sales across cities and states on a daily and weekly basis, along with other insights from the product reviews.

Domain: E-Commerce

Analysis to be done: Exploratory analysis, to determine actionable insights.

Dataset File: olist_public_dataset.csv

Content:

  1. Id

  2. order_status

  3. order_products_value

  4. order_freight_value

  5. order_items_qty

  6. order_purchase_timestamp

  7. order_aproved_at

  8. order_delivered_customer_date

  9. customer_city

  10. customer_state

  11. customer_zip_code_prefix

  12. product_name_lenght

  13. product_description_lenght

  14. product_photos_qty

  15. review_score
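The column list above maps naturally onto an explicit Spark schema. The following is a minimal sketch, not the official solution: the object name OlistSchema is hypothetical, and every data type is an assumption that should be checked against the actual CSV (for example, counts stored as "3.0" would need DoubleType). Column names are copied verbatim, including the dataset's own spellings such as order_aproved_at and product_name_lenght.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper object; the types below are assumptions, not taken from the dataset docs.
object OlistSchema {
  val schema: StructType = StructType(Seq(
    StructField("Id", StringType, nullable = true),
    StructField("order_status", StringType, nullable = true),
    StructField("order_products_value", DoubleType, nullable = true),
    StructField("order_freight_value", DoubleType, nullable = true),
    StructField("order_items_qty", IntegerType, nullable = true),
    // Timestamps are kept as strings here and cast later, so an unexpected
    // format does not silently turn the whole column into nulls at read time.
    StructField("order_purchase_timestamp", StringType, nullable = true),
    StructField("order_aproved_at", StringType, nullable = true),
    StructField("order_delivered_customer_date", StringType, nullable = true),
    StructField("customer_city", StringType, nullable = true),
    StructField("customer_state", StringType, nullable = true),
    // Kept as a string so that leading zeros in the prefix are preserved.
    StructField("customer_zip_code_prefix", StringType, nullable = true),
    StructField("product_name_lenght", IntegerType, nullable = true),
    StructField("product_description_lenght", IntegerType, nullable = true),
    StructField("product_photos_qty", IntegerType, nullable = true),
    StructField("review_score", IntegerType, nullable = true)
  ))
}
```

With a SparkSession named spark in scope, the file could then be read with spark.read.option("header", "true").schema(OlistSchema.schema).csv(path), where path points at olist_public_dataset.csv.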

Insights on Historical Data

  1. Daily Insights

    1. SALES

      • Total sales.

      • Total Sales in each Customer City.

      • Total sales in each Customer State.

    2. ORDERS

      • Total number of orders sold.

      • City wise order distribution.

      • State wise order distribution.

      • Average Review score per Order.

      • Average Freight charges per order.

      • Average time taken to approve the orders (Order Approved − Order Purchased).

      • Average order delivery time.

  2. Weekly Insights

    1. SALES

      • Total sales.

      • Total Sales in each Customer City.

      • Total sales in each Customer State.

    2. ORDERS

      • Total number of orders sold.

      • City wise order distribution.

      • State wise order distribution.

      • Average Review score per Order.

      • Average Freight charges per order.

      • Average time taken to approve the orders (Order Approved − Order Purchased).

      • Average order delivery time.

    3. Total Freight charges.

    4. Freight charges distribution in each Customer City
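The daily and weekly insights listed above differ only in the time bucket and the segment (city or state) used for grouping, so one parameterized aggregation can cover them. The sketch below is one possible approach under stated assumptions: SalesInsights, purchase_day and purchase_week_start are hypothetical names, "total sales" is assumed to mean the sum of order_products_value, a week is assumed to start on Monday, and the timestamp strings are assumed to be castable to Spark timestamps.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Hypothetical helper; metric definitions are assumptions (see the lead-in).
object SalesInsights {
  // Elapsed seconds between two timestamp columns.
  private def seconds(from: String, to: String): Column =
    col(to).cast("timestamp").cast("long") - col(from).cast("timestamp").cast("long")

  // Adds a calendar-day bucket and the Monday that starts the purchase week.
  def withBuckets(orders: DataFrame): DataFrame = {
    val day = to_date(col("order_purchase_timestamp").cast("timestamp"))
    orders
      .withColumn("purchase_day", day)
      // next_day returns the first Monday strictly after the date,
      // so subtracting 7 days yields the Monday on or before it.
      .withColumn("purchase_week_start", date_sub(next_day(day, "Mon"), 7))
  }

  // bucket: "purchase_day" or "purchase_week_start";
  // segments: zero or more of "customer_city", "customer_state".
  def insights(orders: DataFrame, bucket: String, segments: String*): DataFrame = {
    val keys = (bucket +: segments).map(col)
    withBuckets(orders)
      .groupBy(keys: _*)
      .agg(
        sum("order_products_value").as("total_sales"),
        count(lit(1)).as("order_count"),
        avg("review_score").as("avg_review_score"),
        avg("order_freight_value").as("avg_freight"),
        sum("order_freight_value").as("total_freight"),
        avg(seconds("order_purchase_timestamp", "order_aproved_at")).as("avg_approval_seconds"),
        avg(seconds("order_purchase_timestamp", "order_delivered_customer_date")).as("avg_delivery_seconds")
      )
  }
}
```

For example, SalesInsights.insights(orders, "purchase_day", "customer_city") would give the daily per-city figures, and SalesInsights.insights(orders, "purchase_week_start") the weekly totals across all locations. Rows with a missing approval or delivery date contribute null to the elapsed-time averages and are ignored by avg.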

Approach

Tasks to perform:

Week 1: Approach Overview and Basic Configurations

  1. Install Maven (3.6.2).

  2. Set the Maven environment variable.

    a) Check that Maven is set up properly using mvn -version.

  3. Install Java 1.8 and Scala 2.11.7.

  4. Use IntelliJ to validate or modify the source code.

  5. Run mvn clean install to build the JAR file.

  6. Use README.md for detailed instructions and helper commands.

Week 2: Data Ingestion

  1. Upload the entire data into Hive from CSV

  2. Copy the data from Hive into HDFS

  3. Check the data in HDFS path

Week 3: Data Streaming

  1. Create sample Maven Scala Project

  2. Add necessary spark dependencies

  3. Create Schema of CSV files

  4. Create Spark Session

    a) Add S3 details

    b) Add all variables to your environment as they have sensitive data

  5. Read CSV file and convert into dataset

  6. Create Map of City and Country

  7. Convert Date to Hour, Month, Year, Daily, and Day Bucket using UDF

  8. Iterate through all metrics for each column

  9. For each type of segment, calculate stats of different cities. Stats include max, min, average, and total records
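The date-conversion step above can be kept as a plain Scala function so that it is easy to test before being registered as a UDF. This is a sketch under assumptions: DateBuckets and Buckets are hypothetical names, the timestamp layout is assumed to be yyyy-MM-dd HH:mm:ss (any fractional seconds are dropped), and "Day Bucket" is interpreted here as the part of the day, which the problem statement does not define.

```scala
import java.time.{DayOfWeek, LocalDateTime}
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalAdjusters

// Hypothetical helper; bucket definitions are assumptions (see the lead-in).
object DateBuckets {
  // Assumed timestamp layout; adjust the pattern if the CSV differs.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  final case class Buckets(
    hour: Int,
    month: Int,
    year: Int,
    day: String,       // calendar day, ISO formatted
    weekStart: String, // Monday on or before the day, ISO formatted
    dayBucket: String  // part of the day
  )

  // Assumed part-of-day boundaries: 0-5 night, 6-11 morning, 12-17 afternoon, 18-23 evening.
  def dayBucket(hour: Int): String =
    if (hour < 6) "night"
    else if (hour < 12) "morning"
    else if (hour < 18) "afternoon"
    else "evening"

  def bucketsOf(ts: String): Buckets = {
    // Keep only "yyyy-MM-dd HH:mm:ss"; anything after the seconds is ignored.
    val t = LocalDateTime.parse(ts.take(19), fmt)
    val date = t.toLocalDate
    val monday = date.`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
    Buckets(t.getHour, t.getMonthValue, t.getYear, date.toString, monday.toString, dayBucket(t.getHour))
  }
}
```

For instance, DateBuckets.bucketsOf("2017-01-31 17:19:01") yields hour 17, day 2017-01-31, week start 2017-01-30 and the bucket "afternoon". In a Spark job the function could be wrapped with org.apache.spark.sql.functions.udf and applied to order_purchase_timestamp; null or malformed timestamps would need handling first, since this sketch throws on input it cannot parse.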

Week 4: Data Analysis and Visualization

  1. Write the results into the HDFS

  2. Save final dataset into Amazon S3

  3. Create Amazon Document DB Cluster

  4. Save insights in Document DB and provide APIs to view aggregate data
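Steps 1 and 2 above, together with the earlier instruction to keep sensitive S3 details in environment variables, might look roughly like the sketch below. The object ResultWriter and its method names are hypothetical, Parquet is an assumed output format, and the s3a settings assume the hadoop-aws module is on the classpath. The environment variable names are the conventional AWS ones, not something the assignment specifies.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper; output format and variable names are assumptions.
object ResultWriter {
  // Builds the s3a credential settings from an environment map.
  // Returns an empty map when either variable is missing, so nothing is hard-coded.
  def s3aOptions(env: Map[String, String]): Map[String, String] = {
    val pair = for {
      key    <- env.get("AWS_ACCESS_KEY_ID")
      secret <- env.get("AWS_SECRET_ACCESS_KEY")
    } yield Map(
      "spark.hadoop.fs.s3a.access.key" -> key,
      "spark.hadoop.fs.s3a.secret.key" -> secret
    )
    pair.getOrElse(Map.empty[String, String])
  }

  // Creates the session; the master is expected to come from spark-submit.
  def buildSession(appName: String): SparkSession =
    s3aOptions(sys.env)
      .foldLeft(SparkSession.builder().appName(appName)) {
        case (builder, (k, v)) => builder.config(k, v)
      }
      .getOrCreate()

  // Writes the same result to HDFS and to S3, replacing any previous output.
  def save(result: DataFrame, hdfsPath: String, s3Path: String): Unit = {
    result.write.mode(SaveMode.Overwrite).parquet(hdfsPath)
    result.write.mode(SaveMode.Overwrite).parquet(s3Path)
  }
}
```

A call such as ResultWriter.save(insights, hdfsPath, s3Path) would take an hdfs:// location and an s3a:// location chosen for the project. Loading the results into Document DB is not covered by this sketch.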
