BIG DATA SYSTEMS ASSIGNMENT 2
Title:
Selection of a suitable Hadoop ecosystem platform for sales data analytics
Overview & background:
An e-commerce company in Europe is analyzing the online retail sales data of [...] to
formulate the sales strategy for [...]. The name of the sales data file is Assignment
BDSDATASETonlineretaildat.csv. This file contains a header record and data
records.
The analytics team consists of developers with expertise in different Hadoop
frameworks/platforms such as MapReduce, Pig Latin, Hive, HBase and Spark. The technical
manager is unsure which framework is appropriate for the analytics
project. The manager's only concern is that the time taken to execute the
analytics queries should be minimal. In this assignment, you need to implement
typical analytics queries on any two of the platforms, compare their performance,
and recommend the framework/platform that performs better.
The scripts/code/queries you develop on both platforms, the timing values, and
the query results are to be submitted for each of the queries listed in the section
"Analytic queries for benchmarking".
You also need to submit the UUIDs of the computers on which you ran the
performance tests.
Input: CSV data with a flat schema, multiple records, and multiple features.
Description:
STORAGE:
The data file should be copied to the local file system of any node in your
Hadoop cluster. This data file is to be moved to HDFS of the Hadoop cluster
by configuring and running a suitable Flume agent. The block size of the file
should be selected for optimum performance. A suitable value for the
replication factor of the file should be selected to ensure reliable storage of
the data file.
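To make the block-size choice concrete, the sketch below counts how many HDFS blocks a file occupies for a few candidate block sizes, and how much raw disk space a given replication factor consumes. In Hadoop MapReduce the number of input splits defaults to roughly one per block, so the block count is a first estimate of the number of map tasks. The 45 MiB file size is a made-up placeholder, not the size of the assignment's data file, and the function names are illustrative only.

```python
MIB = 1024 * 1024


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Integer ceiling division: the last block may be only partly filled.
    return -(-file_size_bytes // block_size_bytes)


def raw_storage_bytes(file_size_bytes: int, replication_factor: int) -> int:
    """Total disk space used across the cluster once every block is replicated."""
    if replication_factor < 1:
        raise ValueError("replication factor must be >= 1")
    return file_size_bytes * replication_factor


if __name__ == "__main__":
    example_size = 45 * MIB  # ASSUMPTION: placeholder size, not the real file
    for block_mib in (32, 64, 128):
        blocks = hdfs_block_count(example_size, block_mib * MIB)
        print(f"block size {block_mib} MiB -> {blocks} block(s)")
```

A small file that fits in one block is processed by a single task, so a smaller block size may expose more parallelism, at the cost of more per-task overhead; which effect wins is something to measure rather than assume.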
METADATA
The data consists of RecordNo, InvoiceNo, StockCode, Description, Quantity,
InvoiceDate, Price, CustomerID, and Country. Some of the fields in the data
may be blank. If required, you may remove the first (header) record
containing the schema definition. Alternatively, this record may be skipped during reading
and/or analysis. No other modifications to the contents of the file are allowed.
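The header-skipping and blank-field rules above can be sketched in plain Python as follows. The column names come from the metadata description; the two sample rows, the function name, and the choice to drop rows with the wrong number of fields are assumptions made for illustration, not part of the assignment.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Optional

COLUMNS = [
    "RecordNo", "InvoiceNo", "StockCode", "Description", "Quantity",
    "InvoiceDate", "Price", "CustomerID", "Country",
]

# ASSUMPTION: invented sample rows, only to exercise the parser.
SAMPLE_CSV = (
    "RecordNo,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID,Country\n"
    "1,A001,S10,\"Mug, blue\",2,2000-01-05 10:00,3.50,C1,France\n"
    "2,A002,S11,,1,2000-01-06 11:30,4.00,,Germany\n"
)


def read_records(lines: Iterable[str], has_header: bool = True) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per data record, mapping blank fields to None."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header record instead of editing the file
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # ASSUMPTION: rows with a wrong field count are ignored
        yield {name: (value.strip() or None) for name, value in zip(COLUMNS, row)}


if __name__ == "__main__":
    for record in read_records(io.StringIO(SAMPLE_CSV)):
        print(record)
```

Skipping the header at read time leaves the file in HDFS untouched, which keeps within the rule that no other modifications to the file are allowed.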
ANALYTIC QUERIES FOR BENCHMARKING:
1. Total revenue (aggregation of Price) received in the year [...].
2. List of unique items sold (with the same StockCode) and their total sales
volume (aggregation of Quantity) in the year [...], sorted in ascending
order of StockCode.
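A small single-machine reference version of the two queries can help check that both frameworks return the same results. The sketch below follows the wording above literally: revenue is the sum of Price, and sales volume is the sum of Quantity per StockCode. The year used (2000), the sample records, and the assumption that InvoiceDate starts with a four-digit year are all placeholders, since the source does not state the year or the date format.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, List, Optional, Tuple

Record = Dict[str, Optional[str]]

# ASSUMPTION: invented records; blank fields are represented as None.
SAMPLE_RECORDS: List[Record] = [
    {"StockCode": "B2", "Quantity": "3", "InvoiceDate": "2000-02-01 09:00", "Price": "1.50"},
    {"StockCode": "A1", "Quantity": "2", "InvoiceDate": "2000-03-01 09:00", "Price": "2.25"},
    {"StockCode": "B2", "Quantity": "4", "InvoiceDate": "2000-05-01 14:10", "Price": "1.50"},
    {"StockCode": "A1", "Quantity": "9", "InvoiceDate": "2001-01-01 08:00", "Price": "9.99"},
    {"StockCode": "C3", "Quantity": None, "InvoiceDate": "2000-01-01 08:00", "Price": None},
]


def year_of(invoice_date: str) -> int:
    # ASSUMPTION: dates look like "YYYY-MM-DD ..."; adapt to the real format.
    return int(invoice_date[:4])


def total_revenue(records: Iterable[Record], year: int) -> Decimal:
    """Query 1: sum of Price over all records of the given year."""
    total = Decimal("0")
    for rec in records:
        if rec["InvoiceDate"] and rec["Price"] and year_of(rec["InvoiceDate"]) == year:
            total += Decimal(rec["Price"])
    return total


def sales_volume_by_stock_code(records: Iterable[Record], year: int) -> List[Tuple[str, int]]:
    """Query 2: sum of Quantity per StockCode, ascending by StockCode."""
    volume: Dict[str, int] = defaultdict(int)
    for rec in records:
        if not (rec["InvoiceDate"] and rec["StockCode"] and rec["Quantity"]):
            continue  # records with blank key fields are skipped
        if year_of(rec["InvoiceDate"]) == year:
            volume[rec["StockCode"]] += int(rec["Quantity"])
    return sorted(volume.items())


if __name__ == "__main__":
    print(total_revenue(SAMPLE_RECORDS, 2000))
    print(sales_volume_by_stock_code(SAMPLE_RECORDS, 2000))
```

Using Decimal rather than float avoids rounding drift when many prices are summed, which makes the reference totals easier to compare against framework output.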
FRAMEWORKS / PLATFORMS TO BE COMPARED:
1. Hadoop group
a) Hadoop MapReduce
b) Pig Latin Scripts
c) Apache Hive
d) Apache HBase
2. Spark group
a) Spark MapReduce
b) Spark DataFrames
c) Spark Datasets
d) Spark SQL
GUIDELINES FOR PERFORMANCE COMPARISON:
You need to select one framework from the Hadoop group and a second
framework from the Spark group listed in the section above. You are NOT allowed to
select two frameworks from the same group. In this assignment, you need to
compare query performance between the two frameworks you have
selected. The two queries to be used for performance evaluation are given in the section
"Analytic queries for benchmarking".
If you are using Linux, you can use the time command to time your command. For
Windows, you need to find a method to determine the time taken to
execute each of the queries. Sometimes the time taken to execute a
query can be less than a second, and you may not be able to measure time in the
millisecond range. You have the following options to overcome this problem:
a) Repeat the query multiple times (say [...] to [...] times) and determine the total
time taken. Then divide the total time by the number of repetitions to obtain the
time taken for an individual query.
b) Almost all the platforms mentioned above allow you to specify a folder in
HDFS as input. You may place multiple copies of the same data file in
the input folder (with different file names, of course) and execute the
query. Then find the query time by dividing the total time by the
number of copies of the file.
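Both options reduce to dividing a measured total by a count. A minimal sketch of that arithmetic, using Python's monotonic high-resolution timer, is shown below; the function names are illustrative, and `run_query` stands for whatever callable launches the query on the chosen platform.

```python
import time
from typing import Callable


def average_run_time(run_query: Callable[[], object], repetitions: int) -> float:
    """Option a: run the query several times and return seconds per run."""
    if repetitions < 1:
        raise ValueError("repetitions must be >= 1")
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(repetitions):
        run_query()
    total_seconds = time.perf_counter() - start
    return total_seconds / repetitions


def time_per_copy(total_seconds: float, copies: int) -> float:
    """Option b: one run over several copies of the file, divided by the copy count."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return total_seconds / copies


if __name__ == "__main__":
    # Stand-in workload; replace with a call that launches the real query.
    seconds = average_run_time(lambda: sum(range(100_000)), repetitions=20)
    print(f"average time per run: {seconds:.6f} s")
```

If the query is launched as an external process from `run_query`, the measured time also includes process and JVM start-up, so the same launch method should be used for both frameworks to keep the comparison fair.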
CONDITIONS
1. Since this is a group assignment involving a comparison of performance on
different frameworks, one student should work on the first platform and the other
students should work on the second platform. The group leader needs to
consolidate the results and submit the assignment.
2. You should use Apache Flume to move data from the local file system to
HDFS. If data is moved with the Hadoop put command, marks will be
reduced.
3. The Hadoop cluster should be configured on Linux/Windows systems.
If only one system is available, you need to configure the cluster in
pseudo-distributed mode.
4. The replication factor for the HDFS files should be set to the number of
nodes in the cluster.
5. Focus on performance tuning of the framework by selecting proper
configuration parameters rather than on the accuracy of the query results.
SUBMISSION REQUIREMENTS
Your submission