Question

Posted on Sep 26, 2024

CSV data with flat schema with multiple records and features

| RecordNo | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 45230 | 493411 | 21539 | RETRO SPOTS BUTTER DISH | -1 | 01-04-2010 09:43 | 4.25 | 14590 | United Kingdom |
| 45231 | 493412 | TEST001 | This is a test product. | 5 | 01-04-2010 09:53 | 4.5 | 12346 | United Kingdom |
| 45232 | 493413 | 21724 | PANDA AND BUNNIES STICKER SHEET | 1 | 01-04-2010 09:54 | 0.85 | | United Kingdom |
| 45233 | 493413 | 84578 | ELEPHANT TOY WITH BLUE T-SHIRT | 1 | 01-04-2010 09:54 | 3.75 | | United Kingdom |
| 45234 | 493413 | 21723 | ALPHABET HEARTS STICKER SHEET | 1 | 01-04-2010 09:54 | 0.85 | | United Kingdom |
| 45235 | 493414 | 21844 | RETRO SPOT MUG | 36 | 01-04-2010 10:28 | 2.55 | 14590 | United Kingdom |
| 45236 | 493414 | 21533 | RETRO SPOT LARGE MILK JUG | 12 | 01-04-2010 10:28 | 4.25 | 14590 | United Kingdom |

Input: CSV data with flat schema with multiple records and features.

Description:

1. STORAGE:

The data file should be copied to the local file system of any node in your Hadoop cluster. This data file is to be moved to HDFS of the Hadoop cluster by configuring and running a suitable Flume agent. The block size of the file should be selected for optimum performance. A suitable value for the replication factor of the file should be selected to ensure reliable storage of the data file.
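As a rough aid for the block-size and replication choices described above, the sketch below counts how many HDFS blocks a file occupies at a few candidate block sizes and how much raw storage a given replication factor consumes. The 90 MiB file size and the helper names are assumptions for illustration only; the assignment does not state the real file size.

```python
MIB = 1024 * 1024


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many HDFS blocks a file of the given size occupies."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: the last block may be only partly filled.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes: int, replication: int) -> int:
    """Return the total bytes stored across the cluster for one file."""
    return file_size_bytes * replication


# Hypothetical file size; replace with the size of the actual data file.
file_size = 90 * MIB

for block_mib in (32, 64, 128):
    blocks = hdfs_block_count(file_size, block_mib * MIB)
    print(f"block size {block_mib} MiB -> {blocks} block(s)")

# Replication factor 3 is only an example value here.
print(raw_storage_bytes(file_size, 3) // MIB, "MiB stored with replication 3")
```

With default settings, a splittable text file usually yields about one input split per block, so the block count is a first approximation of how many parallel tasks a query can use. Very small blocks add per-task overhead, while a block larger than the file leaves a single task.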

2. METADATA

The data consists of RecordNo, InvoiceNo, StockCode, Description, Quantity, InvoiceDate, Price, CustomerID, and Country. Some of the fields in the data may be blank. If required, you are allowed to remove the first header record containing the schema definition, or this record may be skipped during reading and/or analysis. No other modifications are allowed to the contents of the file.
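The two metadata rules above (skip the header record, tolerate blank fields) can be sketched in plain Python as follows. This is a minimal illustration, not the assignment's required framework: it assumes a comma-delimited file, and the sample rows, the `FIELDS` list, and the `read_records` helper are introduced here for the example.

```python
import csv
import io
from typing import Dict, Iterator, Optional, TextIO

FIELDS = [
    "RecordNo", "Invoice", "StockCode", "Description", "Quantity",
    "InvoiceDate", "Price", "CustomerID", "Country",
]

# Two sample rows in an assumed comma-separated layout.
SAMPLE = """RecordNo,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID,Country
45231,493412,TEST001,This is a test product.,5,01-04-2010 09:53,4.5,12346,United Kingdom
45232,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,01-04-2010 09:54,0.85,,United Kingdom
"""


def read_records(stream: TextIO, has_header: bool = True) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per data row, mapping blank fields to None."""
    reader = csv.reader(stream)
    if has_header:
        next(reader, None)  # skip the schema record without editing the file
    for row in reader:
        if not row:
            continue  # ignore completely empty lines
        yield {name: (value.strip() or None) for name, value in zip(FIELDS, row)}


records = list(read_records(io.StringIO(SAMPLE)))
for record in records:
    print(record["RecordNo"], record["StockCode"], record["CustomerID"])
```

Skipping the header at read time leaves the file contents untouched, which keeps within the rule that no other modifications are allowed.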

Big Data Systems Assignment 2

3. ANALYTIC QUERIES FOR BENCHMARKING:

   1. Total revenue (Aggregation of Price) received in the year 2010.

   2. List of unique items sold (With same StockCode) and their total sales volume (Aggregation of Quantity) in the year 2010, sorted in ascending order of StockCode.
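A small reference implementation of the two queries can help check the results returned by the chosen frameworks on a few rows. The sketch below uses only the Python standard library. It assumes a comma-delimited file and a day-month-year timestamp format (`DATE_FORMAT`), and it follows the assignment's own definition of revenue as the sum of Price. The sample rows and function names are illustrative.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterator, List, Tuple

SAMPLE = """RecordNo,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID,Country
45230,493411,21539,RETRO SPOTS BUTTER DISH,-1,01-04-2010 09:43,4.25,14590,United Kingdom
45231,493412,TEST001,This is a test product.,5,01-04-2010 09:53,4.5,12346,United Kingdom
45232,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,01-04-2010 09:54,0.85,,United Kingdom
45235,493414,21844,RETRO SPOT MUG,36,01-04-2010 10:28,2.55,14590,United Kingdom
"""

# Assumed timestamp layout; adjust if the real file differs.
DATE_FORMAT = "%d-%m-%Y %H:%M"


def rows_in_year(text: str, year: int) -> Iterator[Dict[str, str]]:
    """Yield rows whose InvoiceDate parses and falls in the given year."""
    for row in csv.DictReader(io.StringIO(text)):
        try:
            when = datetime.strptime(row["InvoiceDate"], DATE_FORMAT)
        except (TypeError, ValueError):
            continue  # blank or malformed date: skip the row
        if when.year == year:
            yield row


def total_revenue(text: str, year: int) -> float:
    """Query 1: sum of Price over all rows in the year."""
    return sum(float(row["Price"]) for row in rows_in_year(text, year) if row["Price"])


def sales_volume_by_stock_code(text: str, year: int) -> List[Tuple[str, int]]:
    """Query 2: total Quantity per StockCode, sorted by StockCode."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows_in_year(text, year):
        if row["StockCode"] and row["Quantity"]:
            totals[row["StockCode"]] += int(row["Quantity"])
    return sorted(totals.items())  # string (lexicographic) order


print(round(total_revenue(SAMPLE, 2010), 2))
print(sales_volume_by_stock_code(SAMPLE, 2010))
```

StockCode is treated as a string because codes such as TEST001 are not numeric, so "ascending order" here means lexicographic order. A framework that sorts numerically would order mixed codes differently.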

4. FRAMEWORKS/PLATFORMS TO BE COMPARED:

   - Pig Latin Scripts

5. GUIDELINES FOR PERFORMANCE COMPARISON:

   1. You need to select one framework from the Hadoop group and the second framework from the Spark group given in Section 4 above. It is NOT allowed to select two frameworks from the same group. In this assignment, you need to do a query performance comparison between the two frameworks selected by you. The two queries to be used for performance evaluation are given in Section 3 (Analytic queries for benchmarking).

   2. If you are using Linux, you can use the time command to time your command. For Windows, you need to find a method to determine the time taken for execution of each of the queries. Sometimes, the time taken for execution of a query can be less than 1 second and you may not be able to measure time in the millisecond range. You have the following 2 options to overcome this problem:

      - Repeat the query multiple times, say 10 to 100, and determine the total time taken. Then find the time taken for an individual query by dividing the total time by the number of repetitions.

      - Almost all the platforms mentioned above allow you to specify a folder in HDFS as input. You may copy multiple copies of the same data file into the input folder (of course with different file names) and execute the query. Then find the query time by dividing the total time by the number of copies of the file.
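Both timing options above reduce to measuring a total and dividing it. A minimal Python sketch of that arithmetic follows. The helper names are made up for this example, and the stand-in workload is a placeholder for the real query launch, which could be a `subprocess.run` call that starts the chosen framework with a hypothetical script such as `query1.pig`.

```python
import time
from typing import Callable


def average_seconds(run_query: Callable[[], object], repetitions: int = 10) -> float:
    """Option 1: run the query several times and return the mean wall-clock time."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    start = time.perf_counter()
    for _ in range(repetitions):
        run_query()
    total = time.perf_counter() - start
    return total / repetitions


def per_copy_seconds(total_seconds: float, copies: int) -> float:
    """Option 2: divide one run over N copies of the file by N."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return total_seconds / copies


# Stand-in workload so the sketch runs anywhere. In practice this would launch
# the real query, for example:
#   lambda: subprocess.run(["pig", "-f", "query1.pig"], check=True)
# where query1.pig is a hypothetical script name.
def stand_in_query() -> int:
    return sum(range(100_000))


print(f"mean per run: {average_seconds(stand_in_query, repetitions=10):.6f} s")
print(f"per copy: {per_copy_seconds(12.0, 4):.1f} s")
```

Each repetition of a real job also pays framework start-up cost, so the mean from option 1 includes that overhead, whereas option 2 spreads a single start-up over all copies. The two options are therefore not directly interchangeable when comparing frameworks.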

6. CONDITIONS

   1. Since this is a group assignment involving comparison of performance on 2 different frameworks, one student should work on 1 platform and the other student(s) should work on the second platform. The group leader needs to consolidate the results and submit the assignment.

   2. You should use Apache Flume to move data from the local file system to HDFS. If data is moved with the Hadoop put command, marks will be reduced.

   3. The Hadoop cluster should be configured on Linux/Windows systems.

   4. If only one system is available, you need to configure the cluster in pseudo-distributed mode.

   5. The replication factor for the HDFS files should be set to the number of nodes in the cluster.

   6. Focus on performance tuning of the framework by selecting proper configuration parameters instead of accuracy of the query results.

Your submission should consist of all the following 7 items:

   1. Configuration files of the Hadoop cluster/Spark and frameworks like Pig, Hive, HBase used in your solution. Include only the parts of the configuration files which you have modified.

   2. The configuration of the Flume agent developed by you to transfer the data file from the local filesystem to the HDFS folder.

   3. The code, scripts, and query developed for any 2 of the selected platforms:

   System details of your Hadoop cluster (from all nodes, if you are using more than one node).

      - CPU clock speed and number of cores, Memory size in GB

      - UUID of the system (On Linux - sudo dmidecode -t system | grep UUID) (On Windows - wmic path win32_computersystemproduct get u
