Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

1. Random Sampling a data stream - 75 points The Problem Statement: Your data is a stream of items of unknown length that we

image text in transcribed

1. Random Sampling a data stream - 75 points The Problem Statement: Your data is a stream of items of unknown length that we can only iterate over once. Also, the data is expected to be very large, so while you are developing your program, you'd like to work on a statistically accurate representative sample. You would need to implement an algorithm that randomly chooses an item from the data stream such that each item is equally likely to be selected. The Algorithm: The algorithm for this problem is the Reservoir Sampling algorithm. http://en.wikipedia.org/wiki/Reservoir_sampling. An good simpler explanation and a Python-only implementation is shown here: https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know- 43c7bc11d17c (if you hit paywall for this link, retry it on an incognito window). The Data: data_Q1_2019.zip This dataset contains the actual daily SMART logs for all hard drives used in a data center during the first quarter of 2019. Note that over the course of the three months, some drives will fail and new one will come into use. There are over 900k entries in this dataset. SMART, https://en.wikipedia.org/wiki/Self-Monitoring, Analysis_and_Reporting_Technology TODO: Create a random subset of 50k entries of this data using Reservoir Sampling. (there are about 900K entries) Sample size k = 50,000 = 50k 1. Implement Reservoir Sampling in Hadoop MapReduce - 50 points 2. Implement Reservoir Sampling in Spark - 25 points

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Making Hard Decisions with decision tools

Authors: Robert Clemen, Terence Reilly

3rd edition

538797576, 978-0538797573

More Books

Students also viewed these Mathematics questions

Question

What is the equation of a straight line?

Answered: 1 week ago