
Posted on Sep 26, 2024


Question: MapReduce

You are a data scientist working with the United States Internal Revenue Service. The IRS maintains a registry of all United States individual taxpayers. For each taxpayer, the IRS stores the following attributes (this is not the full list):

First Name,
Middle Name 1,
Middle Name 2,
Last Name,
Street (Address),
City,
State,
Zip

The IRS wants to match its records (the registry of taxpayers) to 200 million DMV records. The DMV records contain the same attributes as the IRS records. The IRS must determine whether a (DMV, IRS) pair of records refers to the same individual. To do this, it must compute a similarity score between every possible pair of DMV and IRS records. If there are 200 million IRS records, each DMV record will have 200 million possible matches and therefore 200 million similarity scores. The IRS would like to end up with a collection of pairs, e.g. (DMVRecord_1, IRSRecord_234345), each representing the highest-scoring match a DMV record had with any IRS record. The final output will have 200 million pairs (the same number as available DMV records). Assuming that the IRS has given you a function that determines the similarity between the two records of a candidate pair, your job is to design a MapReduce application that generates the final matches.

Please answer the following questions, in your own words, on how you would design the MapReduce job:

1. The Mapper (1)

a. What is the input key and value combination (give the data types for the input key and the input value)?

b. What should the map function do to each input key-value pair? Please be detailed and specific.

c. What is the output key-value pair that is sent to the reducer (give the data types for the output key and the output value)?

2. The Reducer (1)

a. What are the data types for the key and values submitted by the mapper?

b. What will the reducer do? What type of aggregation is required here?

c. What data types are needed for the key and value output by the reducer?

1. The Mapper (2)

a. What is the input key and value combination (give the data types for the input key and the input value)?

b. What should the map function do to each input key-value pair? Please be detailed and specific.

c. What is the output key-value pair that is sent to the reducer (give the data types for the output key and the output value)?

2. The Reducer (2)

a. What are the data types for the key and values submitted by the mapper?

b. What will the reducer do? What type of aggregation is required here?

c. What data types are needed for the key and value output by the reducer?
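
One way to read the two-job structure above, offered as a minimal sketch and not as the expected answer: job 1 partitions the IRS records into blocks, copies every DMV record to each block, and has each reducer score the pairs in its block and emit the best local candidate per DMV record; job 2 regroups the candidates by DMV record and keeps the overall maximum. Everything named here is an assumption for illustration: `similarity` is a stand-in for the IRS-supplied function, `run_job` simulates map, shuffle, and reduce in memory, and `NUM_BLOCKS`, the record IDs, and the sample records are hypothetical.

```python
import zlib
from collections import defaultdict
from difflib import SequenceMatcher

NUM_BLOCKS = 2  # hypothetical number of IRS partitions


def similarity(dmv_fields, irs_fields):
    # Placeholder for the IRS-supplied scoring function (assumption).
    return SequenceMatcher(None, " ".join(dmv_fields), " ".join(irs_fields)).ratio()


def run_job(records, mapper, reducer):
    # In-memory stand-in for one MapReduce job: map, group by key, reduce.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]


def map1(record_id, record):
    # Input: key = record ID (str), value = (source tag, tuple of fields).
    source, fields = record
    if source == "IRS":
        # Each IRS record goes to exactly one block.
        block = zlib.crc32(record_id.encode()) % NUM_BLOCKS
        yield block, (source, record_id, fields)
    else:
        # Each DMV record is copied to every block so it meets every IRS record.
        for block in range(NUM_BLOCKS):
            yield block, (source, record_id, fields)


def reduce1(block, values):
    # Score every (DMV, IRS) pair inside this block; keep the local best per DMV record.
    irs = [(rid, f) for src, rid, f in values if src == "IRS"]
    dmv = [(rid, f) for src, rid, f in values if src == "DMV"]
    if not irs:
        return
    for dmv_id, dmv_fields in dmv:
        scored = ((irs_id, similarity(dmv_fields, irs_fields)) for irs_id, irs_fields in irs)
        yield dmv_id, max(scored, key=lambda c: c[1])


def map2(dmv_id, candidate):
    # Identity map: key = DMV ID (str), value = (IRS ID, score).
    yield dmv_id, candidate


def reduce2(dmv_id, candidates):
    # Max aggregation: keep the highest-scoring candidate across all blocks.
    yield dmv_id, max(candidates, key=lambda c: c[1])


records = [
    ("DMV_1", ("DMV", ("John", "Smith", "Austin"))),
    ("DMV_2", ("DMV", ("Maria", "Garcia", "Miami"))),
    ("IRS_1", ("IRS", ("Jon", "Smith", "Austin"))),
    ("IRS_2", ("IRS", ("Maria", "Garcia", "Miami"))),
    ("IRS_3", ("IRS", ("Wei", "Chen", "Seattle"))),
]

local_best = run_job(records, map1, reduce1)
matches = run_job(local_best, map2, reduce2)
for dmv_id, (irs_id, score) in matches:
    print(dmv_id, irs_id, round(score, 3))
```

The blocking step bounds how many IRS records any single reducer must hold, at the cost of replicating each DMV record `NUM_BLOCKS` times; the second job exists only because a DMV record's best match may sit in any block.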

