Question: MapReduce
You are a data scientist working with the United States Internal Revenue Service. The IRS maintains a registry of all United States individual taxpayers. For each taxpayer, the IRS stores the following attributes (this is not the full list):
- First Name,
- Middle Name 1,
- Middle Name 2,
- Last Name,
- Street (Address),
- City,
- State,
- Zip
The IRS wants to match its records (the registry of taxpayers) against 200 million DMV records. The DMV records contain the same attributes as the IRS records. The IRS must determine whether a pair of (DMV, IRS) records refers to the same individual. To do this, it must compute a similarity score between every possible pair of DMV and IRS records. If there are 200 million IRS records, each DMV record will have 200 million possible matches and therefore 200 million similarity scores. The IRS would like to end up with a collection of pairs, e.g. (DMVRecord_1, IRSRecord_234345), each representing the highest-scoring match that a DMV record had with any IRS record. The final output will have 200 million pairs (the same number as there are DMV records). Assuming that the IRS has given you a function that computes the similarity between the two records of a candidate pair, your job is to design a MapReduce application that generates the final matches.
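To make the required output concrete, here is a minimal single-machine sketch of the matching described above. It is only an illustration: `similarity` is a hypothetical stand-in (built on `difflib`) for the function the IRS is said to provide, and the record IDs and field values are invented toy data. At the sizes stated in the problem, the nested loop would perform 200 million × 200 million = 4 × 10^16 comparisons, which is why the work has to be distributed.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the IRS-provided similarity function.
# Each record is a tuple: (first, middle1, middle2, last, street, city, state, zip).
def similarity(dmv_record, irs_record):
    a = " ".join(dmv_record).lower()
    b = " ".join(irs_record).lower()
    return SequenceMatcher(None, a, b).ratio()

def best_matches(dmv_records, irs_records):
    """Return {dmv_id: (irs_id, score)}: the top-scoring IRS record per DMV record."""
    result = {}
    for dmv_id, dmv in dmv_records.items():
        best_id, best_score = None, -1.0
        for irs_id, irs in irs_records.items():
            score = similarity(dmv, irs)
            if score > best_score:
                best_id, best_score = irs_id, score
        result[dmv_id] = (best_id, best_score)
    return result

# Invented toy data, for illustration only.
dmv = {
    "DMVRecord_1": ("Ann", "", "", "Lee", "1 Main St", "Austin", "TX", "78701"),
}
irs = {
    "IRSRecord_1": ("Bob", "", "", "Stone", "99 Oak Ave", "Denver", "CO", "80202"),
    "IRSRecord_2": ("Ann", "", "", "Lee", "1 Main Street", "Austin", "TX", "78701"),
}

matches = best_matches(dmv, irs)
for dmv_id, (irs_id, score) in matches.items():
    print(dmv_id, irs_id, round(score, 3))
```

The result has exactly one entry per DMV record, which mirrors the 200 million output pairs the problem asks for.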
Please answer the following questions, in your own words, on how you would design the MapReduce job:
1. The Mapper (1)
a. What is the input key and value combination (give the data types for the input key and the input value)?
b. What should the map function do to each input key-value pair? Please be detailed and specific.
c. What is the output key-value pair that is sent to the reducer (give the data types for the output key and the output value)?
2. The Reducer (1)
a. What are the data types of the key and values submitted by the mapper?
b. What will the reducer do? What type of aggregation is required here?
c. What data types are needed for the key and value output by the reducer?
1. The Mapper (2)
a. What is the input key and value combination (give the data types for the input key and the input value)?
b. What should the map function do to each input key-value pair? Please be detailed and specific.
c. What is the output key-value pair that is sent to the reducer (give the data types for the output key and the output value)?
2. The Reducer (2)
a. What are the data types of the key and values submitted by the mapper?
b. What will the reducer do? What type of aggregation is required here?
c. What data types are needed for the key and value output by the reducer?
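As a starting point for thinking about the two mapper/reducer stages, the sketch below simulates one possible two-job design in memory. It is not necessarily the design the question intends. The assumptions are: records are split into `NUM_BLOCKS` blocks per data set; job 1 replicates each record to every block-pair "cell" it participates in and scores the pairs inside each cell; job 2 takes the maximum per DMV record across cells. The names `run_job`, `block_of`, `similarity`, and the toy records are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

NUM_BLOCKS = 2  # hypothetical; a real job would use many more blocks

def similarity(a, b):
    # Hypothetical stand-in for the IRS-provided similarity function.
    return SequenceMatcher(None, " ".join(a).lower(), " ".join(b).lower()).ratio()

def block_of(record_id):
    # Deterministic toy partitioner over the record ID string.
    return sum(record_id.encode()) % NUM_BLOCKS

def run_job(inputs, mapper, reducer):
    """Tiny in-memory stand-in for the map -> shuffle -> reduce pipeline."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reducer(k, vs))
    return out

# Job 1 mapper: in  (record_id: str, (source: str, fields: tuple))
#               out ((dmv_block, irs_block): tuple, (source, record_id, fields))
def map1(record_id, tagged):
    source, fields = tagged
    b = block_of(record_id)
    for other in range(NUM_BLOCKS):
        cell = (b, other) if source == "DMV" else (other, b)
        yield cell, (source, record_id, fields)

# Job 1 reducer: scores every (DMV, IRS) pair in one cell and keeps the
# best IRS candidate per DMV record. out (dmv_id: str, (irs_id: str, score: float))
def reduce1(cell, values):
    dmv = [(rid, f) for s, rid, f in values if s == "DMV"]
    irs = [(rid, f) for s, rid, f in values if s == "IRS"]
    for dmv_id, d in dmv:
        if irs:
            score, irs_id = max((similarity(d, f), rid) for rid, f in irs)
            yield dmv_id, (irs_id, score)

# Job 2 mapper: identity, so candidates are regrouped by DMV record ID.
def map2(dmv_id, candidate):
    yield dmv_id, candidate

# Job 2 reducer: max aggregation over the per-cell candidates.
def reduce2(dmv_id, candidates):
    yield dmv_id, max(candidates, key=lambda c: c[1])

# Invented toy data, for illustration only.
records = [
    ("DMVRecord_1", ("DMV", ("Ann", "", "", "Lee", "1 Main St", "Austin", "TX", "78701"))),
    ("IRSRecord_1", ("IRS", ("Bob", "", "", "Stone", "99 Oak Ave", "Denver", "CO", "80202"))),
    ("IRSRecord_2", ("IRS", ("Ann", "", "", "Lee", "1 Main Street", "Austin", "TX", "78701"))),
]

final = dict(run_job(run_job(records, map1, reduce1), map2, reduce2))
print(final)
```

In this sketch, every (DMV, IRS) pair meets in exactly one cell, so each similarity score is computed once, and the only aggregation needed in either reducer is a maximum. The trade-off is that each record is replicated `NUM_BLOCKS` times during the first shuffle.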