
Question


Please help with the second job; I need help writing mapper.py and reducer.py in Python.

The purpose of this project is to build upon your first wordcount project. We want to find the 11th most frequent word in the Cranfield collection, which is a set of text documents. The collection is in the following directory:

/assignment2/data.txt

Copy the text file to your HDFS directory before running any of the programs. Make sure the output of your programs is located in your own HDFS directory as well.
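For reference, one way to do the copy (assuming the hadoop client is on your PATH; a bare destination path lands in your HDFS home directory):

hadoop fs -put /assignment2/data.txt data.txt
hadoop fs -ls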

In order to complete the assignment you will need to run two MapReduce jobs.

1 First Job: Word Count.

The first job is the wordcount program, which simply counts the number of times each word occurs in the Cranfield collection. You can use the instructions from the first assignment to generate the word counts for the Cranfield collection. The output of the first MapReduce program will be given as input to the second MapReduce job.
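In case the first job is not at hand, a minimal streaming wordcount pair could look like the sketch below; the file names mapper1.py and reducer1.py are placeholders, and plain whitespace tokenization is assumed (follow the first assignment's instructions if they differ).

mapper1.py:

#!/usr/bin/env python3
# First-job mapper: emit a <word, 1> pair for every token on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

reducer1.py:

#!/usr/bin/env python3
# First-job reducer: input arrives grouped by word, so sum the 1s per word.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word == current_word:
        total += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")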

2 Second Job: Sort Counts.

The mapping phase of the second job has to take the output of the previous job as input, which is in the form of <word, count> pairs, and then output <count, word> pairs. During the sort and shuffle phase after mapping, the <count, word> pairs need to be sorted by key.
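A minimal sketch of that mapper, assuming the first job emitted tab- or space-separated <word, count> lines with exactly two fields:

mapper.py:

#!/usr/bin/env python3
# Second-job mapper: swap each <word, count> line to <count, word> so the
# framework sorts by count during sort and shuffle.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split()  # assumes exactly two whitespace-separated fields
    print(f"{count}\t{word}")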

The reduce phase simply takes the output of the map phase and writes the result to the output directory.
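A matching reducer sketch; since the <count, word> pairs arrive already sorted, it only echoes them. Capping the output at 11 lines is optional and only safe with a single reducer; the loop still drains stdin so the streaming framework does not see a broken pipe:

reducer.py:

#!/usr/bin/env python3
# Second-job reducer: pairs are already sorted by count, so pass them through.
import sys

emitted = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if emitted < 11:  # optional cap; drop this to emit every pair
        print(line)
        emitted += 1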

You can pass the output directory of the previous job as the input argument for the second job:

-input /user//output1

And the output option for the second job as:

-output /user//output2

Hint: You can take advantage of the options provided for the mapper to compare the keys during sort and shuffle: https://hadoop.apache.org/docs/r1.2.1/streaming.html
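For example, the second job might be launched as below; the jar path varies by installation, and the two -D options are the KeyFieldBasedComparator settings from the streaming docs, with -nr requesting a numeric, reversed (descending) sort on the key:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-nr \
    -input /user//output1 \
    -output /user//output2 \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py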

3 Submission.

After successfully running both programs, the final output of the second MapReduce job should list the 11 most frequent words in the Cranfield collection:

Output:

20179 the

13964 of

11046 .

7053 and

6413 a

4972 in

4669 to

4075 is

3699 for

2430 with

2420 are

The assignment will be graded on the second MapReduce job that you have to implement and on successfully generating the final result in your HDFS directory.

------------------------------

Need help for this part:

The input data will look like this (sorted alphabetically from a to z, but with more than 11 words):

input:

. 11046

a 6413

and 7053

are 2420

boy 200

child 150

for 3699

in 4972

is 4075

jim 28

of 13964

the 20179

to 4669

with 2430

-----------

The output is the 11 most frequent words, ordered by descending count:

20179 the

13964 of

11046 .

7053 and

6413 a

4972 in

4669 to

4075 is

3699 for

2430 with

2420 are

---------

Need to submit mapper.py & reducer.py.

I will use a dict: put the words and counts into a dictionary and sort it.
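A sketch of that dictionary approach, written as an alternative reducer that sorts in memory; it is only correct with a single reducer, and it assumes the mapper already swapped the fields to <count, word>:

#!/usr/bin/env python3
# Alternative reducer.py: collect all pairs, then sort by count in memory.
import sys

counts = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    count, word = line.split("\t", 1)
    counts[word] = int(count)

# Print the 11 most frequent words, highest count first.
for word in sorted(counts, key=counts.get, reverse=True)[:11]:
    print(f"{counts[word]}\t{word}")

Either way, the pipeline can be sanity-checked locally with something like cat counts.txt | python3 mapper.py | sort -k1,1nr | python3 reducer.py, where counts.txt is a local copy of the first job's output.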
