Question
Please help with the second job; I need help writing mapper.py and reducer.py in Python.
The purpose of this project is to build upon your first wordcount project. We want to find the 11th most frequent word in the Cranfield collection, which is a set of text documents. The collection is in the following file:
/assignment2/data.txt
Copy the text file to your HDFS directory before running any of the programs. Make sure the output of your programs is located in your own HDFS directory as well.
In order to complete the assignment you will need to run two MapReduce jobs.
1 First Job: Word Count.
The first job is the wordcount program which simply counts the number of times a word occurred in the Cranfield collection. You can use the instructions in the first assignment to generate the word counts for the Cranfield collection. The output of the first MapReduce program will be given as input to the second MapReduce job.
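For reference, the first job's logic can be sketched as below (a minimal sketch, assuming whitespace tokenization; the exact tokenization should follow the first assignment's instructions, and the function names are my own):

```python
#!/usr/bin/env python
# Job 1 sketch: word count over Hadoop Streaming.
# In the real job, each function lives in its own script (mapper.py,
# reducer.py) driven by a loop such as:
#     import sys
#     for out in map_words(sys.stdin): print(out)

def map_words(lines):
    """Mapper: emit '<word>\t1' for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reduce_counts(lines):
    """Reducer: sum the 1s for each word; input arrives sorted by word."""
    current, total = None, 0
    for line in lines:
        word, count = line.strip().rsplit("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, int(count)
    # Flush the last word after the input ends.
    if current is not None:
        yield "%s\t%d" % (current, total)
```

Hadoop performs the sort between the two phases; `sorted()` stands in for it when testing the pipeline locally.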
2 Second Job: Sort Counts.
The mapping phase of the second job has to take the output of the previous job as input, which is in the form of <word, count> pairs, and then output <count, word> pairs. During the sort and shuffle phases after mapping, the <count, word> pairs need to be sorted by their keys (the counts).
The reduce phase will simply take the output of the mapper phase and write the result to the output directory.
You can pass the output directory of the previous job as the input argument for the second job:
-input /user//output1
And the output option for the second job as:
-output /user//output2
Hint: You can take advantage of the options provided for the mapper to compare the keys during sort and shuffle: https://hadoop.apache.org/docs/r1.2.1/streaming.html
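For example, the second job's mapper can simply swap each line's fields so the count becomes the key (a minimal sketch; the function name is my own):

```python
#!/usr/bin/env python
# mapper.py (job 2) sketch: turn "<word>\t<count>" lines into
# "<count>\t<word>" so the shuffle phase sorts by count.
# In the real script:
#     import sys
#     for out in swap_fields(sys.stdin): print(out)

def swap_fields(lines):
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.rsplit("\t", 1)  # split on the last tab
        yield "%s\t%s" % (count, word)
```

By default, streaming sorts keys as strings; per the linked documentation, you would pass options along the lines of `-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D mapred.text.key.comparator.options=-nr` (numeric, reversed) so the counts sort in descending numeric order — check the exact property names against the Hadoop version you are running.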
3 Submission.
After successfully running both programs, the final output of the second MapReduce job should contain the following results: the first 11 words with the highest frequency in the Cranfield collection.
Output:
20179 the
13964 of
11046 .
7053 and
6413 a
4972 in
4669 to
4075 is
3699 for
2430 with
2420 are
The assignment will be graded based on the second MapReduce job, which you have to implement, and on successfully generating the final result in your HDFS directory.
------------------------------
Need help for this part:
The input data (the output of the first job) will look like this (sorted alphabetically from a to z, but with more than 11 words):
input:
. 11046
a 6413
and 7053
are 2420
boy 200
child 150
for 3699
in 4972
is 4075
jim 28
of 13964
the 20179
to 4669
with 2430
-----------
The expected output is the first 11 words with the highest frequency (ordered by count, descending):
20179 the
13964 of
11046 .
7053 and
6413 a
4972 in
4669 to
4075 is
3699 for
2430 with
2420 are
---------
I need to submit mapper.py and reducer.py.
My plan is to use a dict: put the words and counts into a dictionary, then sort it.
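That dictionary idea can be sketched like this for reducer.py (a minimal sketch, assuming the mapper already emits `<count>\t<word>` lines; the function name and the top-11 cutoff parameter are illustrative):

```python
#!/usr/bin/env python
# reducer.py (job 2) sketch: collect "<count>\t<word>" pairs into a dict,
# sort by count in descending order, and print the top 11.
# In the real script:
#     import sys
#     for out in top_words(sys.stdin): print(out)

def top_words(lines, n=11):
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, word = line.split("\t", 1)
        counts[word] = int(count)
    # Rank by count, highest first, and keep only the first n entries.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for word, count in ranked[:n]:
        yield "%d\t%s" % (count, word)
```

Note that sorting inside the reducer like this only sees the whole collection if the job runs with a single reduce task (e.g. `-D mapred.reduce.tasks=1` — check the property name for your Hadoop version); with multiple reducers, each one would only rank its own partition. Alternatively, if the comparator options from the hint already sort the counts in descending order, the reducer can be a simple pass-through.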