Question

As a data engineer, you are asked to perform text analysis to find the set of words that are used frequently within a file. To do this, write a MapReduce program that identifies all words whose length is greater than 5 and whose frequency of occurrence is greater than 100.

Input Dataset: The dataset is located at hdfs:///bigdatapgp/common_folder/assignment3/frequence

Constraints:

  • Consider only letters and digits, and ignore any special characters (. , : ; - + etc.) while splitting the words.
  • Treat the words ROMAN, Roman, and roman as the same word (i.e. roman) while calculating the frequency.
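
The two constraints above can be sketched as a small tokenizer. This is a minimal illustration, not part of the original question: the class name WordTokenizer and the method tokenize are hypothetical, and it assumes that "ignoring" a special character means treating it as a word separator (so "roman-empire" yields two words) and that only ASCII letters and digits count as word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: normalizes case and splits a line into words.
final class WordTokenizer {

    // Assumption: any run of characters other than ASCII letters or digits
    // separates two words.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        // Lowercase first so ROMAN, Roman and roman all become "roman".
        String normalized = line.toLowerCase(Locale.ROOT);
        for (String token : normalized.split("[^a-z0-9]+")) {
            // split() yields a leading empty string when the line starts
            // with a separator, so skip empty tokens.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }
}
```

If the intended reading is instead that special characters are deleted without splitting (so "roman-empire" becomes "romanempire"), the regular expression would need to change accordingly.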

Expected Output: List each word along with its frequency, separated by a space. For example,

roman 300
siward 240

....

Expected Solution: Paste the MapReduce code, the Hadoop commands, and the path of the final JAR used to produce this output.
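
The logic such a job needs can be sketched in plain Java, without Hadoop dependencies, to make the two thresholds concrete. This is only an in-memory illustration under stated assumptions, not the expected solution: the class FrequentWords and its methods map, reduce, and run are hypothetical names, and special characters are assumed to act as word separators. In a real job the body of map would go into a Hadoop Mapper and the body of reduce into a Reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the map and reduce phases.
final class FrequentWords {
    static final int MIN_LENGTH_EXCLUSIVE = 5;   // keep words with length > 5
    static final int MIN_COUNT_EXCLUSIVE = 100;  // keep words with frequency > 100

    // "Map" phase: emit every lowercased word longer than 5 characters.
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() > MIN_LENGTH_EXCLUSIVE) {
                emitted.add(token);
            }
        }
        return emitted;
    }

    // "Reduce" phase: sum occurrences per word, then drop words whose
    // total is not greater than 100.
    static Map<String, Integer> reduce(Iterable<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count <= MIN_COUNT_EXCLUSIVE);
        return counts;
    }

    // Runs both phases and formats each result as "word frequency".
    static List<String> run(List<String> lines) {
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : reduce(emitted).entrySet()) {
            output.add(entry.getKey() + " " + entry.getValue());
        }
        return output;
    }
}
```

The length filter is applied during the map phase because it depends only on the word itself, while the frequency filter has to wait for the final totals. In an actual Hadoop job this means the count threshold belongs in the reducer and not in a combiner, since a combiner only sees partial counts.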
