Question
As a data engineer, you are asked to perform text analysis to find the set of words that are used frequently within a file. To do this, write a MapReduce program that identifies all words whose length is greater than 5 and whose frequency of occurrence is greater than 100.
Input Dataset: The dataset is located at hdfs:///bigdatapgp/common_folder/assignment3/frequence
Constraints:
- Consider only letters and digits, and ignore any special characters (. , : ; - + etc.) while splitting the words.
- Treat ROMAN, Roman, and roman as the same word (i.e. roman) when calculating the frequency.
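The two constraints above can be sketched in plain Java, without any Hadoop dependency, as the per-line logic a mapper might apply. This is only an illustration: the class name `TokenizeSketch` and the method `tokenize` are hypothetical, and it assumes that special characters act as word separators (so `well-known` yields two words) rather than being deleted from inside a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: shows the splitting and
// case-folding rules from the constraints on a single line of text.
class TokenizeSketch {

    // Returns the lower-cased words of a line, where a word is a run of
    // letters (A-Z, a-z) and digits (0-9).
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        // Assumption: any run of other characters is treated as a separator.
        for (String token : line.split("[^A-Za-z0-9]+")) {
            // split() can yield an empty first token when the line starts
            // with a separator, so skip empties.
            if (!token.isEmpty()) {
                // ROMAN, Roman and roman all become "roman".
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

In an actual MapReduce job, a mapper would typically emit each returned word with a count of 1.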
Expected Output: List each word along with its frequency, separated by a space. For example:
roman 300
siward 240
....
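The counting, filtering, and output format can likewise be sketched in plain Java, standing in for what a reducer might do once all counts for a word are summed. The class `FrequencyFilterSketch` and its methods are hypothetical names, and the input is assumed to be words that are already lower-cased and stripped of special characters. Read strictly, "length greater than 5" excludes five-letter words such as roman, so this sketch treats the sample output as an illustration of the format only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: counts words in memory and
// applies the two thresholds from the question.
class FrequencyFilterSketch {
    static final int MIN_LENGTH = 5;   // keep words with length > 5
    static final int MIN_COUNT = 100;  // keep words with frequency > 100

    // Counts only words longer than MIN_LENGTH; TreeMap keeps keys sorted.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            if (word.length() > MIN_LENGTH) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Formats each word whose count exceeds MIN_COUNT as "word count".
    static List<String> format(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > MIN_COUNT) {
                lines.add(entry.getKey() + " " + entry.getValue());
            }
        }
        return lines;
    }
}
```

The length check could equally be applied on the map side, which would reduce the amount of data shuffled to the reducers; the frequency check can only be applied after the counts are complete.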
Expected Solution: Paste the MR code, the Hadoop commands, and the path of the final JAR used to produce this output.