Question: Part 3: Counting Sequences of N-grams (25%)
Using WordLengthCount.java as a starting point, extend it to count sequences of N-grams; call the program NgramCount.java. An N-gram is a sequence of N consecutive words, where N is given as a command-line argument.
Output Format: In this problem, we define a word as a sequence of English alphanumeric characters (e.g., dogs, cat), where adjacent words are separated by non-alphanumeric characters (e.g., space, tab, newline, or punctuation). Each output line contains a tuple of (1st word, 2nd word, ..., Nth word, count). You may assume that the datasets contain only printable ASCII characters, all in English. You can output the count of each sequence in any order. Words are case-sensitive; that is, "Apple apple" and "apple apple" should be counted and displayed separately. See the example below, where N = 2.
Command Format:
$ hadoop jar [jar file] [class name] [input dir] [output dir] [N]
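For example (the jar name and HDFS paths here are purely illustrative, and N = 2 matches the sample below):
$ hadoop jar NgramCount.jar NgramCount /user/hadoop/input /user/hadoop/output 2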
Sample Input:
'The quick brown fox jumps
over the lazy dogs.
Sample Output:
The quick	1
quick brown	1
brown fox	1
fox jumps	1
jumps over	1
over the	1
the lazy	1
lazy dogs	1
Notes:
We treat the punctuation marks ' and . as delimiters; that is, we treat only "The" and "dogs" as words, not "'The" and "dogs.".
For a word like "ca'te", we treat it as two words, "ca" and "te".
The word at the end of a line and the word at the beginning of the next line can also form an N-gram. For example, "jumps over" is an N-gram for N = 2. (The mapper sketch after these notes handles exactly this case.)
Multiple non-alphanumeric characters are treated as a single delimiter. For example, "quick" followed by several delimiter characters and then "brown" still forms the single 2-gram "quick brown".
We may miss some N-grams that appear in two different HDFS blocks, but that's okay.
You don't need to implement any combining technique here, but you are strongly encouraged to think of possible ways to speed up your program.
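To make the notes above concrete, here is a minimal, unofficial sketch of what the mapper could look like; it is not the expected solution, just one possible design. It assumes N is passed to the mappers through the job Configuration under a made-up key "ngram.n", splits each line on runs of non-alphanumeric characters, and keeps a sliding window of the last N words as mapper instance state, so N-grams spanning line boundaries within one input split are still emitted (grams split across HDFS blocks can still be missed, as the notes allow). All class and field names are illustrative.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits (N-gram, 1) pairs; names are illustrative assumptions.
public class NgramMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text gram = new Text();

  // Sliding window of the most recent words. It survives across map() calls,
  // so N-grams that straddle line boundaries inside one split are counted.
  private Deque<String> window;
  private int n;

  @Override
  protected void setup(Context context) {
    // "ngram.n" is an assumed configuration key set by the driver.
    n = context.getConfiguration().getInt("ngram.n", 2);
    window = new ArrayDeque<>(n);
  }

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Runs of non-alphanumeric characters act as a single delimiter,
    // and case is preserved, matching the problem statement.
    for (String token : value.toString().split("[^A-Za-z0-9]+")) {
      if (token.isEmpty()) {
        continue; // split() yields a leading empty string for lines like "'The..."
      }
      window.addLast(token);
      if (window.size() > n) {
        window.removeFirst();
      }
      if (window.size() == n) {
        gram.set(String.join(" ", window));
        context.write(gram, one);
      }
    }
  }
}

For reference, the provided WordCount starter code follows.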
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-delimited token in the line.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sums all the counts emitted for a given word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
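In the driver, the only parts that would need to change relative to the main method above are reading N from the extra command-line argument and handing it to the mappers through the Configuration. A hedged sketch of a replacement main for NgramCount.java, again using the assumed "ngram.n" key and the NgramMapper sketched earlier (imports are the same as above):

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // args[2] is the extra N argument from the command format above.
    conf.setInt("ngram.n", Integer.parseInt(args[2]));
    Job job = Job.getInstance(conf, "ngram count");
    job.setJarByClass(NgramCount.class);
    job.setMapperClass(NgramMapper.class);
    // A combiner is not required by the assignment, but since the reducer
    // just sums counts, reusing it as a combiner is one possible speedup.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }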