Question:

Part 3: Counting Sequences of \( N \)-grams (25\%)
Using WordLengthCount.java as a starting point, extend it to count the sequences of \( N \)-grams (call the program NgramCount.java). An \( N \)-gram is a sequence of \( N \) consecutive words, where \( N \) is the input in the command-line argument.
Output Format: In this problem, we define a word as a sequence of English alphanumeric characters (e.g., dogs, cat31). Each line contains a tuple of (1st word, 2nd word, ..., \( N \)-th word, count), where the adjacent items are separated by non-alphanumeric characters (e.g., space, tab, newline, or punctuation). You may assume that the datasets contain only printable ASCII characters, all in English. You can output the count of each sequence in any order. Words are case-sensitive. That is, "Apple apple" and "apple apple" should be counted and displayed separately. See the example below, where \( N=2 \).
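The word definition above can be illustrated with a small plain-Java sketch. It is not part of the assignment skeleton: the class name `WordSplitter` and the method `words` are hypothetical, and the regular expression is one possible reading of "sequence of English alphanumeric characters" for ASCII input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the assignment): splits text into words,
// where a word is a maximal run of ASCII letters and digits.
class WordSplitter {
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        // A run of one or more non-alphanumeric characters is one delimiter.
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            // split() yields an empty first piece when the text starts with a delimiter.
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }
}
```

Under this reading, `WordSplitter.words("'The quick ,,,,, brown")` would give the three words The, quick, and brown, with case preserved.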
Command Format:
$ hadoop jar [.jar file] [class name] [input dir] [output dir] [N]
Sample Input:
'The quick brown fox jumps
over the lazy dogs.
Sample Output:
The quick 1
quick brown 1
brown fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy dogs 1
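The sample output can be reproduced by sliding a window of \( N \) words over the word sequence. The sketch below is plain Java with hypothetical names (`NgramWindow`, `ngrams`), and it assumes the words have already been extracted in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the assignment): lists every run of
// n consecutive words, each joined by single spaces.
class NgramWindow {
    static List<String> ngrams(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        // The last window starts at index words.size() - n.
        for (int i = 0; i + n <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + n)));
        }
        return out;
    }
}
```

For the nine words of the sample input and \( N=2 \), this gives the eight pairs listed above, from "The quick" to "lazy dogs". A sequence shorter than \( N \) words gives no \( N \)-grams.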
Notes:
- We treat the punctuation marks "'" and "." as delimiters, so "'The" yields only the word "The" and "dogs." yields only the word "dogs".
- For a string like "ca't2e", we treat it as two words, "ca" and "t2e".
- The word at the end of a line and the word at the beginning of the next line can also form an \( N \)-gram. For example, "jumps over" is an \( N \)-gram for \( N=2 \).
- Multiple non-alphanumeric characters are treated as a single delimiter. For example, "quick ,,,,, brown" will form a 2-gram "quick brown".
- We may miss some \( N \)-grams that appear in two different HDFS blocks, but it's okay.
- You don't need to implement any combining technique here, but you are strongly encouraged to think of any possible way to speed up your program.

```
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (token, 1) for every whitespace-separated token in an input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts received for each key.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
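The notes on \( N \)-grams that span two lines and on speeding up the program can be sketched together in plain Java, without any Hadoop classes. This is an illustration under stated assumptions, not the expected solution: `NgramAccumulator`, `addLine`, and `counts` are hypothetical names. In a real mapper the same state would presumably live in fields of the `Mapper` subclass, `addLine` would correspond to the body of `map`, and the accumulated counts could be written out in `cleanup`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-mapper state (not from the assignment).
class NgramAccumulator {
    private final int n;
    // Holds the most recent words, at most n of them, across line boundaries.
    private final Deque<String> window = new ArrayDeque<>();
    // Local counts, so each distinct n-gram is emitted once instead of once per occurrence.
    private final Map<String, Integer> counts = new HashMap<>();

    NgramAccumulator(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    // Feed one input line. The window is not cleared between calls,
    // so the last words of one line combine with the first words of the next.
    void addLine(String line) {
        for (String word : line.split("[^A-Za-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // leading delimiter produces an empty piece
            }
            window.addLast(word);
            if (window.size() > n) {
                window.removeFirst();
            }
            if (window.size() == n) {
                counts.merge(String.join(" ", window), 1, Integer::sum);
            }
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

Feeding the two sample lines to an accumulator built with \( N=2 \) counts "jumps over" once, because the window still holds "jumps" when the second line arrives. Keeping counts in a local map trades memory for fewer emitted records, so it only suits inputs where the number of distinct \( N \)-grams per mapper fits in memory.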