Question

Any language

In this assignment, you will be designing and implementing a distributed MapReduce system. You must write code that implements MapReduce as a "library" that higher level applications can then use for data processing.

2 What To Implement

2.1 Processes as nodes

Nodes and tasks are implemented as conventional OS processes. Your application must work on a single OS, as well as on a cluster of machines. Thus, you can only use network-based communication. Tasks must communicate with each other through RPCs or network sockets.
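As an illustration of socket-based communication between tasks, here is a minimal sketch of length-prefixed JSON messages over a TCP-style socket. The function names (send_msg, recv_msg) and the message fields are hypothetical and not part of the assignment; an RPC library would be an equally valid choice.

```python
import json
import socket
import struct

def send_msg(sock, obj):
    """Send one JSON-serializable object, prefixed with its 4-byte length."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock):
    """Receive one object that was sent with send_msg."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))
```

The length prefix matters because a stream socket does not preserve message boundaries: a single recv call may return part of a message or several messages at once.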

2.2 Data storage

Since we haven't built a distributed file system (yet), you can use a local file system for storing data. You can assume that each node/process has a private data store in the form of a directory, and that other processes cannot directly access this directory. Remember that all communication must be over network sockets and RPCs, so a process writing a file into another's local data store is not supported in a truly distributed system. For instance, /home/p1/ is the local disk storage of the p1 process, and the process cannot directly read other directories (/home/p2/) using file-system system calls.
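One way to enforce the private-directory rule in code is to route every file access through a small wrapper that refuses paths outside the node's own directory. The sketch below assumes hypothetical names (LocalStore, handle_fetch) and a made-up request format; other nodes would obtain data only by sending a fetch request over the network, never by opening the file themselves.

```python
from pathlib import Path

class LocalStore:
    """A node's private data directory (hypothetical helper)."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name):
        # Reject names such as "../p2/file" that escape the private directory.
        path = (self.root / name).resolve()
        if self.root not in path.parents:
            raise PermissionError(f"{name} is outside this node's data store")
        return path

    def write(self, name, data):
        self._resolve(name).write_bytes(data)

    def read(self, name):
        return self._resolve(name).read_bytes()

def handle_fetch(store, request):
    """Build the reply a node would send when a peer requests one of its files."""
    data = store.read(request["name"])
    return {"name": request["name"], "data": data.decode("utf-8")}
```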

2.3 Master node

The first thing your program does should be to spawn the master node. The master node then spawns the map and reducer tasks, and controls and coordinates all other processes.
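A minimal sketch of a master spawning map and reduce tasks as separate OS processes, using Python's subprocess module. The inline WORKER_CODE is a stand-in that only prints a line; in a real submission each task would run your own worker script, and the helper names here are hypothetical.

```python
import subprocess
import sys

# Stand-in for a real worker script: it only reports its role and task id.
WORKER_CODE = "import sys; print(f'{sys.argv[1]} task {sys.argv[2]} started')"

def spawn_task(role, task_id):
    """Start one task as a child process of the master."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_CODE, role, str(task_id)],
        stdout=subprocess.PIPE,
        text=True,
    )

def spawn_all(num_mappers, num_reducers):
    """Spawn every map task, then every reduce task."""
    procs = [spawn_task("map", i) for i in range(num_mappers)]
    procs += [spawn_task("reduce", i) for i in range(num_reducers)]
    return procs

def wait_all(procs):
    """Wait for each child to exit and collect what it printed."""
    return [p.communicate()[0].strip() for p in procs]
```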

2.4 API

The user should only interact with the master node through a well-defined API. You must think about what kind of an API the master should expose. For instance, it can be a long-running HTTP server, and you can interact with it through your web-browser. Or, it can be passed parameters through configuration files whose locations are known.

At a minimum, the master will need to support this external interface:

int cluster_id = init_cluster([(ip-address, port)])

run_mapred(input_data, map_fn, reduce_fn, output_location)

destroy_cluster(cluster_id)
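The interface above could be mirrored in Python roughly as follows. This is only a sketch of the cluster bookkeeping: the class name Master and its internal dictionary are assumptions, and run_mapred keeps the signature from the assignment but deliberately leaves the scheduling logic unimplemented.

```python
import itertools

class Master:
    """Hypothetical master-side bookkeeping for the three required calls."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._clusters = {}  # cluster_id -> list of (ip_address, port)

    def init_cluster(self, addresses):
        """Register the worker addresses and return a new cluster id."""
        if not addresses:
            raise ValueError("at least one (ip_address, port) pair is required")
        cluster_id = next(self._next_id)
        self._clusters[cluster_id] = [(str(ip), int(port)) for ip, port in addresses]
        return cluster_id

    def run_mapred(self, input_data, map_fn, reduce_fn, output_location):
        """Signature taken from the assignment; scheduling is left out here."""
        if not self._clusters:
            raise RuntimeError("call init_cluster before run_mapred")
        raise NotImplementedError("dispatch map and reduce tasks to the workers")

    def destroy_cluster(self, cluster_id):
        """Forget the cluster; a real master would also stop its processes."""
        if self._clusters.pop(cluster_id, None) is None:
            raise KeyError(f"unknown cluster id {cluster_id}")
```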

2.5 Fault-tolerance

Your system should be able to survive process failures. You can use either data replication or restarting the killed process, whichever is more convenient to implement.
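For the restart option, a minimal sketch is a supervisor that reruns a task process until it exits cleanly. The function name and the max_attempts limit are assumptions; this approach is simplest when tasks are deterministic, so that rerunning one produces the same output.

```python
import subprocess
import sys

def run_with_restart(argv, max_attempts=3):
    """Run a task process, restarting it if it exits with a non-zero status.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(argv)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"task failed {max_attempts} times: {argv}")
```

A killed process shows up here as a non-zero (on POSIX, negative) return code, so the same loop covers both crashes and explicit kill signals.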

2.6 Applications

The two main applications you must implement are:

  • Word-count
  • Inverted index

You can download input datasets from Project Gutenberg.
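As a reference point for the two applications, here is a sketch of their map and reduce functions together with a sequential, single-process runner for checking them. The function names, the tokenizer, and the (doc_id, text) input shape are assumptions; the distributed library would replace run_local.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def wc_map(doc_id, text):
    return [(word, 1) for word in tokenize(text)]

def wc_reduce(word, counts):
    return sum(counts)

def ii_map(doc_id, text):
    # Emit each distinct word once per document.
    return [(word, doc_id) for word in set(tokenize(text))]

def ii_reduce(word, doc_ids):
    return sorted(set(doc_ids))

def run_local(docs, map_fn, reduce_fn):
    """Sequential stand-in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

The output of such a runner can serve as the expected result when testing the distributed version on the same input.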

In addition, you must also implement an application that performs multiple rounds of map-reduce. That is, the output data of the reduce stage is the input data for a new round of map-reduce processing.
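One possible multi-round application, shown here as a single-process sketch: round 1 counts words, and round 2 takes those (word, count) pairs as input and groups words by their frequency. All names are hypothetical, and run_round stands in for one invocation of the distributed library.

```python
from collections import defaultdict

def run_round(records, map_fn, reduce_fn):
    """One map-reduce round over an iterable of (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [(k, reduce_fn(k, vs)) for k, vs in sorted(groups.items())]

def run_pipeline(records, rounds):
    """Feed the output of each round into the next one."""
    for map_fn, reduce_fn in rounds:
        records = run_round(records, map_fn, reduce_fn)
    return records

# Round 1: word count.
def count_map(doc_id, text):
    return [(word, 1) for word in text.lower().split()]

def count_reduce(word, ones):
    return sum(ones)

# Round 2: group words by how often they occur.
def freq_map(word, count):
    return [(count, word)]

def freq_reduce(count, words):
    return sorted(words)
```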

2.7 Other implementation notes

Providing a makefile and configuration files and scripts for your project is a must. You will not get any points if we cannot easily compile, run, and test your code.

Similarly, you must provide a few test cases.

Avoid hard-coding anything in your program. This includes IP addresses, port numbers, file-paths, etc. Use command line input parameters, or better yet, use a configuration file. As an example, you can look at the configuration options in Hadoop.
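A sketch of reading cluster settings from an INI-style configuration file with Python's configparser. The section and option names are invented for illustration, and the EXAMPLE string stands in for the contents of a file whose path would be passed on the command line.

```python
import configparser

# Stand-in for the contents of a configuration file on disk.
EXAMPLE = """
[master]
host = 127.0.0.1
port = 9000

[workers]
addresses = 127.0.0.1:9001, 127.0.0.1:9002

[job]
input_dir = data/input
output_dir = data/output
num_reducers = 2
"""

def load_config(text):
    """Parse the configuration text into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    workers = []
    for item in parser["workers"]["addresses"].split(","):
        host, port = item.strip().rsplit(":", 1)
        workers.append((host, int(port)))
    return {
        "master": (parser["master"]["host"], parser["master"].getint("port")),
        "workers": workers,
        "input_dir": parser["job"]["input_dir"],
        "output_dir": parser["job"]["output_dir"],
        "num_reducers": parser["job"].getint("num_reducers"),
    }
```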

3 Testing

It is important that you provide multiple test applications for your system. Otherwise, we will have no way of evaluating your submission.

These test cases should be sample MapReduce applications that we have discussed in class. Therefore, you must provide input data (or input data generation scripts), and the map and reduce functions. We should be able to run these examples without manual intervention, so it's crucial not to hardcode any paths or IP addresses.

At a minimum, you must implement the word-count and inverted-index examples. You can use books from the Project Gutenberg archive as documents.

In addition, you must also test for fault-tolerance, by killing a few processes and showing how your system responds.

Distributed systems are hard to test for correctness. One thing that can help is to log important events. This log of events is useful for debugging and evaluation, so please include the log output of your sample applications in your submission.
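Event logging could be set up with Python's standard logging module, for example as below. The logger naming scheme, the event names, and the key=value field format are assumptions chosen for illustration; each process would create its own logger at startup.

```python
import logging

def make_logger(node_name, stream=None):
    """Create a per-node logger; stream=None writes to standard error."""
    logger = logging.getLogger(f"mapreduce.{node_name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

def log_event(logger, event, **fields):
    """Log an event name followed by sorted key=value fields."""
    details = " ".join(f"{k}={v}" for k, v in sorted(fields.items()))
    logger.info("%s %s", event, details)
```

Including the node name and a timestamp on every line makes it possible to merge the logs of all processes afterwards and reconstruct the order of events.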

4 Report

You must clearly document all facets of your design, with regard to: data partitioning, fault-tolerance, and dynamic membership. Furthermore, you must also describe some of the implementation assumptions. You must carefully document all the communication between the master and workers.
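For the data partitioning part of the design, one common scheme is to split the input evenly among mappers and to route each intermediate key to a reducer by hashing. The sketch below assumes string keys and hypothetical function names. It uses hashlib rather than Python's built-in hash(), because the built-in string hash is randomized per process by default and so would send the same key to different reducers from different mapper processes.

```python
import hashlib

def partition(key, num_reducers):
    """Map a string key to a reducer index that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers

def split_input(lines, num_mappers):
    """Deal the input lines round-robin into one chunk per mapper."""
    return [lines[i::num_mappers] for i in range(num_mappers)]
```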

Note that a minimum working implementation of this assignment is fairly straightforward. Points will be awarded for clear design and implementation.
