

Posted on Sep 10, 2024

Any language

In this assignment, you will be designing and implementing a distributed MapReduce system. You must write code that implements MapReduce as a "library" that higher level applications can then use for data processing.

2 What To Implement

2.1 Processes as nodes

Nodes and tasks are implemented as conventional OS processes. Your application must work on a single OS, as well as on a cluster of machines. Thus, you can only use network-based communication. Tasks must communicate with each other through RPCs or network sockets.
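
One possible way to satisfy the network-only rule is to exchange length-prefixed JSON messages over TCP sockets. The sketch below is an illustration, not a required protocol: the function names `send_msg` and `recv_msg` and the 4-byte length header are assumptions made for this example.

```python
import json
import socket
import struct


def send_msg(sock, obj):
    """Send one JSON-serializable object, prefixed with its 4-byte big-endian length."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one message written by send_msg and decode it."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))
```

Because both helpers only need a connected socket, the same code works between processes on one machine and between machines in a cluster; only the address passed to `socket.create_connection` changes.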

2.2 Data storage

Since we haven't built a distributed file system (yet), you can use a local file system for storing data. You can assume that each node/process has a private data store in the form of a directory, and other processes cannot directly access this directory. Remember that all communication must be over network sockets and RPCs, so a process writing a file into another's local data store is not supported in a truly distributed system. For instance, /home/p1/ is the local disk storage of the p1 process, and the p1 process cannot directly read other directories (/home/p2/) using file-system system calls.
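
To keep yourself honest about the private-directory rule, each process can route all file access through a small wrapper that refuses paths outside its own directory. This is only a sketch under assumptions: the class name `LocalStore` and its methods are hypothetical, and the assignment does not require this particular mechanism.

```python
from pathlib import Path


class LocalStore:
    """File access restricted to one node's private directory (hypothetical helper)."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        # Resolve the requested name and reject anything outside the root,
        # such as "../p2/file" or an absolute path.
        p = (self.root / name).resolve()
        if p != self.root and self.root not in p.parents:
            raise PermissionError(f"{name} is outside this node's store")
        return p

    def write(self, name, data):
        p = self._path(name)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(data)

    def read(self, name):
        return self._path(name).read_bytes()
```

Data that another process needs would then be read through the store and sent over a socket or RPC rather than opened directly by the other process.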

2.3 Master node

The first thing your program does should be to spawn the master node. The master node then spawns the map and reducer tasks, and controls and coordinates all other processes.

2.4 API

The user should only interact with the master node through a well-defined API. You must think about what kind of API the master should expose. For instance, it can be a long-running HTTP server that you interact with through your web browser. Or, it can be passed parameters through configuration files whose locations are known.

At a minimum, the master will need to support this external interface:

int cluster_id = init_cluster([(ip-address, port)])

run_mapred(input_data, map_fn, reduce_fn, output_location)

destroy_cluster(cluster_id)

2.5 Fault-tolerance

Your system should be able to survive process failures. You can choose either data replication or restarting the killed process. You can choose a method of convenience to get this done.
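
If you choose the restart option, the simplest form is a supervisor that re-runs a task process whenever it exits abnormally. The sketch below assumes tasks are idempotent (safe to run again from the start); the function name `run_with_restart` and the attempt limit are hypothetical.

```python
import subprocess
import sys


def run_with_restart(cmd, max_attempts=3):
    """Run cmd as a child process and restart it if it fails.

    Returns the number of the attempt that succeeded. A non-zero return
    code covers both error exits and, on POSIX, death by signal
    (reported as a negative return code).
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"task exited with {result.returncode} on attempt {attempt}",
              file=sys.stderr)
    raise RuntimeError(f"task failed {max_attempts} times: {cmd}")
```

This version blocks until the child finishes. A master that supervises many tasks at once would more likely start them with `subprocess.Popen` and poll them, or detect failures through missed heartbeats.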

2.6 Applications

The two main applications you must implement are:

Word-count
Inverted index

You can download input datasets from Project Gutenberg.
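
For orientation, the map and reduce functions for the two applications can be as small as the following. This is a sketch under assumptions: the function names, the (document id, text) input records, and the tokenization rule are all choices made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (a simple assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def wc_map(doc_id, text):
    """Word count: emit (word, 1) for every occurrence."""
    for word in tokenize(text):
        yield word, 1


def wc_reduce(word, counts):
    """Word count: total occurrences of one word."""
    return sum(counts)


def ii_map(doc_id, text):
    """Inverted index: emit (word, doc_id) once per distinct word in the document."""
    for word in set(tokenize(text)):
        yield word, doc_id


def ii_reduce(word, doc_ids):
    """Inverted index: sorted list of documents containing the word."""
    return sorted(set(doc_ids))
```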

In addition, you must also implement an application that does multiple rounds of map-reduce. That is, the output data of the reduce stage is the input data for a new round of map-reduce processing.
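
One example of a multi-round job, chosen here only for illustration, is "group words by how often they occur": round one counts words, and round two takes the (word, count) output as its input and groups words by count. The local runner `local_round` below is a single-process assumption that stands in for the distributed system.

```python
from collections import defaultdict


def local_round(records, map_fn, reduce_fn):
    """Run one map-reduce round in this process (stand-in for the real system)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [(k, reduce_fn(k, groups[k])) for k in sorted(groups)]


def count_map(doc_id, text):
    return [(word, 1) for word in text.lower().split()]


def count_reduce(word, ones):
    return sum(ones)


def by_count_map(word, count):
    # Round two swaps the roles: the count becomes the key.
    return [(count, word)]


def by_count_reduce(count, words):
    return sorted(words)


def two_rounds(docs):
    """Feed the reduce output of round one into the map stage of round two."""
    round1 = local_round(docs, count_map, count_reduce)
    return local_round(round1, by_count_map, by_count_reduce)
```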

2.7 Other implementation notes

Providing a makefile and configuration files and scripts for your project is a must. You will not get any points if we cannot easily compile, run, and test your code.

Similarly, you must provide a few test cases.

Avoid hard-coding anything in your program. This includes IP addresses, port numbers, file paths, etc. Use command-line input parameters, or better yet, use a configuration file. As an example, you can look at the configuration options in Hadoop.
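
A common pattern is defaults, overridden by a configuration file, overridden in turn by command-line flags. The sketch below assumes a JSON configuration file; every option name in it (`master_port`, `num_reducers`, and so on) is hypothetical and only illustrates the layering.

```python
import argparse
import json

# Fallback values only; real deployments are expected to override these.
DEFAULTS = {
    "master_host": "127.0.0.1",
    "master_port": 9000,
    "num_mappers": 2,
    "num_reducers": 2,
    "data_dir": "./data",
}


def load_config(argv=None):
    """Merge defaults, an optional JSON config file, and command-line flags."""
    parser = argparse.ArgumentParser(description="MapReduce master (sketch)")
    parser.add_argument("--config", help="path to a JSON configuration file")
    parser.add_argument("--master-port", type=int)
    parser.add_argument("--num-reducers", type=int)
    args = parser.parse_args(argv)

    config = dict(DEFAULTS)
    if args.config:
        with open(args.config) as f:
            config.update(json.load(f))
    # Flags given explicitly on the command line win over the file.
    for key in ("master_port", "num_reducers"):
        value = getattr(args, key)
        if value is not None:
            config[key] = value
    return config
```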

3 Testing

It is important that you provide multiple test applications for your system. Otherwise, we will have no way of evaluating your submission.

These test cases should be sample MapReduce applications that we have discussed in class. Therefore, you must provide input data (or input data generation scripts), and the map and reduce functions. We should be able to run these examples without manual intervention, so it's crucial to not hardcode any paths or IP addresses.

At a minimum, you must implement the word-count and inverted-index examples. You can use books from the Project Gutenberg archive as documents.

In addition, you must also test for fault-tolerance, by killing a few processes and showing how your system responds.

Distributed systems are hard to test for correctness. One thing that can help is to log important events. This log of events is useful for debugging and evaluation, so please include the log output of your sample applications in your submission.
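
Python's standard `logging` module is one way to produce such an event log. The sketch below writes timestamped events to a per-node file; the helper name `make_event_logger`, the logger naming scheme, and the event wording are assumptions.

```python
import logging


def make_event_logger(node_name, log_path):
    """Return a logger that appends timestamped events for one node to log_path."""
    logger = logging.getLogger(f"mapreduce.{node_name}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

For example, the master might call `log.info("task_assigned task=%s worker=%s", "map-0", "w1")` when it hands out a task and `log.warning(...)` when it detects a dead worker, so that the fault-tolerance tests leave a readable trace.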

4 Report

You must clearly document all facets of your design with regard to data partitioning, fault-tolerance, and dynamic membership. Furthermore, you must also describe some of the implementation assumptions. You must carefully document all the communication between the master and workers.
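
As an example of the kind of partitioning decision worth documenting, a common choice is to hash each intermediate key to pick a reducer. The sketch below uses `zlib.crc32` because Python's built-in `hash()` for strings varies between processes by default, which would send the same key to different reducers. The function names are assumptions.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index that is identical in every process."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def split_by_reducer(pairs, num_reducers):
    """Group intermediate (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

With this scheme every pair for a given key lands in the same bucket, which is the property the reduce stage relies on.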

Note that a minimum working implementation of this assignment is fairly straightforward. Points will be awarded for clear design and implementation.