Question
Any language In this assignment, you will be designing and implementing a distributed MapReduce system. You must write code that implements MapReduce as a library
Any language
In this assignment, you will be designing and implementing a distributed MapReduce system. You must write code that implements MapReduce as a "library" that higher level applications can then use for data processing.
2 What To Implement
2.1 Processes as nodes
Nodes and tasks are implemented as conventional OS processes. Your application must work on a single OS, as well as on a cluster of machines. Thus, you can only use network-based communication. Each task must communicate with each others through the use of RPC's or through network sockets.
2.2 Data storage
Since we haven't built a distributed file system (yet), you can use a local file system for storing data. You can assume that each node/process has a private data store in form of a directory, and other processes cannot directly access this directory. Remember that all communication must be over network sockets and RPCs, so a process writing a file into another's local data store is not supported in a truly distributed system. For instance, /home/p1/ is the local disk storage of the p1 process, and the process cannot directly read other directories (/home/p2/) using file-system system calls.
2.3 Master node
The first thing your program does should be to spawn the master node. The master node then spawns the map and reducer tasks, and controls and coordinates all other processes.
2.4 API
The user should only interact with the master node through a well-defined API. You must think about what kind of an API the master should expose. For instance, it can be a long-running HTTP server, and you can interact with it through your web-browser. Or, it can be passed parameters through configuration files whose locations are known.
At a minimum, the master will need to support this external interface:
int cluster_id = init_cluster([(ip-address, port)])
run_mapred(input_data, map_fn, reduce_fn, output_location)
destroy_cluster(cluster_id)
2.5 Fault-tolerance
Your system should be able to survive process failures. You can choose either data replication or restarting the killed process. You can choose a method of convenience to get this done.
2.6 Applications
The two main applications you must implement are:
- Word-count
- Inverted index
You can download input datasets from Project Gutenberg.
In addition, you must also implement an application does multiple rounds of map-reduce. That is, the output data of the reduce stage is the input data for a new round of map-reduce processing.
2.7 Other implementation notes
Providing a makefile and configuration files and scripts for your project is a must. You will not get any points if we cannot easily compile, run, and test your code.
Similarly, you must provide a few test cases.
Avoid hard-coding anything in your program. This includes IP addresses, port numbers, file-paths, etc. Use command line input parameters, or better yet, use a configuration file. As an example, you can look at the configuration options in Hadoop.
3 Testing
It is important that you provide multiple test applications for your system. Otherwise, we will have no way of evaluating your submission.
These test cases should be sample MapReduce applications that we have discussed in class. Therefore, you must provide input-data (or input data generation scripts), and the map and reduce functions. We should be able to run these examples without manual interventionso its crucial to not hardcode any paths or ip addresses.
At a minimum, you must implement the word-count and inverted-index examples. You can use a books from the gutenberg archive as documents.
In addition, you must also test for fault-tolerance, by killing a few processes and showing how your system responds.
Distributed systems are hard to test for correctness. One thing that can help is to log important events. This log of events is useful for debugging and evaluation, so please include the log output of your sample applications in your submission.
4 Report
You must clearly document all facets of your design, with regards to: data parititioning, fault-tolerance, and dynamic membership. Furthermore, you must also describe some of the implementation assumptions. You must carefully document all the communication between the master and workers.
Note that a minimum working implementation of this assignment is fairly straight forward. Points will be awarded for clear design and implementation.
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started