Question


Posted on Sep 24, 2024


Programming Assignment I CS 5433: Big Data Management MapReduce Jobs

Part 1: Using Flume, collect two sets of Twitter data. Each dataset should be about 3 MB in size. Collect the two sets separately, using different but related keywords: for example, the word Election for the first set and Biden for the second set. Store the data in HDFS on the department's Hadoop cluster. Take screenshots of i. the files and directories where the tweet data is stored in HDFS ii. the contents of a file in HDFS that stores the tweets. Do not display all the contents; a snapshot of one file will suffice. Before you start collecting Twitter data, email the instructor the keywords you will be using. [10 marks]

Part 2: Count the number of rows in both datasets. Use MapReduce. Take a screenshot that shows the number of rows in both datasets. [10 marks]
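The counting pattern in Part 2 can be sketched in plain Python as a map step that emits one count per row and a reduce step that sums the counts per dataset. This is only an illustration under stated assumptions: it assumes one tweet per line, treats blank lines as non-rows, and uses hypothetical names (`map_rows`, `reduce_counts`, `count_rows`). It runs locally and leaves out the wiring to the cluster (for example, reading stdin and printing tab-separated pairs for Hadoop Streaming).

```python
from itertools import groupby


def map_rows(lines, dataset):
    """Map step: emit (dataset, 1) for every non-empty input line.

    Assumption: each non-empty line holds exactly one tweet (one row).
    """
    for line in lines:
        if line.strip():
            yield dataset, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key.

    The pairs must arrive sorted by key, which the shuffle phase
    guarantees in a real MapReduce job.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def count_rows(datasets):
    """Local stand-in for a job run: map each dataset, sort, then reduce.

    `datasets` maps a dataset name to an iterable of its lines.
    """
    mapped = [kv for name, lines in datasets.items()
              for kv in map_rows(lines, name)]
    return dict(reduce_counts(sorted(mapped)))
```

Calling `count_rows({"election": election_lines, "biden": biden_lines})` would return one count per dataset name. The same two steps could be reused for Part 4 by feeding them the joined output instead of the raw tweets.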

Part 3: Join the two Twitter datasets based on common information (such as democratic, republican or some other field). Use MapReduce. The join can be on any field. Take a screenshot that shows the joined dataset. [20 marks]
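One common way to express the join in Part 3 is a reduce-side join: the map step tags each record with the dataset it came from and emits it under the join key, and the reduce step pairs records from the two sides that share a key. The sketch below is a local illustration, not a prescribed solution. It assumes each line is a JSON object and that the join field is `lang`; both are assumptions, since the assignment leaves the storage format and the join field open. The function names are hypothetical.

```python
import json
from itertools import groupby


def map_tagged(lines, tag, field="lang"):
    """Map step: emit (join_key, (tag, record)) for each usable tweet.

    Assumption: one JSON object per line; `field` is an illustrative
    join field, not one the assignment prescribes.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or malformed lines
        if not isinstance(record, dict):
            continue
        key = record.get(field)
        if key is not None:
            yield str(key), (tag, record)


def reduce_join(pairs):
    """Reduce step: for each key, pair every left record with every right one.

    The pairs must arrive grouped by key, as they do after the shuffle.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        left, right = [], []
        for _, (tag, record) in group:
            (left if tag == "L" else right).append(record)
        for left_rec in left:
            for right_rec in right:
                yield key, left_rec, right_rec


def join_datasets(left_lines, right_lines, field="lang"):
    """Local stand-in for a job run: tag both inputs, sort by key, reduce."""
    mapped = list(map_tagged(left_lines, "L", field))
    mapped += list(map_tagged(right_lines, "R", field))
    mapped.sort(key=lambda kv: kv[0])  # stand-in for the shuffle phase
    return list(reduce_join(mapped))
```

This is an inner join, so keys that appear in only one dataset produce no output. A low-cardinality field such as a language code yields a very large cross product on real data, so a more selective field may be a better choice in practice.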

Part 4: Count the number of rows in the joined dataset. Use MapReduce. Take a screenshot that shows the number of rows in the joined dataset. [10 marks]

Collaboration Policy: You should complete this programming assignment individually. Any doubts or requests for clarification about the questions should be directed to the instructor or the TA. Make sure you acknowledge the web and other resources that you have used in your work.

Note:
2. All Computer Science students should write the source code in Java. Non Computer Science students may use Java or Python.
3. All the source code you submit should be well commented. [Penalty for not commenting adequately: 25%]
4. Your source code should run on the Hadoop cluster in the department. Instructions to log in and collect Twitter data using Flume are outlined in the document named Using Hadoop and Flume.pdf.
5. Submissions
a. README file for each question [FirstName_LastName_README_x]. The README file will give instructions to run your code and list the relevant files.
b. Commented source code for each question [FirstName_LastName_Program_x]
c. Report
i. One page that describes your approach
ii. One or two pages showing screenshots of results for each question. Include a figure number and a caption that explains what each figure refers to. For example: Figure 1: List of files and directories in directory where twitter data is stored in HDFS
d. All the source files zipped as a single zip file [FirstName_LastName_PA1.zip].