Question

In this assignment you will be using Apache Spark to perform data analysis.

  • You may use Java, Scala or Python to run the Spark queries.
  • You must include screenshots of the queries and the results.
  • You may use Azure HDInsight or a local installation of the Bitnami, Hortonworks, or Cloudera Hadoop distribution for this assignment.

(1) Connect to the cluster

(2) Start a command line and ensure that the 'spark' and 'hdfs' commands are working

(3) Part 1:

  1. Put the sample data from Week 5 Paper 2 (counting the word "Sentence") into a text file:
     This is test sentence number one. This is test sentence number 2. This is test sentence number three. This is sentence no 4. sentence 5.
  2. Upload the file into HDFS
  3. Use the Spark CLI (spark-shell or pyspark) or the Zeppelin notebook for your commands/queries
  4. Run the Spark transformations and actions (for example, filter, map, reduce) to count the number of times the word "Sentence" appears in the file
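
The last step of Part 1 could be sketched in PySpark roughly as follows. This is an illustration, not the official solution: the helper names `tokenize` and `count_word_in_file` and the path `/tmp/sentences.txt` are made up here, and `sc` is assumed to be the SparkContext that the pyspark shell creates. Matching is treated as case-insensitive, because the sample text only contains the lowercase form "sentence".

```python
import re
from operator import add


def tokenize(line):
    """Split one line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", line.lower())


def count_word_in_file(sc, path, word="sentence"):
    """Count occurrences of `word` in a text file with RDD operations.

    `sc` is assumed to be an existing SparkContext (for example, the one
    the pyspark shell provides); `path` is a hypothetical HDFS location.
    """
    target = word.lower()
    return (
        sc.textFile(path)                 # transformation: one element per line
        .flatMap(tokenize)                # transformation: one element per word
        .filter(lambda w: w == target)    # transformation: keep matching words
        .map(lambda w: 1)                 # transformation: each match becomes 1
        .fold(0, add)                     # action: sum the 1s (0 if no matches)
    )
```

In the pyspark shell this might be called as `count_word_in_file(sc, "/tmp/sentences.txt")`, assuming the file was uploaded with something like `hdfs dfs -put sentences.txt /tmp/`. For the sample text above, a case-insensitive count should come out as 5. `fold` is used instead of `reduce` only so that a file with no matches returns 0 rather than raising an error.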

(4) Part 2:

  1. Using the hdfs commands, upload the Baseball data files into an HDFS folder such as /temp (or /tmp)
  2. Use the Spark command line interface (CLI) or the Jupyter or Zeppelin Notebook to answer the following questions:
    1. What is the total number of baseball players?
    2. How many players were born before 1960?
    3. How many players were born in or after 1960?
    4. How many players were born outside of the USA?
    5. How many players were born in the USA?
    6. Use the following actions and explain the output
      1. collect
      2. take(3)
      3. distinct
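
One possible way to approach the Part 2 questions with the DataFrame API is sketched below. The assignment does not describe the layout of the Baseball files, so the column names `birthYear` and `birthCountry`, the country value `USA`, and the idea that there is one row per player are all assumptions; the function names are hypothetical too. `df` is assumed to be a DataFrame already loaded from the uploaded player file.

```python
def player_filters(year=1960, home="USA"):
    """Map each question to a Spark SQL filter expression.

    Column names (birthYear, birthCountry) are assumptions about the data.
    """
    return {
        f"born before {year}": f"birthYear < {year}",
        f"born in or after {year}": f"birthYear >= {year}",
        f"born outside {home}": f"birthCountry <> '{home}'",
        f"born in {home}": f"birthCountry = '{home}'",
    }


def answer_questions(df, year=1960, home="USA"):
    """Return a dict of counts, assuming `df` holds one row per player."""
    counts = {"total players": df.count()}
    for label, condition in player_filters(year, home).items():
        counts[label] = df.filter(condition).count()
    return counts


def demo_actions(df):
    """Run collect, take(3) and distinct on the assumed birthCountry column."""
    countries = df.select("birthCountry")
    all_rows = countries.collect()   # action: every row is sent to the driver as a list
    first_three = df.take(3)         # action: only the first 3 rows are returned
    unique = countries.distinct()    # transformation: lazy, drops duplicate rows
    return all_rows, first_three, unique.count()
```

In a pyspark shell or notebook where `spark` is defined, `df` might be created with `spark.read.csv("/tmp/Master.csv", header=True, inferSchema=True)`, where the file name is only a placeholder. Two points are worth covering in the write-up: rows with a missing birth year or country match none of the comparisons, so the pairs of counts may not add up to the total; and `distinct` is technically a transformation rather than an action, so nothing is computed until an action such as `count` or `collect` follows it.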

(5) Capture screenshots of the command execution and the results

(6) Provide a write-up of the commands and the results.

IMPORTANT: Don't forget to shut down the Azure cluster.

Requirements for the assignments:

  • The assignment and write-up are due by the end of Week 7.
  • The assignment file must have a .doc or .docx extension; screenshots should be in .jpg, .gif, or .pdf format.
