Question
In this assignment you will be using Apache Spark to perform data analysis.
- You may use Java, Scala or Python to run the Spark queries.
- You must include screenshots of the queries and the results.
- You may use Azure HDInsight or a local installation of the Bitnami Hadoop, Hortonworks, or Cloudera Hadoop distribution for this assignment.
(1) Connect to the cluster
(2) Start a command line and ensure that the 'spark' and 'hdfs' commands are working
(3) Part 1:
- Put the sample data from Week 5 Paper 2 (counting the word "Sentence") into a text file:
- This is test sentence number one. This is test sentence number 2. This is test sentence number three. This is sentence no 4. sentence 5.
- Upload the file into HDFS
- Use Spark CLI (spark-shell or pyspark) or the Zeppelin notebook for your commands/queries
- Run Spark transformations and actions (for example, filter, map, and reduce) to count the number of times the word "Sentence" appears in the file
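One way Part 1 could be approached in PySpark is sketched below. It is an illustration under stated assumptions, not the expected solution: the match is case-insensitive because the sample text only contains lowercase "sentence", and the helper names `tokenize` and `count_word` are made up for this sketch. The function takes an RDD of text lines, so it can be fed from `sc.textFile(...)` after the file has been uploaded with `hdfs dfs -put`.

```python
import re
from operator import add

# Assumption: "Sentence" is matched case-insensitively, since the sample
# text only ever spells it in lowercase.
WORD = "sentence"


def tokenize(line):
    """Split a line into lowercase alphanumeric words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", line.lower())


def count_word(lines_rdd, word=WORD):
    """Count occurrences of `word` in an RDD of text lines.

    flatMap, filter and map are lazy transformations; fold is the action
    that triggers the job. fold behaves like reduce with a zero value, so
    an input with no matches returns 0 instead of raising an error.
    """
    return (lines_rdd
            .flatMap(tokenize)               # one element per word
            .filter(lambda w: w == word)     # keep only the target word
            .map(lambda _: 1)                # each match counts as 1
            .fold(0, add))                   # sum the ones
```

In the pyspark shell, where `sc` is already defined, this could be called as `count_word(sc.textFile("/tmp/sentences.txt"))`; the path is hypothetical and should point at wherever the file was uploaded. For the five sample sentences the result should be 5.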
(4) Part 2:
- Using the hdfs commands, upload the Baseball data files into an HDFS folder such as /temp (or /tmp)
- Use the Spark command line interface (CLI) or the Jupyter or Zeppelin Notebook to answer the following questions:
- What is the total number of baseball players?
- How many players were born before 1960?
- How many players were born in or after 1960?
- How many players were born outside of the USA?
- How many players were born in the USA?
- Use the following actions and explain the output
- collect
- take(3)
- distinct
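The Part 2 questions could be answered along the following lines in PySpark. This is a sketch under loud assumptions: the assignment does not state the layout of the Baseball files, so the column names `birthYear` and `birthCountry`, the country value `USA`, the presence of a header row, and one row per player are all guesses to be checked against the real data. All function names are made up for this sketch.

```python
import csv

# Assumptions: column names and the country spelling are guesses.
# Replace them with the names found in the real header row.
YEAR_COL = "birthYear"
COUNTRY_COL = "birthCountry"
USA = "USA"


def parse_row(line, header):
    """Parse one CSV line into a dict keyed by the header's column names."""
    values = next(csv.reader([line]), [])  # [] for a blank line
    return dict(zip(header, values))


def birth_year(row):
    """Return the birth year as an int, or None if missing or malformed."""
    try:
        return int(row.get(YEAR_COL, "").strip())
    except ValueError:
        return None


def born_before(row, year=1960):
    y = birth_year(row)
    return y is not None and y < year


def born_in_or_after(row, year=1960):
    y = birth_year(row)
    return y is not None and y >= year


def born_in_usa(row):
    return row.get(COUNTRY_COL, "").strip() == USA


def born_outside_usa(row):
    # Rows with no recorded country are left out of both country counts.
    country = row.get(COUNTRY_COL, "").strip()
    return country != "" and country != USA


def load_rows(lines_rdd):
    """Turn an RDD of CSV lines (first line assumed to be a header) into dicts."""
    header_line = lines_rdd.first()
    header = next(csv.reader([header_line]), [])
    return (lines_rdd
            .filter(lambda line: line.strip() and line != header_line)
            .map(lambda line: parse_row(line, header))
            .cache())  # reused by several counts below


def answer_questions(rows):
    """Run one count action per question (assumes one row per player)."""
    return {
        "total players": rows.count(),
        "born before 1960": rows.filter(born_before).count(),
        "born in or after 1960": rows.filter(born_in_or_after).count(),
        "born outside the USA": rows.filter(born_outside_usa).count(),
        "born in the USA": rows.filter(born_in_usa).count(),
    }


def show_actions(rows):
    """Illustrate collect, take(3) and distinct on the parsed rows."""
    countries = rows.map(lambda r: r.get(COUNTRY_COL, "").strip())
    return {
        "take(3)": rows.take(3),                      # list of the first 3 rows
        "distinct": countries.distinct().collect(),   # each country once
        "collect size": len(rows.collect()),          # every row on the driver
    }
```

When explaining the output, it may help to note that `collect` returns every element of the RDD to the driver as a list, so it is only safe on small results; `take(3)` returns just the first three elements as a list; and `distinct` is strictly a transformation rather than an action, since it returns a new RDD without duplicates and only runs once an action such as `collect` or `count` is applied to it. In the pyspark shell the functions could be driven with `rows = load_rows(sc.textFile("/tmp/..."))`, with the elided file name filled in from the actual upload.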
(5) Capture screenshots of the command execution and the results
(6) Provide a write-up of the commands and the results.
IMPORTANT: Don't forget to shut down the Azure cluster
Requirements for the assignment:
- The assignment and write-up are due by the end of Week 7.
- The assignment file must have a .doc or .docx extension; screenshots should be in .jpg, .gif, or .pdf format