Question
DSCI 617 HW 02 Instructions
General Instructions
Navigate to the Homework folder inside of your user directory in the Databricks workspace. Create a notebook
named HW inside the Homework folder.
Any set of instructions you see in this document with an orange bar to the left will indicate a place where you
should create a markdown cell. For each new problem, create a markdown cell that indicates the title of that
problem as a level header. Any set of instructions you see with a blue bar to the left will provide instructions
for creating a single code cell.
Add a markdown cell that displays the following text as a level header: DSCI 617 Homework 02. Within the
same cell, on the line below the header, add your name in bold.
Add a code cell that imports the SparkSession class and the pandas package. Use the standard alias for
pandas. Also import the punctuation string from the string library.
Add another code cell to create SparkSession and sparkContext objects named spark and sc.
Problem 1: Word Count
In the next few problems, we will work with a text file that contains the complete works of William Shakespeare.
The data file used for this problem is located at: /FileStore/tables/shakespearecomplete.txt
We will begin by loading and processing the file and tokenizing the lines into individual words.
Complete the following steps in a single code cell:
Read the contents of the file shakespearecomplete.txt into an RDD named wslines.
Create an RDD named wswords by applying the transformations described below. This will require
several uses of map and flatMap and a single call to filter. Try to chain the transformations
together so that all of these steps are completed in a single statement, which will likely span
multiple lines.
Tokenize the strings in wslines by splitting them on the characters in the following list:
:t
The resulting RDD should consist of strings rather than lists of strings. This will require
multiple separate uses of flatMap and split.
Use the Python string method strip with the punctuation string to remove common
punctuation symbols from the start and end of the tokens. Then use strip again with a
string containing the digits to remove numbers from the start and end of the tokens.
Use the Python string method replace to replace instances of the single
quote (apostrophe) with the empty string.
Convert all strings to lower case using the lower string method.
The steps above will create some empty strings (of the form '') within the RDD. Filter out
these empty strings.
Create a second RDD named distwords that contains only one copy of each word found in
wswords.
Print the number of words in wswords and the number of distinct words using the format shown
below. Add spacing so that the numbers are left-aligned.
Total Number of Words: xxxx
Number of Distinct Words: xxxx
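The tokenizing and cleaning steps above can be tried out locally before running them on an RDD. The following is a minimal sketch in plain Python, not a Spark solution: the separator list is a hypothetical stand-in (only part of the real list is legible above), the helper names split_on_all and clean are made up for illustration, and the two sample lines are invented. In Spark, each list comprehension would correspond to a flatMap, map, or filter call on the RDD.

```python
from string import punctuation

# Hypothetical separators: an assumption for illustration, not the
# assignment's actual list.
SEPARATORS = [" ", "-", ":", "\t"]

def split_on_all(lines, separators):
    """Split every string on each separator in turn and flatten the result.
    This mirrors chaining rdd.flatMap(lambda s: s.split(sep)) once per separator."""
    tokens = list(lines)
    for sep in separators:
        tokens = [piece for tok in tokens for piece in tok.split(sep)]
    return tokens

def clean(token):
    """Strip punctuation, then digits, then drop apostrophes and lowercase."""
    return (token.strip(punctuation)
                 .strip("0123456789")
                 .replace("'", "")
                 .lower())

# Invented sample input standing in for the lines of the text file.
sample_lines = ["To be, or not to be: that is the question.",
                "1 Henry's well-met!"]

# map(clean) followed by a filter that removes the empty strings.
wswords = [w for w in map(clean, split_on_all(sample_lines, SEPARATORS)) if w != ""]
distwords = set(wswords)  # local analogue of rdd.distinct()

# Pad the labels to a common width so the numbers line up on the left.
print(f"{'Total Number of Words:':<27}{len(wswords)}")
print(f"{'Number of Distinct Words:':<27}{len(distwords)}")
```

Splitting on one separator at a time is what makes several flatMap calls necessary: each call leaves a flat collection of strings for the next split to work on.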
We will now use sample to get a sense as to the types of words found in wswords.
Draw a sample from wswords using the arguments withReplacement=False and fraction.
Collect and print the results.
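When sampling without replacement, Spark treats fraction as the expected proportion of elements to keep, so the size of the result is not guaranteed to be exact. The sketch below imitates that behaviour in plain Python by keeping each element independently with probability fraction. The function name bernoulli_sample, the word list, the fraction value, and the seed are all hypothetical choices for illustration.

```python
import random

def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`.
    A local imitation of rdd.sample(withReplacement=False, fraction=...)."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

# Invented word list standing in for the contents of wswords.
words = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]

# fraction=0.3 is an arbitrary value chosen for this sketch.
sampled = bernoulli_sample(words, fraction=0.3, seed=1)
print(sampled)
```

Because each element is decided independently, two runs with different seeds will usually return samples of different sizes.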
Problem 2: Longest Words
We will now find the longest words used by Shakespeare. We will start by looking for the single longest word.
Complete the following steps in a single code cell:
Write a Python function with two parameters, both of which are intended to be strings. The function
should return the longer of the two strings. If the strings are the same length, then the function
should return the word that appears later when ordered lexicographically (alphabetically).
Use the function you wrote along with reduce to find the longest word in the RDD distwords.
Print the result.
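One possible shape for the comparison function is sketched below, using functools.reduce on a small list as a local stand-in for the RDD reduce call. The function name longer_word and the word list are hypothetical. The same function could be passed to reduce on distwords, since it is commutative and associative, which reduce on an RDD requires.

```python
from functools import reduce

def longer_word(a, b):
    """Return the longer string; on a tie in length, return the one that
    comes later in lexicographic order."""
    if len(a) != len(b):
        return a if len(a) > len(b) else b
    return max(a, b)

# Invented word list standing in for the contents of distwords.
words = ["bear", "tempest", "wolf", "kingdom"]

longest = reduce(longer_word, words)
print(longest)
```

Here "tempest" and "kingdom" have the same length, so the tie-breaking rule decides between them.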
We will now find the longest words used by Shakespeare.
Use sortBy with the Python len function to sort the elements of distwords according to their
length, with longer words appearing first. Print the first elements of this RDD.
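Sorting by length can be checked locally with the built-in sorted function, which accepts len as a key in the same way that sortBy does. This is only a sketch: the word list is invented and the number of elements to print (n = 3) is a hypothetical value.

```python
# Invented word list standing in for the contents of distwords.
words = ["bear", "tempest", "wolf", "kingdom", "crown"]

# Hypothetical number of words to show.
n = 3

# Local analogue of distwords.sortBy(len, ascending=False).take(n).
by_length = sorted(words, key=len, reverse=True)
top_n = by_length[:n]
print(top_n)
```

Note that sorted is stable, so words of equal length keep their original relative order here; the order of ties in a distributed sort may differ.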
Problem 3: Word Frequency
We will now create a frequency distribution for the words appearing in our document in order to determine
which words were used most frequently by Shakespeare.
Complete the following steps in a single code cell:
Create an RDD named pairs. This RDD should consist of tuples of the form (x, 1), where x is a
word in wswords. The RDD pairs should contain one element for each element of wswords.
Use reduceByKey to group the pairs together according to their first elements (the words),
summing together the integers stored in the second element (the 1s). This will produce an RDD with
one pair for each distinct word. The first element will be the word and the second element will be a
count for that word. Sort this RDD by the second tuple element (the count) in descending order.
Name the resulting RD
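The pair-and-reduce pattern described in this problem can be illustrated locally. The sketch below is plain Python rather than Spark: reduce_by_key is a hypothetical helper written here to imitate what reduceByKey does, and the word list is invented.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Combine the values of all pairs that share a key using `func`.
    A local imitation of rdd.reduceByKey(func)."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return list(totals.items())

# Invented word list standing in for the contents of wswords.
wswords = ["to", "be", "or", "not", "to", "be"]

# One (word, 1) tuple per element of wswords.
pairs = [(w, 1) for w in wswords]

# Sum the 1s per word, then sort by count in descending order.
# Spark analogue: pairs.reduceByKey(add).sortBy(lambda kv: kv[1], ascending=False)
counts = sorted(reduce_by_key(pairs, add), key=lambda kv: kv[1], reverse=True)
print(counts)
```

Each word contributes a 1 for every occurrence, so summing the values per key gives that word's frequency.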