Parallel and Distributed Computing
Lab 4
PageRank
In this lab, you'll practice using the Spark "join" operation along with understanding partitioning and shuffles. You may not have to explicitly repartition anything, but you should have a good understanding of what the joins need in order to work efficiently. You will do this by implementing a version of the PageRank algorithm.
PageRank Overview
PageRank was the original ranking algorithm used at Google, developed by and named after
one of Google's cofounders, Larry Page. It works generally as follows:
1. Every page has a number of points: its PageRank score.
2. Every page votes for the pages that it links to, distributing to each page a number of points equal to the linking page's PageRank score divided by the number of pages it links to.
3. Each page then receives a new PageRank score, based on the sum of all the votes it received.
4. The process is repeated until the scores are within a tolerance or for a fixed number of iterations.
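In other words, writing rank(q) for page q's current score and out(q) for the number of distinct pages q links to, the update implied by the description above (with no damping factor) is:

new_rank(p) = sum of rank(q) / out(q) over every page q that links to p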
Your Implementation
To implement PageRank you should do the following:
Load in the webpage linking data. Eventually I want you to use the short data from a previous lab, but to start, just hardcode the following with a parallelize. It will make a nice, simple test case.
a b c
b a a
c b
d a
That is, A links to B and C, B links to A twice, C links to B, and D links to A.
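As a minimal sketch of that hardcoding, assuming a local SparkContext (the names raw and edges, and the split into (source, destination) pairs, are my choices, not something the lab mandates):

    from pyspark import SparkContext

    sc = SparkContext(appName="PageRankLab")

    # Hardcoded test data: each line is "source neighbor neighbor ..."
    raw = sc.parallelize(["a b c", "b a a", "c b", "d a"])

    # Break each line into (source, destination) pairs
    edges = raw.flatMap(lambda line: [(line.split()[0], dest) for dest in line.split()[1:]])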
Next you will need to make RDDs One for linking information and one for ranking
information.
The linking RDD, links, holds the neighbor links for each source page. Build this directly from the file or hardcoded RDD. The following is what links should look like when collected. Note that duplicate links from the same source page are removed (e.g., the "b a a" input becomes ('b', ['a'])). Note that once this RDD is built, it will never change, but it will be used numerous times in the iterative step below.

[('a', ['b', 'c']), ('b', ['a']), ('c', ['b']), ('d', ['a'])]
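One possible way to build it from the edges pairs sketched above (the distinct() call removes the duplicate b-to-a link, and cache() is there because links is reused every iteration):

    # Drop duplicate (source, dest) pairs, group destinations per source,
    # and cache since links never changes but is joined every iteration.
    links = edges.distinct().groupByKey().mapValues(list).cache()

Because groupByKey leaves links hash partitioned (and mapValues and cache preserve that partitioner), deriving the rankings RDD from links and letting reduceByKey reuse the same partitioning is what lets the per-iteration join avoid reshuffling links; that is the partitioning/shuffle behavior this lab asks you to understand.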
The other RDD, rankings, holds the source page ranking information. This will just be pairs connecting a source page to its ranking, as seen below. Initially every page should be given the same ranking. Note that for our implementation, we want the rankings to always add up to 1, like a probability distribution.

[('a', 0.25), ('b', 0.25), ('c', 0.25), ('d', 0.25)]
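A sketch of one way to initialize it, assuming the links RDD from above (so the two RDDs share keys and partitioning):

    num_pages = links.count()
    # Give every page an equal share so the initial rankings sum to 1.0
    rankings = links.mapValues(lambda _: 1.0 / num_pages)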
After the initial RDDs are set up, it is time to implement the iterative step. This can be done until the rankings stabilize or for a fixed number of iterations. We will set up our code to simply do a fixed number of iterations, and that is what we will use for the demos.
Each iteration needs to figure out the contribution each source page makes to its neighbor links. This was step 2 outlined above: "Every page votes for the pages that it links to, distributing to each page a number of points equal to the linking page's PageRank score divided by the number of pages it links to." This sounds like a mapping. And then we need to do step 3: "Each page then receives a new PageRank score, based on the sum of all the votes it received." This sounds like a reduction.
However, in order to do the mapping, note that you will need both the neighbor links and the page's ranking. Those are in different RDDs, and you can't just pass one RDD as a parameter to the other RDD's mapping function. Thus, you will need to join the RDDs together to form a temporary RDD you can map over. For our demo example, that join looks like:

[('b', (['a'], 0.25)), ('c', (['b'], 0.25)), ('a', (['b', 'c'], 0.25)), ('d', (['a'], 0.25))]
Then, after mapping to calculate the neighbor contributions, the results should look like:

[('a', 0.25), ('b', 0.25), ('b', 0.125), ('c', 0.125), ('a', 0.25)]
And finally, reducing to combine the individual neighbor contributions:

[('b', 0.375), ('c', 0.125), ('a', 0.5)]
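Pulling the join, the contribution mapping, and the reduction together, the loop might look roughly like the sketch below; NUM_ITERATIONS and the contributions helper are placeholder names of my own, not names the lab requires:

    NUM_ITERATIONS = 10  # placeholder for whatever fixed count you use

    def contributions(pair):
        # pair is (source, (neighbor_list, rank)) coming out of the join
        neighbors, rank = pair[1]
        share = rank / len(neighbors)
        return [(dest, share) for dest in neighbors]

    for i in range(NUM_ITERATIONS):
        joined = links.join(rankings)                        # (source, (neighbors, rank))
        contribs = joined.flatMap(contributions)             # one (dest, vote) per link
        rankings = contribs.reduceByKey(lambda a, b: a + b)  # sum the votes per page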
Note that these are our new rankings for the next iteration. We can throw the old ones away and simply use the new ones calculated from the neighbor contributions. Also note that any page that didn't have any links going into it gets dropped from this list. It will still exist as a page in the links RDD, and we can just infer its ranking is 0.
As a final step, after the iterations, you should then sort your rankings by ranking, highest to
lowest.
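For example, one way to do that sort (descending on the rank value):

    final_rankings = rankings.sortBy(lambda page_rank: page_rank[1], ascending=False).collect()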
Demo
As stated above, it is best if you test this with hardcoded data to understand how it works. It
will run much faster if you only display the final results and not the intermediate
steps. However, I want to see the intermediate steps in your output. So for the hardcoded
demo your code should produce the following:
Initial links: [('a', ['b', 'c']), ('b', ['a']), ('c', ['b']), ('d', ['a'])]
Initial rankings: [('a', 0.25), ('b', 0.25), ('c', 0.25), ('d', 0.25)]
Iteration 1:
Joined RDD: [('b', (['a'], 0.25)), ('c', (['b'], 0.25)), ('a', (['b', 'c'], 0.25)), ('d', (['a'], 0.25))]
Neighbor contributions: [('a', 0.25), ('b', 0.25), ('b', 0.125), ('c', 0.125), ('a', 0.25)]
New rankings: [('b', 0.375), ('c', 0.125), ('a', 0.5)]
Iteration 2:
Joined RDD: [('b', ...