
Parallel and Distributed Computing
Lab 4
PageRank
In this lab, you'll practice using the Spark "join" operation along with understanding
partitioning/shuffles. You may not have to explicitly repartition anything, but you should have
a good understanding of what the joins need in order to work efficiently. You will do this by
implementing a version of the PageRank algorithm.
PageRank Overview
PageRank was the original ranking algorithm used at Google, developed by and named after
one of Google's co-founders, Larry Page. It works generally as follows:
1. Every page has a number of points: its PageRank score.
2. Every page votes for the pages that it links to, distributing to each page a number of points
equal to the linking page's PageRank score divided by the number of pages it links to.
3. Each page then receives a new PageRank score, based on the sum of all the votes it
received.
4. The process is repeated either until the scores converge within a tolerance or for a fixed
number of iterations.
Your Implementation
To implement PageRank you should do the following:
Load in the webpage linking data. Eventually I want you to use the short data from Lab 3, but
to start, just hardcode the following with a parallelize. It will make a nice simple test case.
["a b c","b a a","c b","d a"]
That is, A links to B and C. B links to A twice. C links to B. And D links to A.
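As a rough sketch (assuming the SparkContext is available as sc, as in the earlier labs):

    # Hardcoded test data: each string is "source neighbor1 neighbor2 ...".
    lines = sc.parallelize(["a b c", "b a a", "c b", "d a"])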
Next you will need to make 2 RDDs. One for linking information and one for ranking
information.
The linking RDD, links, holds the neighbor links for each source page. Build this directly from
the file (or hardcoded) RDD. The following is what links should look like when collected. Note
that duplicate links from the same source page are removed (e.g. "b a a"). Note that once this
RDD is built, it will never change, but it will be used numerous times in the iterative step below.
[('a',['b','c']),('b',['a']),('c',['b']),('d',['a'])]
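One way to build it is the following sketch (the name lines refers to the hardcoded RDD above;
using set() is one way to drop the duplicate neighbors). Since links is reused on every
iteration, caching it is a reasonable choice:

    # Split "source n1 n2 ..." into (source, [unique neighbors]).
    # set() removes duplicate links, such as the repeated 'a' in "b a a".
    links = lines.map(str.split) \
                 .map(lambda parts: (parts[0], list(set(parts[1:])))) \
                 .cache()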
The other RDD, rankings, holds the source page ranking information. This will just be pairs
connecting each source page to its ranking, as seen below. Initially every page should be given
the same ranking. Note that for our implementation, we want the rankings to always add up
to 1, like a probability distribution.
[('a',0.25),('b',0.25),('c',0.25),('d',0.25)]
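A minimal sketch, deriving the page names and the initial 1/N rank from links so that the
rankings sum to 1:

    # Every page starts with an equal share of the total rank of 1.
    num_pages = links.count()
    rankings = links.map(lambda kv: (kv[0], 1.0 / num_pages))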
After the initial RDDs are set up, it is time to implement the iterative step. This can be done
until the rankings stabilize or for a fixed number of iterations. We will set up our code to simply
do a fixed number of iterations; 10 iterations is the number we will be using for the demos.
Each iteration needs to figure out the contribution each source page makes to its neighbor
links. This was step 2 outlined above: every page votes for the pages that it links to,
distributing to each page a number of points equal to the linking page's PageRank score divided
by the number of pages it links to. This sounds like a mapping. Then we need to do step 3:
each page receives a new PageRank score, based on the sum of all the votes it received. This
sounds like a reduction.
However, in order to do the mapping, note that you will need both the neighbor links and the
page's ranking. Those are in different RDDs. And you can't just pass one RDD as the parameter
to the other RDD's mapping function. Thus, you will need to join the RDDs together to form a
temporary RDD you can map over. For our demo example, that join looks like:
[('b',(['a'],0.25)),('c',(['b'],0.25)),('a',(['b','c'],0.25)),('d',(['a'],0.25))]
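In code, that temporary RDD is just the result of a join, sketched as:

    # (page, (neighbors, rank)) -- the neighbor list and current rank
    # for each source page now travel together in one RDD.
    joined = links.join(rankings)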
Then after mapping to calculate the neighbor contributions, the results should look like:
[('a',0.25),('b',0.25),('b',0.125),('c',0.125),('a',0.25)]
And finally reducing to combine the individual neighbor contributions:
[('b',0.375),('c',0.125),('a',0.5)]
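A sketch of the map and reduce for one iteration (flatMap fits here, since each source page
emits one contribution per neighbor):

    # Step 2: each page splits its rank evenly among its neighbors.
    contributions = joined.flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])

    # Step 3: sum the votes arriving at each page.
    rankings = contributions.reduceByKey(lambda a, b: a + b)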
Note that these are our new rankings for the next iteration. We can throw the old ones away
and simply use the new ones calculated from the neighbor contributions. Also note that any
page that didn't have any links going into it gets dropped from this list. It will still exist as a
page in the links RDD and we can just infer its ranking is 0.
As a final step, after the 10 iterations, you should then sort your rankings by ranking, highest to
lowest.
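Putting the pieces together, the driver might look roughly like the following sketch (the
intermediate prints required for the demo below are omitted here, and the variable names are
just illustrative):

    NUM_ITERATIONS = 10

    for i in range(NUM_ITERATIONS):
        joined = links.join(rankings)
        contributions = joined.flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        rankings = contributions.reduceByKey(lambda a, b: a + b)

    # Sort by rank, highest to lowest.
    final = rankings.sortBy(lambda kv: kv[1], ascending=False)
    print(final.collect())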
Demo
As stated above, it is best if you test this with hardcoded data to understand how it works. It
will run much faster if you only display the final results and not the intermediate
steps. However, I want to see the intermediate steps in your output. So for the hardcoded
demo your code should produce the following:
Initial links: [('a',['b','c']),('b',['a']),('c',['b']),('d',['a'])]
Initial rankings: [('a',0.25),('b',0.25),('c',0.25),('d',0.25)]
Iteration: 0
Joined RDD: [('b',(['a'],0.25)),('c',(['b'],0.25)),('a',(['b','c'],0.25)),('d',(['a'],0.25))]
Neighbor contributions: [('a',0.25),('b',0.25),('b',0.125),('c',0.125),('a',0.25)]
New rankings: [('b',0.375),('c',0.125),('a',0.5)]
Iteration: 1
Joined RDD: [('b',(['
