
Question

Goal: In this assignment, we will compute PageRank scores for the web dataset provided by Google in a programming contest in 2002.

Input Format: The datasets are given as txt files. The file format is:

  • Rows 1 to 4: Metadata. They give information about the dataset and are self-explanatory.
  • Following rows: each row consists of 2 values and represents a link from the web page in the 1st column to the web page in the 2nd column. For example, the row 0 11342 means there is a directed link from page id 0 to page id 11342.
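
The file format above can be read with a few lines of Python. This is only a minimal sketch: the function name `read_edges` is hypothetical, and it assumes the two values on each row are separated by whitespace (tab or space), which the description does not state explicitly. The sample rows other than 0 11342 are made up for illustration.

```python
import io

def read_edges(lines, n_meta=4):
    """Yield (src, dst) integer pairs from an iterable of text lines.

    Assumption: the first n_meta rows are metadata and every later row
    holds two whitespace-separated page ids.
    """
    for i, line in enumerate(lines):
        if i < n_meta:
            continue  # rows 1 to 4 are metadata
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        yield int(parts[0]), int(parts[1])

# Tiny in-memory stand-in for a dataset file (4 metadata rows, 3 links).
sample = io.StringIO(
    "# meta 1\n# meta 2\n# meta 3\n# meta 4\n"
    "0\t11342\n0\t824020\n11342\t0\n"
)
edges = list(read_edges(sample))
print(edges)
```

For a real dataset, the same function can be fed an open file object, for example `with open("web-Google_10k.txt") as fh: edges = list(read_edges(fh))`.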

There are two datasets that we will work with in this assignment.

  1. web-Google_10k.txt: This dataset contains 10,000 web pages and 78323 links. The dataset can be downloaded from here. DO NOT assume that page ids are from 0 to 10,000.
  2. web-Google.txt: This dataset contains 875,713 web pages and 5,105,039 links. The dataset can be downloaded from here. DO NOT assume that page ids are from 0 to 875,713.
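
Because page ids are not guaranteed to be contiguous, one option is to collect the ids that actually occur in the links and map them to dense indices. The sketch below is one way to do that; `build_index` is a hypothetical helper name, and it only sees pages that appear in at least one link.

```python
def build_index(edges):
    """Map the page ids that occur in `edges` to indices 0..n-1."""
    ids = sorted({pid for edge in edges for pid in edge})
    index_of = {pid: i for i, pid in enumerate(ids)}
    return ids, index_of

# Ids are sparse: 3 distinct pages, but the largest id is 824020.
edges = [(0, 11342), (0, 824020), (11342, 0)]
ids, index_of = build_index(edges)
print(len(ids), index_of[824020])
```

With this mapping, per-page data can live in plain lists of length `len(ids)`, and `ids[i]` recovers the original page id when writing output.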

Also, it's helpful to test your algorithm with this toy dataset.

Output Format: the output format for each question will be specified below. There are two questions in this assignment, worth 50 points total. Use Python.

Question 1 (20 points): Find all dead ends. A node is a dead end if it has no outgoing edges or all its outgoing edges point to dead ends. For example, consider the graph A->B->C->D. All nodes A, B, C, D are dead ends by this definition. D is a dead end because it has no outgoing edge. C is a dead end because its only outgoing neighbor, D, is a dead end. B is a dead end for the same reason, and so is A.

  1. (10 points) Find all dead ends of the dataset web-Google_10k.txt. For full score, your algorithm must run in less than 15 seconds. The output must be written to a file named deadends_10k.tsv.
  2. (10 points) Find all dead ends of the dataset web-Google.txt. For full score, your algorithm must run in less than 1 minute. The output must be written to a file named deadends_800k.tsv.
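
One way to apply the dead-end definition in time linear in the number of links is to peel nodes backwards: start from nodes with no outgoing edges, then decrement the remaining out-degree of each predecessor and mark it as a dead end once that count reaches zero. The sketch below assumes the links are already available as (src, dst) pairs; `find_dead_ends` and `write_dead_ends` are hypothetical names, and the running time on the real datasets has not been measured here.

```python
from collections import defaultdict, deque

def find_dead_ends(edges):
    """Return the sorted ids of all dead ends in a directed edge list."""
    out_degree = defaultdict(int)   # remaining edges to non-dead-end nodes
    preds = defaultdict(list)       # reverse adjacency: dst -> [src, ...]
    nodes = set()
    for src, dst in edges:
        out_degree[src] += 1
        preds[dst].append(src)
        nodes.add(src)
        nodes.add(dst)

    # Nodes with no outgoing edges are dead ends by definition.
    queue = deque(n for n in nodes if out_degree.get(n, 0) == 0)
    dead = set(queue)
    while queue:
        node = queue.popleft()
        for p in preds.get(node, ()):
            out_degree[p] -= 1
            if out_degree[p] == 0:  # every out-edge of p leads to a dead end
                dead.add(p)
                queue.append(p)
    return sorted(dead)

def write_dead_ends(dead_ends, fh):
    """Write one dead-end id per row to an open text file."""
    for pid in dead_ends:
        fh.write(f"{pid}\n")

# The chain A->B->C->D from the question, with A..D encoded as 0..3.
chain = [(0, 1), (1, 2), (2, 3)]
print(find_dead_ends(chain))  # [0, 1, 2, 3]
```

Nodes on a cycle never reach a remaining out-degree of zero through that cycle, so they are not reported, which matches the definition.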

The output format for Question 1 is a single column, where each row is the id of a dead end. See here for a sample output for the toy dataset.

Question 2 (30 points): Implement the PageRank algorithm for both datasets. The taxation parameter for both datasets is 0.85 and the number of PageRank iterations is T = 10.

  1. (15 points) Run your algorithm for the web-Google_10k.txt dataset. For full score, your algorithm must run in less than 30 seconds. The output must be written to a file named PR_10k.tsv.
  2. (15 points) Run your algorithm for the web-Google.txt dataset. For full score, your algorithm must run in less than 2 minutes. The output must be written to a file named PR_800k.tsv.

The output format for Question 2 is two-column:

  • The first column is the PageRank score.
  • The second column is the corresponding web page id.

The output must be sorted by descending order of the PageRank scores. Here is a sample output for the toy dataset above.

PageRank               Ids
0.32454706832136704    0
0.3002013029682813     5
0.24391355866172854    4
0.22515097722621097    3
0.22515097722621097    2
0.22515097722621097    1
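
A minimal sketch of PageRank with taxation follows. It uses the textbook update new_rank = beta * M * rank + (1 - beta) / n with beta = 0.85 and T = 10 iterations. The names `pagerank`, `write_scores`, and `beta` are assumptions, as is the header row in the output. The question text does not say how dead ends should be treated, so this sketch gives them no special handling (their rank simply leaks away) and it is not expected to reproduce the sample scores above.

```python
import io
from collections import defaultdict

def pagerank(edges, beta=0.85, iterations=10):
    """Return (score, page_id) rows sorted by descending score."""
    out_links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)      # uniform starting vector
    for _ in range(iterations):
        # Taxation: every page receives (1 - beta) / n unconditionally.
        new_rank = dict.fromkeys(nodes, (1.0 - beta) / n)
        for src, dsts in out_links.items():
            share = beta * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    # Ties are broken by ascending page id (an arbitrary choice).
    return sorted(((s, p) for p, s in rank.items()),
                  key=lambda row: (-row[0], row[1]))

def write_scores(rows, fh):
    """Write two tab-separated columns: score, then page id."""
    fh.write("PageRank\tIds\n")   # header row is an assumption
    for score, pid in rows:
        fh.write(f"{score}\t{pid}\n")

# A 3-page cycle: by symmetry every page keeps a score close to 1/3.
cycle = [(0, 1), (1, 2), (2, 0)]
rows = pagerank(cycle)
buf = io.StringIO()
write_scores(rows, buf)
print(buf.getvalue())
```

For the real datasets, `write_scores` can be given a file opened as `PR_10k.tsv` or `PR_800k.tsv`. Whether this pure-Python loop meets the time limits on the larger dataset has not been measured here.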
