Question

1 Approved Answer

Posted on Sep 22, 2024

These questions are related to the PageRank algorithm as well as Hadoop or Hadoop like things. Problem 5. (PageRank) The math equation of one page

These questions are related to the PageRank algorithm as well as Hadoop or Hadoop like things.

image text in transcribed

Problem 5. (PageRank) The math equation of one page for PageRank is: r(Page;) = S 4 (Page) (1-) pageln Page, EI Page where r(Page)) represents the PageRank value of Page, Ipage, represents the set of pages pointing into Page. O page, represents the number of outlinks from Page,. c E (0,1) represents the decay factor, and n represents the number of web pages in the web graph. The decay factor is usually set to 0.85, i.e., c = 0.85. If we write out the equations for all the nodes, we have a system of linear equations. ... method is usually used to solve the system of linear equations. The time complexity is o .), where represents the number of iterations, and m represents the number of edges in the graph. To design the MapReduce algorithm for implementing the power iteration method for computing PageRank, we need to first figure out the Mapper and Reducer. from Page, and I to represent the Let us use 0 to represent the set of set of pointing into Page;. Each mapper will read one line of the adjacency list file. That means each mapper knows all the outlinks of a node. Mapper PageRank value of node Input Outlinks 0, of node i one row of the adjacentymatrix Output for each node) E O output Each mapper processes a single nodei. Each mapper knows the outlinks of node . For each out neighbor nodeje 0, the mapper will output a key-value pair, where the key is the index of nodej, and the value is the value 1-1. Reducer: Input > where node i El Output wherer Each reducer processes a single node/. During the shuffle processes, all the key-value pairs with the same _will be sent to the same Therefore, each reducer knows the of node). For each in-neighbor node te the reducer knows the value . The reducer willsum these values together. The reducer will then calculates the value of the PageRank equation r = c + The reducer will output a key-value pair, where the key is the index of node and the value is the value r' = Elet, Since the mapper needs to read two files, therefore we use the technique to read the PageRank value file, and use the inputFileFormat to read one line in the adjacency list file The MapReduce program will repeat this process many times in order to calculate the final PageRank values. After each iteration, the intermediate PageRank values will written into the At the beginning of each iteration, the program will read the intermediate PageRank values from the Can we use combiner in the program? If so, can we use the code of reducer as the combiner? If not, please describe the idea of combiner, Combiner: Input , a list of >, where node i El value_ Output where -= Liels Do we need to change the code of reducer if we use this combiner? Problem 6. (Skewed Distribution) In the WordCount example, we have the skewed distribution problem. The frequency of some commonly used words are much higher than that of barely used words. The tasks for processing some keys take significantly longer time than other tasks. Please briefly describe your idea for solving the skewed distribution problem such that the work loads are balanced across all the reducers. Problem 5. (PageRank) The math equation of one page for PageRank is: r(Page;) = S 4 (Page) (1-) pageln Page, EI Page where r(Page)) represents the PageRank value of Page, Ipage, represents the set of pages pointing into Page. O page, represents the number of outlinks from Page,. c E (0,1) represents the decay factor, and n represents the number of web pages in the web graph. The decay factor is usually set to 0.85, i.e., c = 0.85. If we write out the equations for all the nodes, we have a system of linear equations. ... method is usually used to solve the system of linear equations. The time complexity is o .), where represents the number of iterations, and m represents the number of edges in the graph. To design the MapReduce algorithm for implementing the power iteration method for computing PageRank, we need to first figure out the Mapper and Reducer. from Page, and I to represent the Let us use 0 to represent the set of set of pointing into Page;. Each mapper will read one line of the adjacency list file. That means each mapper knows all the outlinks of a node. Mapper PageRank value of node Input Outlinks 0, of node i one row of the adjacentymatrix Output for each node) E O output Each mapper processes a single nodei. Each mapper knows the outlinks of node . For each out neighbor nodeje 0, the mapper will output a key-value pair, where the key is the index of nodej, and the value is the value 1-1. Reducer: Input > where node i El Output wherer Each reducer processes a single node/. During the shuffle processes, all the key-value pairs with the same _will be sent to the same Therefore, each reducer knows the of node). For each in-neighbor node te the reducer knows the value . The reducer willsum these values together. The reducer will then calculates the value of the PageRank equation r = c + The reducer will output a key-value pair, where the key is the index of node and the value is the value r' = Elet, Since the mapper needs to read two files, therefore we use the technique to read the PageRank value file, and use the inputFileFormat to read one line in the adjacency list file The MapReduce program will repeat this process many times in order to calculate the final PageRank values. After each iteration, the intermediate PageRank values will written into the At the beginning of each iteration, the program will read the intermediate PageRank values from the Can we use combiner in the program? If so, can we use the code of reducer as the combiner? If not, please describe the idea of combiner, Combiner: Input , a list of >, where node i El value_ Output where -= Liels Do we need to change the code of reducer if we use this combiner? Problem 6. (Skewed Distribution) In the WordCount example, we have the skewed distribution problem. The frequency of some commonly used words are much higher than that of barely used words. The tasks for processing some keys take significantly longer time than other tasks. Please briefly describe your idea for solving the skewed distribution problem such that the work loads are balanced across all the reducers