2. Project Abstract
Instructor: Prof. Reddy Raja
Mentor: Ms. M. Padmini
To implement the PageRank algorithm using MapReduce for Wikipedia and verify it on smaller data sets.
19. Algorithm
Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page's importance from the votes cast for it, and the importance of each vote is taken into account when a page's PageRank is calculated.
30. PageRank Equation (Enhancement)
A solution for cycles, and for the case where a random surfer gets bored.
Here 'd' is known as the damping factor. It represents the probability, at any step, that the person will continue surfing. The value of 'd' is typically kept at 0.85.
32. In Other Words
Put more simply:
a page's PageRank = 0.15/N + 0.85 * (the sum of a "share" of the PageRank of every page that links to it)
where "share" = the linking page's PageRank divided by the number of outbound links on that page, and N = the number of documents in the collection.
The equation shows clearly how a page's PageRank is arrived at. What isn't immediately obvious is that it can't work if the calculation is done just once.
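The simplified formula above can be tried on a single page. The following is a minimal sketch, not the project's code: the function name `pagerank_of`, the three-page graph, and the uniform starting value of 1/N are all assumptions made for illustration.

```python
def pagerank_of(page, ranks, outlinks, d=0.85):
    """Apply the simplified formula once to a single page."""
    n = len(ranks)
    # Sum the "share" contributed by every page that links to `page`:
    # the linking page's PageRank divided by its number of outbound links.
    shares = sum(ranks[src] / len(targets)
                 for src, targets in outlinks.items()
                 if page in targets)
    return (1 - d) / n + d * shares

# Hypothetical three-page graph: A -> B, C;  B -> C;  C -> A
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / 3 for page in outlinks}   # assumed starting values

new_c = pagerank_of("C", ranks, outlinks)
print(new_c)   # 0.15/3 + 0.85 * (1/6 + 1/3), about 0.475
```

The result depends on the starting values chosen for A and B, which is why a single pass is not enough.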
33. PageRank Equation, as per the published paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
34. Issues in the Original Formula
The formula given in Page and Brin's paper does not support the statement that "the sum of all PageRanks is one". To support the statement, the formula is modified as:
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where N = the number of documents in the collection.
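The difference between the two formulas can be checked numerically. The sketch below is illustrative only: the helper `iterate`, the three-page graph (chosen so that every page has at least one outlink), and the fixed number of rounds are assumptions, not part of the project.

```python
def iterate(outlinks, d, base, start, rounds=100):
    """Repeatedly apply PR(A) = base + d * sum(PR(T)/C(T)) to every page."""
    ranks = {page: start for page in outlinks}
    for _ in range(rounds):
        new = {page: base for page in outlinks}
        for src, targets in outlinks.items():
            for target in targets:
                new[target] += d * ranks[src] / len(targets)
        ranks = new
    return ranks

# Hypothetical graph with no dangling pages: A -> B, C;  B -> C;  C -> A
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
n, d = len(outlinks), 0.85

original = iterate(outlinks, d, base=1 - d, start=1.0)            # (1-d) term
modified = iterate(outlinks, d, base=(1 - d) / n, start=1.0 / n)  # (1-d)/N term

print(sum(original.values()))   # close to N (here 3)
print(sum(modified.values()))   # close to 1
```

On this graph the original formula yields ranks that sum to N, while the modified one yields ranks that sum to one.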
43. Brief Description of Project (Contd.)
Output: The output file consists of records containing the URL of the page (fromUrl), the PageRank value of the page (PRValue), and the list of URLs to which the page points (ToUrlList).
[Figure: FinalOutput.txt, with fields fromUrl, PRValue, ToUrlList]
44. Brief Description of Project: Modules
Module 1: Converter
Module 2: PageRank Calculator
Module 3: Output Analyzer
[Figure: pipeline from the Web Graph through the Converter, the PageRank Calculator (iterated until convergence), and the Output Analyzer; other labels: Create Index, Search Engine]
53. Module 1: Converter Issues
Self loops: handled by comparing the FromUrl with the ToUrl before sending the pair to the reduce function.
Dangling pages: handled by initializing their PRValue to 1/N and leaving their list of ToUrls blank.
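The two Converter rules can be sketched in plain Python, outside any MapReduce framework. The function name `convert`, the edge-list input, and the choice to give every page (not only dangling ones) an initial PRValue of 1/N are assumptions made for this illustration.

```python
def convert(edges):
    """Build {url: (initial_rank, out_list)} from (from_url, to_url) pairs."""
    out_lists = {}
    for from_url, to_url in edges:
        # Make sure both endpoints get a record, so that a page with no
        # outlinks (a dangling page) still appears with an empty list.
        out_lists.setdefault(from_url, [])
        out_lists.setdefault(to_url, [])
        if from_url == to_url:
            continue                      # self loop: drop the edge
        out_lists[from_url].append(to_url)
    n = len(out_lists)
    # Assumption: every page starts with PRValue = 1/N.
    return {url: (1.0 / n, targets) for url, targets in out_lists.items()}

# Hypothetical input: one self loop (a -> a) and one dangling page (c)
records = convert([("a", "b"), ("a", "a"), ("b", "c")])
for url, (rank, targets) in records.items():
    print(url, rank, targets)
```

Here `a` keeps only its link to `b`, and `c` is emitted with an empty ToUrl list.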
62. Module 2: PageRank Calculator
Map:
  Input: key: index.html, value: PRValue, OutList: <1.html 2.html ...>
  Output:
    1. For each outlink: key: "1.html", value: PRValue / ListLength (vote share)
    2. The page itself: key: index.html, value: <OutList>
Reduce:
  Input: key: "1.html"
    Value: 0.5 23
    Value: 0.24 2
    ...
    Value: UrlList <OutLink>
  Output: key: "1.html", value: "<new pagerank> <OutList> 1.html 2.html..."
Start with the initial PageRank and outlinks of a document.
63. Module 2: PageRank Calculator (Contd.)
For each outlink, output the share of the page's PageRank that it receives, and also output the page's list of outlinks.
64. Module 2: PageRank Calculator (Contd.)
Now the reducer has the URL of a document, the PageRank shares from all of that document's inlinks, and its list of outlinks.
65. Module 2: PageRank Calculator (Contd.)
Compute the new PageRank and output it in the same format as the input.
66. Module 2: PageRank Calculator (Contd.)
Now iterate until convergence (determined by the precision value).
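The map and reduce steps walked through above can be simulated in a single process. This is a sketch under stated assumptions, not the project's Hadoop code: `map_page`, `reduce_page`, and `run_iteration` are hypothetical names, the shuffle phase is imitated with a dictionary, and the reducer uses the modified formula with the (1-d)/N term.

```python
from collections import defaultdict

D = 0.85  # damping factor

def map_page(url, rank, out_list):
    """Emit one vote share per outlink, then the page's own outlink list."""
    for target in out_list:
        yield target, rank / len(out_list)   # vote share
    yield url, list(out_list)                # lets the reducer re-emit the list

def reduce_page(url, values, n):
    """Sum the vote shares for `url` and recover its outlink list."""
    out_list, total = [], 0.0
    for value in values:
        if isinstance(value, list):
            out_list = value
        else:
            total += value
    return url, (1 - D) / n + D * total, out_list

def run_iteration(records):
    """One map/shuffle/reduce pass; output has the same shape as the input."""
    grouped = defaultdict(list)              # stands in for the shuffle phase
    for url, (rank, out_list) in records.items():
        for key, value in map_page(url, rank, out_list):
            grouped[key].append(value)
    n = len(records)
    result = {}
    for url, values in grouped.items():
        _, new_rank, out_list = reduce_page(url, values, n)
        result[url] = (new_rank, out_list)
    return result

# Hypothetical three-page input: A -> B, C;  B -> C;  C -> A
records = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
after = run_iteration(records)
print(after)
```

Because the output records have the same format as the input, `run_iteration` can be applied to its own result for the next pass. Note that in this sketch a dangling page contributes no vote shares, so its rank mass is not redistributed.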
67. Module 2: PageRank Calculator Issues: Catch-22 Situation
Suppose we have two pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:
Step 1: Calculate page A's PageRank from the value of its inbound links.
Step 2: Calculate page B's PageRank from the value of its inbound links.
We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's. Thus the PageRanks of A and B computed in a single pass will be inaccurate.
68. Module 2: PageRank Calculator Issues: Catch-22 Situation (Solution)
This problem is overcome by repeating the calculation many times. Each pass produces slightly more accurate values. In fact, total accuracy can never be achieved, because each pass is always based on the inexact values from the previous one. The number of iterations should be sufficient to reach a point where further iterations wouldn't change the values enough to matter.
=> Use a "delta function" that keeps track of the change in every page's PageRank; once the change for all pages is less than the value specified by the user, the iterations can be stopped.
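The stopping rule can be illustrated on the two-page case of A and B linking to each other. The sketch below is one possible reading of the "delta function": it takes the largest absolute change over all pages and compares it with a user-supplied precision. The names `step` and `iterate_until_converged`, the deliberately uneven starting values, and the `max_iters` safety cap are assumptions.

```python
def step(ranks, outlinks, d=0.85):
    """One full pass of the modified PageRank formula over every page."""
    n = len(ranks)
    new = {page: (1 - d) / n for page in ranks}
    for src, targets in outlinks.items():
        for target in targets:
            new[target] += d * ranks[src] / len(targets)
    return new

def iterate_until_converged(ranks, outlinks, precision=1e-6, max_iters=1000):
    """Repeat passes until no page's PageRank changes by `precision` or more."""
    for i in range(1, max_iters + 1):
        new = step(ranks, outlinks)
        # "Delta": the largest change in PageRank over all pages.
        delta = max(abs(new[page] - ranks[page]) for page in ranks)
        ranks = new
        if delta < precision:
            return ranks, i
    return ranks, max_iters

# Two pages that link only to each other, with uneven starting guesses.
outlinks = {"A": ["B"], "B": ["A"]}
ranks, iterations = iterate_until_converged({"A": 0.9, "B": 0.1}, outlinks)
print(ranks, iterations)
```

Both values approach 0.5 as the passes repeat; a smaller `precision` trades more iterations for values closer to that limit.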