Pagerank on Mapreduce (With Example)
Song Liu, sl9885@bristol.ac.uk


Pagerank is a Good Thing.

It can indicate the internal authority of each node in a network by investigating the relationships
between them.

The original formula can be seen here:
http://en.wikipedia.org/wiki/PageRank

For simplicity, it can be written this way: Rv = beta * Sigma(Rn/Nn) + (1-beta)/N

where Rv is the Pagerank of PageV, Rn is the Pagerank of a PageN which links to PageV, and
Nn is the total number of outlinks of PageN. Summing Rn/Nn over all inlinks of PageV gives the
raw rank of PageV. Two parameters then normalize this rank: beta, which is called the
"damping coefficient", and N, the total number of webpages.

Again, for simplicity, we don't apply the full normalization in the pseudo code below: it keeps
the factor beta but leaves out the (1-beta)/N term.
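
As a quick check of the full formula, here is a minimal Python sketch that evaluates it for a single page. The function name rank_of_page, the value beta = 0.85 and the example numbers are assumptions made for illustration; none of them come from the original text.

```python
def rank_of_page(inlinks, beta, total_pages):
    """Evaluate Rv = beta * Sigma(Rn/Nn) + (1 - beta)/N for one page.

    inlinks: list of (Rn, Nn) pairs, one per page that links to PageV.
    """
    contribution = sum(rn / nn for rn, nn in inlinks)
    return beta * contribution + (1.0 - beta) / total_pages


# Hypothetical numbers: PageV has two inlinks in a 4-page network.
# One comes from a page with rank 0.5 and 2 outlinks, the other from
# a page with rank 0.5 and 1 outlink.
rv = rank_of_page([(0.5, 2), (0.5, 1)], beta=0.85, total_pages=4)
print(round(rv, 4))
```

Here Sigma(Rn/Nn) = 0.25 + 0.5 = 0.75, so Rv = 0.85 * 0.75 + 0.15/4.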


Let's Parallelize It

We assume the input follows the logical pattern below:

PageA -> Page1, Page2, Page3...
PageB -> page1', Page2', Page3'...

Before the first iteration, we need to assign an initial rank to each webpage. In my experience,
0.5 is a good value for a small website or network. In a real calculation, this assignment can be
done by a single-pass mapreduce procedure, after which the input has the following
structure:

<PageA, 0.5> -> Page1, Page2, Page3...
<PageB, 0.5> -> page1', Page2', Page3'...
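
The initialization pass could be sketched in Python as follows. This is only an illustration: the text-line format "Page -> Page, Page" is an assumed encoding of the logical pattern above, and the names assign_initial_rank and INITIAL_RANK are hypothetical.

```python
INITIAL_RANK = 0.5  # the starting value suggested in the text


def assign_initial_rank(line):
    """Turn 'PageA -> Page1, Page2' into ((page, rank), [outlinks])."""
    page, _, targets = line.partition("->")
    outlinks = [t.strip() for t in targets.split(",") if t.strip()]
    return (page.strip(), INITIAL_RANK), outlinks


# A hypothetical three-page network.
raw = ["PageA -> PageB, PageC", "PageB -> PageC", "PageC -> PageA"]
records = [assign_initial_rank(line) for line in raw]
for (page, rank), outlinks in records:
    print(f"<{page}, {rank}> -> {', '.join(outlinks)}")
```

Each record depends only on its own input line, which is why this step fits a map-only pass.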

This data structure lists the outlinks of PageA, PageB... But Pagerank is concerned with the
inlinks, so the first thing the Mapper function needs to do is invert this data structure.
Second, the factor Rn/Nn can also be calculated easily at this stage for each webpage.
>>>>>Pseudo Code Starts<<<<<
function Mapper
input <PageN, RankN> -> PageA, PageB, PageC...
begin
     Nn := the number of outlinks for PageN;
     for each outlink PageK
           output PageK-> <PageN, RankN/Nn>
     // output the outlinks for PageN.
     output PageN-> PageA, PageB, PageC…
end
>>>>>Pseudo Code Ends<<<<<

So after the Mapper function, we have the following data structure:
PageK-> <PageN1, RankN1/Nn1>
           <PageN2, RankN2/Nn2>
           ...
           <PageAk, PageBk, PageCk…>

where PageN1, PageN2... are the pages that link to PageK (its inlinks), and PageAk, PageBk… are
its outlinks. The outlinks are kept because we need them to rebuild the input format for the next
iteration.
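
The Mapper pseudo code above might look like this in Python. It is a sketch, not the author's implementation: the "rank" and "links" tags are a hypothetical way to tell the two kinds of values apart, and skipping pages without outlinks is an added assumption (the pseudo code would divide by zero for them).

```python
def mapper(page, rank, outlinks):
    """Yield (key, value) pairs for one input record.

    Values tagged "rank" carry one inlink contribution; the value tagged
    "links" carries the page's own outlink list so the reducer can
    rebuild the input format.
    """
    if outlinks:  # assumption: a page without outlinks contributes nothing
        share = rank / len(outlinks)
        for target in outlinks:
            yield target, ("rank", page, share)
    yield page, ("links", outlinks)


pairs = list(mapper("PageA", 0.5, ["PageB", "PageC"]))
for key, value in pairs:
    print(key, value)
```

The framework's shuffle step then groups these pairs by key, which produces the per-page structure shown above.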

The job of the Reducer function is to update the pagerank of PageK, using the ranks of its inlinks.

>>>>>Pseudo Code Starts<<<<<
function Reducer
input PageK-> <PageN1, RankN1/Nn1>
                <PageN2, RankN2/Nn2>
                ...
                <PageAk, PageBk, PageCk…> //outlinks of PageK
begin
     RankK := 0;
     for each inlink PageNi
           RankK += RankNi/Nni * beta
     // output the PageK and its new Rank for the next iteration
     output <PageK, RankK> -> <PageAk, PageBk, PageCk…>
end
>>>>>Pseudo Code Ends<<<<<
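
A possible Python version of the Reducer is sketched below. The tagged value tuples ("rank", source, share) and ("links", outlinks) are a hypothetical encoding of the grouped input, and BETA = 0.85 is an assumed value, since the text does not give one.

```python
BETA = 0.85  # assumed damping coefficient; the text gives no value


def reducer(page, values, beta=BETA):
    """Combine all values grouped under one page into its next record."""
    new_rank = 0.0
    outlinks = []
    for value in values:
        if value[0] == "rank":
            _, _source, share = value
            new_rank += share * beta
        else:  # the "links" value carrying this page's outlinks
            outlinks = value[1]
    return (page, new_rank), outlinks


# Hypothetical grouped input for PageC: two inlinks and its own outlinks.
grouped = [("rank", "PageA", 0.25), ("rank", "PageB", 0.5), ("links", ["PageA"])]
record = reducer("PageC", grouped)
print(record)
```

The returned record has the same shape as the Mapper input, so it can be fed straight into the next iteration.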

It's almost done! All you need to do is run this Map-Reduce procedure iteratively until
convergence. To my knowledge, 15 iterations are enough for small cases, but larger cases may take
hundreds of iterations.
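
The iteration can be sketched as a small in-memory driver in Python. Everything named here (run_iteration, pagerank, TOLERANCE, MAX_ITERATIONS, BETA = 0.85, the toy graph) is an assumption for illustration. Note one deliberate difference from the pseudo code: this sketch puts the (1-beta)/N term from the formula back in, because with beta below 1 and no additive term the total rank shrinks on every pass.

```python
from collections import defaultdict

BETA = 0.85            # assumed damping coefficient
TOLERANCE = 1e-6       # assumed convergence threshold
MAX_ITERATIONS = 200   # assumed safety limit


def run_iteration(ranks, graph, beta=BETA):
    """One in-memory map, shuffle and reduce pass over the whole graph."""
    # Map + shuffle: group every inlink share under its target page.
    shares = defaultdict(list)
    for page, outlinks in graph.items():
        for target in outlinks:
            shares[target].append(ranks[page] / len(outlinks))
    # Reduce: sum the shares of each page, scale by beta, add (1-beta)/N.
    base = (1.0 - beta) / len(graph)
    return {page: beta * sum(shares[page]) + base for page in graph}


def pagerank(graph, initial_rank=0.5):
    """Repeat passes until the largest rank change drops below TOLERANCE."""
    ranks = {page: initial_rank for page in graph}
    iteration = 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        new_ranks = run_iteration(ranks, graph)
        change = max(abs(new_ranks[p] - ranks[p]) for p in graph)
        ranks = new_ranks
        if change < TOLERANCE:
            break
    return ranks, iteration


# A hypothetical three-page network in which every page has outlinks.
graph = {"PageA": ["PageB", "PageC"], "PageB": ["PageC"], "PageC": ["PageA"]}
ranks, iterations = pagerank(graph)
print(iterations, ranks)
```

On a real cluster each pass would be a separate Map-Reduce job whose output becomes the input of the next one; the loop above only mimics that flow in a single process.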

Let’s Taste It

Here are some of the top pageranks for the New York Times website as of today (23/Feb); I
crawled only the world news. There are around 14000 pages in total.
The statistics are interesting, and you will likely find the same distribution pattern if you
rank more websites and plot the results.

http://www.nytimes.com/    1307.420286
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html    834.3881531
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl    372.152869
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp    119.3507215
http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html    81.86431592
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=all    70.64134337
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2    60.43903422
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw    59.92536076
http://www.nytimes.com/2010/02/23/world/asia/23afghan.html?hpw    59.92536076
http://www.nytimes.com/2010/02/23/world/asia/23islamabad.html?hp    59.92536076
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw    59.92536076
http://www.nytimes.com/2010/02/22/world/americas/22gays.html    47.20198605
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?bl=&pagewanted=all    41.90279295
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html    10.56957042
http://www.nytimes.com/2010/02/23/world/asia/23afghan.html    10.1774012
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html    10.02345548
http://www.nytimes.com/2010/02/24/world/asia/24afghan.html?hp=&pagewanted=all    9.718268616
http://www.nytimes.com/2010/02/17/world/asia/17afghan.html?fta=y    7.693364285
http://www.nytimes.com/2010/02/20/world/asia/20drones.html?fta=y    7.615689791
http://www.nytimes.com/2002/07/07/world/afghan-official-is-assassinated-blow-to-karzai.html    6.967736047
http://www.nytimes.com/2010/01/08/world/asia/08afghan.html?fta=y    6.29221417
http://www.nytimes.com/2010/01/07/world/asia/07afghan.html?fta=y    5.909407591
http://www.nytimes.com/2010/02/22/world/americas/22gays.html?pagewanted=all    5.799193096
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?hpw=&pagewanted=all    5.657414027
http://www.nytimes.com/2010/02/23/world/middleeast/23mideast.html?hpw=&pagewanted=all    5.534888339
http://www.nytimes.com/2010/02/23/world/asia/23taliban.html?ref=asia    5.3387006
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=2&hpw    4.840375361
http://www.nytimes.com/2010/02/23/world/americas/23amputee.html?pagewanted=1    4.816480315
http://www.nytimes.com/2010/02/16/world/asia/16intel.html    4.671784654
http://www.nytimes.com/2010/02/15/world/asia/15afghan.html?fta=y    4.292737953

Here is the rank distribution (only top 30):

[Figure: rank distribution of the top 30 pages]

Tools

The ranks plotted above were calculated by a small search engine which includes the Pagerank
calculation on Mapreduce. Fortunately, it seems to work fine on the Bluecrystal cluster.
Further usage details can be found on its website: http://joycrawler.googlecode.com
Or feel free to email me: sl9885@bristol.ac.uk
