A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/)
Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)
5. Larry Page and Sergey Brin
• Created a web-search algorithm called “Backrub”
• Spun off a company, “Googol”, based on the paper
• The importance of a page is determined by the importance of pages that link to it.
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” TR, Stanford InfoLab, 1999
6. A websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit over 200 measures to human evaluations
5. Produce rankings
6. Continuously update
7. Pages, nodes, incoming links, outgoing links, and “importance”
[Figure: pages a, b, and c are “important” pages that link to me; a second example shows “important” pages that link to Purdue.]
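12. A wee web-graph: link counting is too easy to game!
[Figure: a six-node web graph. Node 1 links to nodes 2, 3, and 4 (weight 1/3 each); node 2 links to nodes 3 and 6 (weight 1/2 each); node 3 links to node 4; nodes 4 and 5 link to each other; node 6 has no outgoing links.]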
13. A wee web-graph: link counting is too easy to game!
The importance of a page is determined by the importance of pages that link to it.
\[
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2
\end{aligned}
\]
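14. The importance of a page is determined by the importance of pages that link to it
\[
x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j
\]
x_i: “importance” of page i
x_j: “importance” of page j
B_i: back-links of page i, the pages that link to it. Why it was called Backrub!
d_j: number of links page j uses (out-degree in graph theory)
Example: page 3 is linked from page 1 (weight 1/3) and page 2 (weight 1/2), so
\[
x_3 = \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2
\]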
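The back-link sum can be sketched in a few lines of Python. This is a minimal sketch, not the production algorithm: the `out_links` dictionary encodes the six-page example graph, and the function names and the all-ones scores are assumptions made only for illustration.

```python
from fractions import Fraction

# Out-links of the six-page example graph (page -> pages it links to).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}

def back_links(i):
    """B_i: the pages j that link to page i."""
    return [j for j, targets in out_links.items() if i in targets]

def importance(i, x):
    """Right-hand side of x_i = sum over j in B_i of x_j / d_j."""
    return sum(x[j] / len(out_links[j]) for j in back_links(i))

# Hypothetical scores, all equal to 1, chosen only to make the arithmetic visible.
x = {j: Fraction(1) for j in out_links}

print(back_links(3))     # [1, 2]
print(importance(3, x))  # 5/6, that is 1/3 * x1 + 1/2 * x2
```

Exact fractions are used so the 1/3 and 1/2 weights from the slide show up unrounded.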
15. We can rewrite this equation in a more mathematically convenient way
\[
\begin{aligned}
x_1 &= 0 x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_2 &= \tfrac{1}{3} x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_4 &= \tfrac{1}{3} x_1 + 0 x_2 + 1 x_3 + 0 x_4 + 1 x_5 + 0 x_6 \\
x_5 &= 0 x_1 + 0 x_2 + 0 x_3 + 1 x_4 + 0 x_5 + 0 x_6 \\
x_6 &= 0 x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6
\end{aligned}
\]
16. And even more conveniently!
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
\quad \text{or} \quad x = Px
\]
Element k in column m = "probability" of going from node m to node k
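To make the matrix concrete, here is a small sketch that builds P for the six-page graph from its link lists. The nested-list layout and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# Out-links of the six-page example graph (page -> pages it links to).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}
n = len(out_links)

# P[k][m] = 1/d_m if page m+1 links to page k+1, else 0 (0-based indices).
P = [[Fraction(0)] * n for _ in range(n)]
for m, targets in out_links.items():
    for k in targets:
        P[k - 1][m - 1] = Fraction(1, len(targets))

def column_sum(m):
    """Total "probability" leaving page m+1."""
    return sum(P[k][m] for k in range(n))

for row in P:
    print(" ".join(f"{str(v):>3}" for v in row))
print([str(column_sum(m)) for m in range(n)])
```

Columns 1 through 5 each sum to 1, while the column for page 6 sums to 0 because page 6 has no outgoing links.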
17. The matrix P for websites shows a lot of structure
Every dot is a non-zero element indicating a link.
Matrices are sparse, and generally have block structure.
Block structure can be exploited to speed up the ranking algorithm.
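Sparsity matters because only the non-zero entries need to be stored or touched. A rough sketch, assuming the six-page example graph stands in for a real web graph: multiplying by P straight from the link lists costs one operation per link rather than one per matrix entry. The function name is hypothetical.

```python
# Out-links of the six-page example graph (page -> pages it links to).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}

def sparse_matvec(out_links, x):
    """Compute y = P x by visiting each link once; zeros are never stored."""
    y = {i: 0.0 for i in out_links}
    for j, targets in out_links.items():
        for i in targets:
            y[i] += x[j] / len(targets)
    return y

nonzeros = sum(len(targets) for targets in out_links.values())
print(nonzeros, "non-zeros out of", len(out_links) ** 2, "entries")

y = sparse_matvec(out_links, {i: 1.0 for i in out_links})
print(y)
```

This sketch does not use block structure; it only illustrates the saving from skipping zeros.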
19. But this idea doesn’t work for the wee web-graph
[Figure: the six-node web graph with its problem nodes labeled.]
Node 1: “lonely”
Nodes 4 and 5: “mutual admiration societies”
Node 6: “anti-social”
These nodes need to be “fixed” to get a reliable and useful ranking!
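One way to see the trouble is to substitute x = Px repeatedly and watch what happens. This is a sketch under assumptions: the uniform starting guess and the choice of repeated substitution are mine, not taken from the slides.

```python
from fractions import Fraction

# Out-links of the six-page example graph (page -> pages it links to).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}
pages = sorted(out_links)

def step(x):
    """One substitution x <- Px, using the link lists directly."""
    new = {i: Fraction(0) for i in pages}
    for j, targets in out_links.items():
        for i in targets:
            new[i] += x[j] / len(targets)
    return new

# Assumed starting guess: every page equally important.
history = [{i: Fraction(1, 6) for i in pages}]
for _ in range(6):
    history.append(step(history[-1]))

for t, scores in enumerate(history):
    total = sum(scores.values())
    print(t, [str(scores[i]) for i in pages], "total =", total)
```

In this run the "lonely" node 1 drops to zero after one step, importance that reaches the "anti-social" node 6 disappears from the total, and nodes 4 and 5 end up swapping the remaining importance back and forth, so the scores never settle.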
20. The gang of four to the rescue
Andrei Markov
Oskar Perron
Georg Frobenius
Richard von Mises
22. Taxation is the way to representation!
[Figure: pages a, b, and c link to a page.]
If the page they link to is a good page, then it’ll still be a good page if we “tax” the importance from a, b, and c.
We can redistribute the taxed amounts to all pages, including lonely nodes!
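23. The importance of a page is determined by the importance of pages that link to it*
* After tax and any benefits
\[
x_i = \sum_{j \in B_i} \alpha \frac{x_j}{d_j} + (1 - \alpha) b_i
\]
\alpha x_j / d_j: the total importance that page j contributes to page i
b_i: benefits to page i
\alpha: sets the taxation rate of all pages (a fraction \alpha of each contribution is passed on; 1 - \alpha is taxed)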
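The taxed equation can be solved by repeated substitution. The sketch below makes several assumptions: α = 0.85, equal benefits b_i = 1/6 for every page, a fixed 200 iterations, and a function name invented for illustration.

```python
# Out-links of the six-page example graph (page -> pages it links to).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}
n = len(out_links)

def pagerank(alpha, b, iterations=200):
    """Iterate x_i = alpha * sum_{j in B_i} x_j / d_j + (1 - alpha) * b_i."""
    x = list(b)
    for _ in range(iterations):
        new = [(1 - alpha) * b[i] for i in range(n)]   # benefits
        for j, targets in out_links.items():
            for i in targets:                          # taxed contributions
                new[i - 1] += alpha * x[j - 1] / len(targets)
        x = new
    return x

alpha = 0.85            # assumed tax parameter
b = [1 / n] * n         # assumed: equal benefits for every page
x = pagerank(alpha, b)

for page, score in sorted(enumerate(x, start=1), key=lambda p: -p[1]):
    print(page, round(score, 4))
```

With the tax in place every page gets a positive score, including the "lonely" node 1, and the iteration settles instead of oscillating. Page 6 still has no outgoing links here, so this sketch does not redistribute the importance it receives.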
27. There’s still a lot of work left to
do to make a search engine
Make it fast!
Watch out for spam
Watch out for manipulation
Personalize
Experiment!
32. What in the heck does a multi-armed
bandit have to do with Google?
[Figure: four slot machines.]
Pays out $0.92/view
Pays out $0.66/view
Pays out $0.91/view to show ads
Pays out -$0.02/view (hide ads)
33. How to optimize your website
without exploiting the bandits
Try condition A 100 times, find 45 “wins”
Try condition B 100 times, find 85 “wins”
Try condition C 100 times, find 10 “wins”
…
Choose the best!
34. This field has some of the best terminology
Explore
Exploit
Regret
35. This field has some of the best terminology
Explore – Visiting Las Vegas!
Exploit – Your new winning strategy!
Regret – That you didn’t quit after winning the first round
36. This field has some of the best terminology
Explore – Testing slot machines/experiments for their reward
Exploit – Playing the best reward you’ve found so far
Regret – How much you lost due to exploration
37. How to optimize your website
without exploiting the bandits
Try condition A 100 times, find 45 “wins”
Try condition B 100 times, find 85 “wins”
Try condition C 100 times, find 10 “wins”
…
Choose the best!
Pure exploration! We only exploit our findings at the end!
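The pure-exploration recipe is short enough to write down directly. A minimal sketch using the win counts from the slide; the dictionary layout is an assumption.

```python
# (wins, tries) for each condition, taken from the slide.
trials = {"A": (45, 100), "B": (85, 100), "C": (10, 100)}

# Estimate each condition's win rate, then pick the best one at the end.
rates = {name: wins / tries for name, (wins, tries) in trials.items()}
best = max(rates, key=rates.get)

print(rates)
print("Choose", best)
```

All 300 tries are spent exploring; 200 of them go to conditions that turn out not to be the best.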
38. How to optimize your website exploiting the bandits
Pure exploration:
Try condition A 5 times, find 4 wins
Try condition B 5 times, find 4 wins
Try condition C 5 times, find 2 wins
Exploit our knowledge:
Try condition A 7 times, find 3 wins
Try condition B 7 times, find 5 wins
Try condition C 1 time, find 0 wins
Condition   Est. Return
A           0.58
B           0.75
C           0.33
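The estimated returns follow from pooling the wins and tries across both rounds. A small sketch of that arithmetic; the data layout is an assumption, and the rule that chose 7, 7, and 1 tries in the second round is not given on the slide, so it is not modeled here.

```python
# (wins, tries) per condition in each round, taken from the slide.
rounds = [
    {"A": (4, 5), "B": (4, 5), "C": (2, 5)},   # pure exploration
    {"A": (3, 7), "B": (5, 7), "C": (0, 1)},   # exploiting what we learned
]

# Pool wins and tries per condition.
totals = {}
for results in rounds:
    for name, (wins, tries) in results.items():
        w, t = totals.get(name, (0, 0))
        totals[name] = (w + wins, t + tries)

estimates = {name: round(w / t, 2) for name, (w, t) in totals.items()}
print(estimates)  # {'A': 0.58, 'B': 0.75, 'C': 0.33}
```

For example, condition A has 4 + 3 = 7 wins in 5 + 7 = 12 tries, which gives 0.58.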
39. The goal of these problems is to construct optimal strategies to minimize regret
Regret: how much you left “on the table” by exploring.
A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays T → ∞.
\[
\text{regret} = E[\text{play best always}] - E[\text{plays made based on data}]
\]
\[
\text{regret (100-each)} = \frac{255}{300} - \frac{140}{300} = 0.38
\]
\[
\text{regret (30-mixed)} = \frac{25.5}{30} - \frac{0.45 \times 12 + 0.85 \times 12 + 0.1 \times 6}{30} = 0.31
\]
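Both regret numbers can be checked with a few lines. This sketch assumes, as the slide does, that the win rates 0.45, 0.85, and 0.10 are the true returns of conditions A, B, and C; the function name is hypothetical.

```python
def per_play_regret(true_rates, plays):
    """Expected return of always playing the best condition, minus the
    expected return of the plays actually made, per play."""
    total = sum(plays.values())
    best = max(true_rates.values())
    expected = sum(true_rates[name] * count for name, count in plays.items())
    return (best * total - expected) / total

# Assumed true win rates, taken from the 100-tries-each experiment.
true_rates = {"A": 0.45, "B": 0.85, "C": 0.10}

each_100 = per_play_regret(true_rates, {"A": 100, "B": 100, "C": 100})
mixed_30 = per_play_regret(true_rates, {"A": 12, "B": 12, "C": 6})

print(round(each_100, 2))  # 0.38
print(round(mixed_30, 2))  # 0.31
```

The mixed strategy has lower regret per play because it shifts tries away from condition C once C starts to look bad.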
40. [The bandit problem] was formulated during the [second
world] war, and efforts to solve it so sapped the energies
and minds of Allied analysts that the suggestion was
made that the problem be dropped over Germany, as the
ultimate instrument of intellectual sabotage.
Peter Whittle (Whittle, 1979)
Discussion of “Bandit processes and dynamical allocation indices”
Their importance to website optimization,
advertising, and recommendation has
rejuvenated research on these problems
with fascinating new questions.
41. Math is everywhere and
especially your favorite
websites!
Matrices and probability are
key ingredients.
42. PageRank on Wikipedia
Note: top 10 articles on Wikipedia with highest PageRank, for three values of α.
α = 0.50: United States, C:Living people, France, Germany, England, United Kingdom, Canada, Japan, Poland, Australia
α = 0.85: United States, C:Main topic classif., C:Contents, C:Living people, C:Ctgs. by country, United Kingdom, C:Fundamental, C:Ctgs. by topic, C:Wikipedia admin., France
α = 0.99: C:Contents, C:Main topic classif., C:Fundamental, United States, C:Wikipedia admin., P:List of portals, P:Contents/Portals, C:Portals, C:Society, C:Ctgs. by topic
David F. Gleich (Sandia) Sensitivity Purdue