How does Google Google: A journey into the wondrous mathematics behind your favorite websites

How Does Google Google?

David F. Gleich
Computer Science
Purdue University

A journey into the wondrous mathematics
behind your favorite websites

Mathematics underlies an
enormous number of the
websites we use every day!

1.  Google’s PageRank

2.  Multi-armed bandits and
internet experiments

Larry Page
Sergey Brin

•  Created a web-search algorithm
called “backrub”
•  Spun off a company “Googol”
based on the paper

•  The importance of a page is
determined by the importance of
pages that link to it.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry
Winograd. “The PageRank Citation Ranking: Bringing
Order to the Web.” Technical report, Stanford InfoLab, 1999.

A websearch primer
1.  Crawl webpages
2.  Analyze webpage text (information retrieval)
3.  Analyze webpage links
4.  Fit over 200 measures to human evaluations
5.  Produce rankings
6.  Continuously update

Pages, nodes, incoming links,
outgoing links, and “importance”
[Figure: pages a, b, and c are “important” pages that link to me; a second example shows “important” pages that link to Purdue.]

The web
[Figure: graph drawing credited to Tim Davis and Yifan Hu, Sparse Matrix Gallery.]
1000 vertices on 8.5-by-11 paper
1,000,000,000,000 vertices (one trillion): paper the size of Manhattan island (23 sq miles)?
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

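A wee web-graph: link
counting is too easy to game!
[Figure: six-page web graph. Page 1 links to pages 2, 3, and 4 (weight 1/3 each); page 2 links to pages 3 and 6 (weight 1/2 each); page 3 links to page 4; pages 4 and 5 link to each other; page 6 links to no one.]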

The importance of a page is determined by the importance of
pages that link to it.
\[
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2
\end{aligned}
\]

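The importance of a page is determined
by the importance of pages that link to it
\[
x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j
\]
x_i: “importance” of page i
x_j: “importance” of page j
B_i: “back-links from page i”, the pages that link to page i (why it was called Backrub!)
d_j: number of links page j uses (out-degree in graph theory)

Example: page 1 gives 1/3 and page 2 gives 1/2 of its importance to page 3, so
\[
x_3 = \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2
\]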
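As a rough illustration of the back-link formula, the sketch below evaluates its right-hand side for the six-page example. The link table `out_links` is read off the example equations, and the helper names `back_links` and `importance_update` are hypothetical, chosen only for this sketch.

```python
from fractions import Fraction

# Out-links of the six-page example (page -> pages it links to), as implied
# by the example equations; page 6 links to no one. Assumed for this sketch.
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}


def back_links(out_links):
    """Invert the link table: B[i] lists the pages j that link to page i."""
    B = {i: [] for i in out_links}
    for j, targets in out_links.items():
        for i in targets:
            B[i].append(j)
    return B


def importance_update(x, out_links):
    """Evaluate sum over j in B_i of x_j / d_j for every page i."""
    B = back_links(out_links)
    return {
        i: sum((x[j] / len(out_links[j]) for j in B[i]), Fraction(0))
        for i in out_links
    }


# Give every page importance 1 and see what each page receives from its back-links.
x = {i: Fraction(1) for i in out_links}
rhs = importance_update(x, out_links)
for page, value in rhs.items():
    print(page, value)
```

With every page set to 1, page 3 receives 1/3 + 1/2 = 5/6, matching the x_3 equation, and page 1 receives nothing because no page links to it.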

We can rewrite this equation in a more
mathematically convenient way
\[
\begin{aligned}
x_1 &= 0 x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_2 &= \tfrac{1}{3} x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_4 &= \tfrac{1}{3} x_1 + 0 x_2 + 1 x_3 + 0 x_4 + 1 x_5 + 0 x_6 \\
x_5 &= 0 x_1 + 0 x_2 + 0 x_3 + 1 x_4 + 0 x_5 + 0 x_6 \\
x_6 &= 0 x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6
\end{aligned}
\]

or
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
\]
And even more conveniently!
\[
x = Px
\]
Element k in column m = "probability" of
going from node m to node k
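One way to see the matrix form in action is to build P from the link table and check its columns. This is only a sketch: `out_links` is read off the example equations, and `link_matrix` and `matvec` are hypothetical helper names.

```python
from fractions import Fraction

# Out-links of the six-page example (assumed from the example equations).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}
n = 6


def link_matrix(out_links, n):
    """P[k][m] = 1/d_m if page m+1 links to page k+1, else 0 (0-based indices)."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for m, targets in out_links.items():
        for k in targets:
            P[k - 1][m - 1] = Fraction(1, len(targets))
    return P


def matvec(P, x):
    """Plain matrix-vector product Px."""
    return [sum((p * v for p, v in zip(row, x)), Fraction(0)) for row in P]


P = link_matrix(out_links, n)

# Column m holds the "probabilities" of leaving page m, so it should sum to 1
# whenever page m has at least one out-link.
column_sums = [sum(P[k][m] for k in range(n)) for m in range(n)]
print(column_sums)
print(matvec(P, [Fraction(1)] * n))
```

The column for page 6 sums to 0 rather than 1, because page 6 has no out-links.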

The matrix P for websites
shows a lot of structure
[Figure: plot of the non-zero elements of P.]
Every dot is a non-zero element indicating a link.
The matrices are sparse and generally have block structure.
The block structure can be exploited to speed up the ranking algorithm.

But this idea doesn’t work for
the wee web-graph
Nodes 1, 4 and 5
determine everything!
\[
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 = 0 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 = 0 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 = x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2 = 0
\end{aligned}
\]

Node 1: “lonely”
Nodes 4 and 5: “mutual admiration societies”
Node 6: “anti-social”
These nodes need to be “fixed” to get a
reliable and useful ranking!
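A small experiment makes the three problems visible. The sketch below repeatedly applies x ← Px to the unfixed matrix of the six-page example, starting from equal importance 1/6 for every page. The starting vector and the helper name `step` are assumptions made for this illustration.

```python
from fractions import Fraction

F = Fraction
# Unfixed link matrix of the six-page example: entry [k][m] is the share of
# page m+1's importance passed to page k+1. Column 6 is all zeros.
P = [
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 2), 0, 0, 0, 0],
    [F(1, 3), 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, F(1, 2), 0, 0, 0, 0],
]


def step(P, x):
    """One application of x <- Px."""
    return [sum((p * v for p, v in zip(row, x)), F(0)) for row in P]


x = [F(1, 6)] * 6          # assumed start: every page equally important
history = [x]
for _ in range(5):
    x = step(P, x)
    history.append(x)

for k, v in enumerate(history):
    print(k, [str(e) for e in v], "total =", sum(v))
```

In this run, pages 1, 2, 3, and 6 drop to zero within three steps, the total importance shrinks from 1 to 13/18 because whatever reaches page 6 is never passed on, and pages 4 and 5 keep swapping 5/12 and 11/36 forever instead of settling.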

The gang of four to the rescue
Andrei Markov
Oskar Perron
Georg Frobenius
Richard von Mises

Let’s fix it up and force node 6 to
choose, or link to everyone
\[
P =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\]
becomes
\[
P =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\]
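The fix for node 6 can be written as a small routine: any column of P that is entirely zero (a page with no out-links) is replaced by a column of 1/n. This is a sketch of that one step; the function name `fix_dangling_columns` is hypothetical.

```python
from fractions import Fraction

F = Fraction
# Unfixed link matrix of the six-page example (column 6 is all zeros).
P = [
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 2), 0, 0, 0, 0],
    [F(1, 3), 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, F(1, 2), 0, 0, 0, 0],
]


def fix_dangling_columns(P):
    """Return a copy of P where every all-zero column becomes uniform 1/n."""
    n = len(P)
    fixed = [row[:] for row in P]
    for m in range(n):
        if all(P[k][m] == 0 for k in range(n)):
            for k in range(n):
                fixed[k][m] = Fraction(1, n)
    return fixed


P_fixed = fix_dangling_columns(P)
column_sums = [sum(P_fixed[k][m] for k in range(6)) for m in range(6)]
print(column_sums)  # every column now sums to 1
```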

Taxation is the way to
representation!
[Figure: pages a, b, and c link to a page.]
If a page is a good page, then
it’ll still be a good page if
we “tax” the importance
it receives from a, b, and c.

We can redistribute the
taxed amounts to all pages,
including lonely nodes!

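The importance of a page is determined
by the importance of pages that link to it*
* After tax and any benefits
\[
x_i = \alpha \sum_{j \in B_i} \frac{x_j}{d_j} + (1 - \alpha) b_i
\]
x_j / d_j: the total importance that page j contributes to page i
b_i: benefits to page i
\alpha: sets the taxation rate for all pages (each page passes on a share \alpha of its importance and is taxed 1 - \alpha)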

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \alpha
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
+ (1 - \alpha)
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
\]
Perron and Frobenius showed the new
equation always has a unique solution
\[
x = \alpha P x + (1 - \alpha) b
\]
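Because the equation is linear, it can also be solved directly by rearranging it to (I - αP) x = (1 - α) b. The sketch below does this in exact rational arithmetic for the six-page example. The slides do not state α or b, so α = 0.85 and a uniform b (1/6 for every page) are assumptions here, and `solve` is a hypothetical helper implementing ordinary Gauss-Jordan elimination.

```python
from fractions import Fraction

F = Fraction
# Fixed link matrix of the six-page example (node 6 links to everyone).
P = [
    [0, 0, 0, 0, 0, F(1, 6)],
    [F(1, 3), 0, 0, 0, 0, F(1, 6)],
    [F(1, 3), F(1, 2), 0, 0, 0, F(1, 6)],
    [F(1, 3), 0, 1, 0, 1, F(1, 6)],
    [0, 0, 0, 1, 0, F(1, 6)],
    [0, F(1, 2), 0, 0, 0, F(1, 6)],
]
n = 6
alpha = F(85, 100)   # assumed value, not given on the slide
b = [F(1, 6)] * n    # assumed uniform "benefits", not given on the slide


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with exact fractions."""
    size = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(size):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, size) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(size):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][size] for r in range(size)]


# (I - alpha P) x = (1 - alpha) b
A = [[(1 if i == k else 0) - alpha * P[i][k] for k in range(n)] for i in range(n)]
rhs = [(1 - alpha) * bi for bi in b]
x = solve(A, rhs)
rounded = [round(float(v), 2) for v in x]
print(rounded)
```

Under these assumptions the importances sum to exactly 1, and pages 4 and 5 come out far ahead of the rest.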

What von Mises and Richardson showed
is that guess, check, and correct works!
\[
x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b
\]
\[
x^{(\text{start})} =
\begin{bmatrix} 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \end{bmatrix}
\quad
x^{(1)} =
\begin{bmatrix} 0.05 \\ 0.10 \\ 0.17 \\ 0.38 \\ 0.19 \\ 0.12 \end{bmatrix}
\quad
x^{(2)} =
\begin{bmatrix} 0.04 \\ 0.06 \\ 0.10 \\ 0.36 \\ 0.36 \\ 0.08 \end{bmatrix}
\]
Continuing the iteration, the values settle at
\[
x \approx
\begin{bmatrix} 0.03 \\ 0.04 \\ 0.06 \\ 0.43 \\ 0.39 \\ 0.05 \end{bmatrix}
\]
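The guess, check, and correct loop takes only a few lines of code. In the sketch below, α = 0.85 and a uniform b (1/6 per page) are assumptions, since the slides do not state them; with these choices the first two iterates round to the values shown for x(1) and x(2). The function name `pagerank_iteration` is hypothetical.

```python
# Fixed link matrix of the six-page example (node 6 links to everyone).
P = [
    [0, 0, 0, 0, 0, 1 / 6],
    [1 / 3, 0, 0, 0, 0, 1 / 6],
    [1 / 3, 1 / 2, 0, 0, 0, 1 / 6],
    [1 / 3, 0, 1, 0, 1, 1 / 6],
    [0, 0, 0, 1, 0, 1 / 6],
    [0, 1 / 2, 0, 0, 0, 1 / 6],
]
alpha = 0.85       # assumed value, not given on the slide
b = [1 / 6] * 6    # assumed uniform "benefits", not given on the slide


def pagerank_iteration(P, alpha, b, x_start, steps):
    """Repeat x_new = alpha * P x_old + (1 - alpha) * b and keep every iterate."""
    x = list(x_start)
    iterates = [x]
    for _ in range(steps):
        x = [
            alpha * sum(p * v for p, v in zip(row, x)) + (1 - alpha) * bi
            for row, bi in zip(P, b)
        ]
        iterates.append(x)
    return iterates


iterates = pagerank_iteration(P, alpha, b, x_start=b, steps=200)
for k in (0, 1, 2, 200):
    print(k, [round(v, 2) for v in iterates[k]])
```

Each step only needs one multiplication by P, which is why this approach remains practical when P is as large and sparse as the web.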

There’s still a lot of work left to
do to make a search engine
Make it fast!
Watch out for spam
Watch out for manipulation
Personalize

Experiment!

1.  Google’s PageRank

2.  Multi-armed bandits and
internet experiments

Not this!
[Image: http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/]

This!
[Image: http://upload.wikimedia.org/wikipedia/en/8/82/Las_Vegas_slot_machines.jpg]
•  Pays out $0.92/dollar
•  Pays out $0.98/dollar
•  Pays out $0.95/dollar
•  Pays out $0.99/dollar

What in the heck does a multi-armed
bandit have to do with Google?
•  Pays out $0.92/view
•  Pays out $0.66/view
•  Pays out $0.91/view to show ads
•  Pays out -$0.02/view, hide ads

How to optimize your website
without exploiting the bandits
Try condition A 100 times, find 45 “wins”
Try condition B 100 times, find 85 “wins”
Try condition C 100 times, find 10 “wins”
…
Choose the best!

This field has some of the
best terminology

Explore – Visiting Las Vegas!
Exploit – Your new winning strategy!
Regret – That you didn’t quit after winning the first round

Explore – Testing slot machines/experiments for their reward
Exploit – Playing the best reward you’ve found so far
Regret – How much you lost due to exploration

Trying each condition 100 times and then choosing the best is pure exploration!
We only exploit our findings at the end!
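A pure-exploration experiment of this kind can be simulated in a few lines. In the sketch below, each condition is modeled as a coin that “wins” with a fixed probability. The probabilities 0.45, 0.85, and 0.10 are assumptions taken from the observed win counts, the seed is arbitrary, and `explore_then_choose` is a hypothetical name.

```python
import random


def explore_then_choose(win_rates, plays_per_condition, seed=0):
    """Play every condition the same number of times, then pick the top scorer."""
    rng = random.Random(seed)
    wins = {}
    for name, rate in win_rates.items():
        # Each play is an independent win with probability `rate`.
        wins[name] = sum(rng.random() < rate for _ in range(plays_per_condition))
    best = max(wins, key=wins.get)
    return wins, best


# Assumed "true" win rates, chosen to mirror the 45/85/10 wins per 100 plays.
true_rates = {"A": 0.45, "B": 0.85, "C": 0.10}
wins, best = explore_then_choose(true_rates, plays_per_condition=100)
print(wins, "-> choose", best)
```

All 300 plays are split evenly, so the weak conditions receive as many plays as the strong one, and nothing learned along the way is used until the final choice.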

Exploiting the bandits
Pure exploration:
Try condition A 5 times, find 4 wins
Try condition B 5 times, find 4 wins
Try condition C 5 times, find 2 wins

Exploit our knowledge:
Try condition A 7 times, find 3 wins
Try condition B 7 times, find 5 wins
Try condition C 1 time, find 0 wins

Condition   Est. Return
A           0.58
B           0.75
C           0.33
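The estimated returns in the table are the total wins divided by the total plays for each condition across both rounds. A minimal sketch of that bookkeeping follows; the data layout and the name `estimated_returns` are assumptions made for illustration.

```python
from fractions import Fraction

# (plays, wins) per condition for the two rounds listed above.
rounds = [
    {"A": (5, 4), "B": (5, 4), "C": (5, 2)},   # pure exploration
    {"A": (7, 3), "B": (7, 5), "C": (1, 0)},   # exploiting what we learned
]


def estimated_returns(rounds):
    """Pool plays and wins over all rounds and return wins / plays per condition."""
    plays, wins = {}, {}
    for rnd in rounds:
        for name, (n_plays, n_wins) in rnd.items():
            plays[name] = plays.get(name, 0) + n_plays
            wins[name] = wins.get(name, 0) + n_wins
    return {name: Fraction(wins[name], plays[name]) for name in plays}


est = estimated_returns(rounds)
table = {name: round(float(value), 2) for name, value in est.items()}
print(table)
```

After the first round A and B are tied at 4 wins out of 5, which is why the second round gives them 7 plays each and condition C only 1.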

The goal of these problems is to construct
optimal strategies to minimize regret
Regret: how much you left “on the table” by exploring

A zero-regret strategy is one where
regret(T trials) is sublinear in T
as the number of plays T → ∞
\[
\text{regret} = E[\text{play best always} - \text{plays made based on data}]
\]
\[
\text{regret}_{\text{100-each}} = \frac{255}{300} - \frac{140}{300} = 0.38
\]
\[
\text{regret}_{\text{30-mixed}} = \frac{25.5}{30} - \frac{0.45 \times 12 + 0.85 \times 12 + 0.1 \times 6}{30} = 0.31
\]
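The two regret figures can be checked with a short calculation. The sketch below treats the observed win rates 0.45, 0.85, and 0.10 as if they were the true rates, as the arithmetic above does, and reports regret per play. The function name `average_regret` is hypothetical.

```python
from fractions import Fraction

F = Fraction
# Win rates for conditions A, B, C, taken as the true rates for this calculation.
rate = {"A": F(45, 100), "B": F(85, 100), "C": F(10, 100)}


def average_regret(rate, plays):
    """(expected wins if always playing the best - expected wins as played) / plays."""
    total = sum(plays.values())
    best_rate = max(rate.values())
    expected_best = best_rate * total
    expected_actual = sum(rate[name] * n for name, n in plays.items())
    return (expected_best - expected_actual) / total


regret_100_each = average_regret(rate, {"A": 100, "B": 100, "C": 100})
regret_30_mixed = average_regret(rate, {"A": 12, "B": 12, "C": 6})
print(float(regret_100_each), float(regret_30_mixed))
```

The mixed strategy has the lower regret per play because it shifts plays away from condition C once the early results show it to be weak.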

[The bandit problem] was formulated during the [second
world] war, and efforts to solve it so sapped the energies
and minds of Allied analysts that the suggestion was
made that the problem be dropped over Germany, as the
ultimate instrument of intellectual sabotage.

Peter Whittle (Whittle, 1979)
Discussion of “Bandit processes and dynamical allocation indices”
Their importance to website optimization,
advertising, and recommendation has
rejuvenated research on these problems
with fascinating new questions.

Math is everywhere and
especially your favorite
websites!
Matrices and probability are
key ingredients.

PageRank on Wikipedia
Top 10 articles on Wikipedia with highest PageRank, for three values of α

α = 0.50
1.  United States
2.  C:Living people
3.  France
4.  Germany
5.  England
6.  United Kingdom
7.  Canada
8.  Japan
9.  Poland
10.  Australia

α = 0.85
1.  United States
2.  C:Main topic classif.
3.  C:Contents
4.  C:Living people
5.  C:Ctgs. by country
6.  United Kingdom
7.  C:Fundamental
8.  C:Ctgs. by topic
9.  C:Wikipedia admin.
10.  France

α = 0.99
1.  C:Contents
2.  C:Main topic classif.
3.  C:Fundamental
4.  United States
5.  C:Wikipedia admin.
6.  P:List of portals
7.  P:Contents/Portals
8.  C:Portals
9.  C:Society
10.  C:Ctgs. by topic
