Linear algebra behind Google search

Dr. V.N. Krishnachandran
Department of Computer Applications
Vidya Academy of Science and Technology
Thrissur - 680501, Kerala.
August 2011

Outline
1 Web: An example
2 Importance score
3 First unsuccessful approach
4 Second unsuccessful approach
5 Third unsuccessful approach
6 Dangling nodes
7 Disconnected webs
8 Google approach
9 Computational scheme

Web world
The web world consists of a number of pages and links from some
of the pages to some other pages.
In a diagrammatic representation of a web world, pages are denoted
by small squares or circles and links are indicated by arrows.
A simplified web world is shown in the next slide.

Web world
Example 1: A web with four pages numbered 1,2,3,4.

Links
In the figure above, an arrow from Page p to Page q denotes:
an outgoing link from Page p.
an incoming link (also called a backlink) to Page q.

Links
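Outgoing links in Example 1
Page 1 links to Pages 2, 3 and 4.
Page 2 links to Pages 3 and 4.
Page 3 links to Page 1.
Page 4 links to Pages 1 and 3.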

Links
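Incoming links in Example 1
Page 1 has backlinks from Pages 3 and 4.
Page 2 has a backlink from Page 1.
Page 3 has backlinks from Pages 1, 2 and 4.
Page 4 has backlinks from Pages 1 and 2.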

Importance score
In Google’s search algorithm, the most important concept is that
of the importance score of a page.
This we explain in the next few slides...

Importance score
The importance score, or simply the score, of a page is a
number which is a measure of the relative importance of a
page.
The importance score is a nonnegative real number.
The importance score of a page is derived from the backlinks
for that page.

Importance score vector
We denote the importance score of Page k by $x_k$.
Let there be n pages in the web. The column vector
$$x = [x_1 \; x_2 \; \cdots \; x_n]^T$$
is called the importance score vector.
The importance score vector x is said to be normalised if
$$x_1 + x_2 + \cdots + x_n = 1.$$

Unsuccessful attempts to define importance score
Before considering Google’s approach, we consider
three unsuccessful attempts to define the concept of the
importance score of a page.
A study of these unsuccessful attempts helps one appreciate the
significance of Google’s approach.

Importance score:
First unsuccessful approach

Importance score: First unsuccessful approach
Definition (First unsuccessful approach)
Importance score of Page k is the number of backlinks for Page k.

Importance scores in Example 1
Counting backlinks gives x1 = 2, x2 = 1, x3 = 3, x4 = 2.
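The first approach can be sketched in a few lines of Python. This is a minimal illustration, assuming the link structure of Example 1 implied by the backlink sets used later in the talk (L1 = {3, 4}, L2 = {1}, L3 = {1, 2, 4}, L4 = {1, 2}); the names `outlinks` and `backlink_counts` are illustrative and do not come from the slides.

```python
# Outgoing links of the four-page web in Example 1 (page -> pages it links to).
outlinks = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}

def backlink_counts(outlinks):
    """First approach: score of a page = number of pages linking to it."""
    counts = {page: 0 for page in outlinks}
    for source, targets in outlinks.items():
        for target in targets:
            counts[target] += 1
    return counts

scores = backlink_counts(outlinks)
print(scores)
```

Pages 1 and 4 end up with the same count, which is the weakness the next slide points out.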

Importance score
Importance score: A desirable property
“A link to Page k from an important page must increase Page k’s
score more than a link from an unimportant page.”
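The first unsuccessful approach does not have this property.
(see next slide)

In Example 1, Pages 1 and 4 both have two backlinks and so get the same score. But Page 1 is linked from Page 3, the page with the most backlinks, so the importance score of Page 1 must be higher than that of Page 4.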

Importance score:
Second unsuccessful approach

Importance score: Second unsuccessful approach
Definition (Second unsuccessful approach)
The importance score of a page is the sum of the scores of all
pages linking to the page.

Importance scores in Example 1
The importance scores in Example 1 (second approach) are
solutions of the following system of equations:
x1 = x3 + x4
x2 = x1
x3 = x1 + x2 + x4
x4 = x1 + x2

Importance scores in Example 1 : Matrix formulation
$$H = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}, \qquad x = [x_1 \; x_2 \; x_3 \; x_4]^T$$
$$Hx = x$$

A nonzero solution x would be an eigenvector of H with eigenvalue 1.
But 1 is not an eigenvalue of H.
So there is no eigenvector with eigenvalue 1 for the matrix H.
The second approach does not assign importance scores to the pages
in Example 1.
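That 1 is not an eigenvalue of H can be checked by substitution in the four equations of the second approach:

$$x_2 = x_1, \qquad x_4 = x_1 + x_2 = 2x_1, \qquad x_3 = x_1 + x_2 + x_4 = 4x_1, \qquad x_1 = x_3 + x_4 = 6x_1.$$

The last equation forces $x_1 = 0$, and then $x_2 = x_3 = x_4 = 0$, so $Hx = x$ has only the zero solution.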

Importance score: An undesirable property
“A page with many outgoing links has a bigger influence on the
scores of other pages than a page with fewer outgoing
links.”
This is undesirable.
The recommendation letter of a Professor who is choosy in giving
such letters carries higher value than that of a Professor who is
very liberal in issuing such letters.

Importance score:
Third unsuccessful approach

Importance score: Third unsuccessful approach
Notations
n = Number of pages in the web
Pages indexed by k = 1, 2, . . . , n.
nj = Number of outgoing links from page j
Lk = Set of indices of backlinks for page k

Definition (Third unsuccessful approach)
Let the web contain n pages and let it be indexed by an integer k,
1 ≤ k ≤ n. Let $L_k \subseteq \{1, 2, \ldots, n\}$ be the set of backlinks for Page
k, and $n_j$ the number of outgoing links from Page j. Then
$$x_k = \sum_{j \in L_k} \frac{x_j}{n_j}, \qquad k = 1, 2, \ldots, n.$$

Importance scores in Example 1 : Notations
n = 4, k = 1, 2, 3, 4.

n1 = 3, n2 = 2, n3 = 1, n4 = 2

L1 = {3, 4}, L2 = {1}, L3 = {1, 2, 4}, L4 = {1, 2}

Importance scores in Example 1 : Equations
Expression to compute x1:
$$x_1 = \sum_{j \in L_1} \frac{x_j}{n_j} = \sum_{j \in \{3,4\}} \frac{x_j}{n_j} = \frac{x_3}{n_3} + \frac{x_4}{n_4} = \frac{x_3}{1} + \frac{x_4}{2}$$
Similar expressions for x2, x3 and x4. (See next slide ...)

Linear system of equations to compute importance score:
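$$\begin{aligned}
x_1 &= \frac{x_3}{1} + \frac{x_4}{2} \\
x_2 &= \frac{x_1}{3} \\
x_3 &= \frac{x_1}{3} + \frac{x_2}{2} + \frac{x_4}{2} \\
x_4 &= \frac{x_1}{3} + \frac{x_2}{2}
\end{aligned}$$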

The link matrix of web world in Example 1:
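$$A = \begin{bmatrix} 0 & 0 & 1 & \tfrac12 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & \tfrac12 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix}, \qquad x = [x_1 \; x_2 \; x_3 \; x_4]^T$$
$$Ax = x$$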

x is an eigenvector with eigenvalue 1 for the link matrix A.
1 is indeed an eigenvalue of A.
All nonzero multiples of the vector $[12 \; 4 \; 9 \; 6]^T$ are eigenvectors of
A corresponding to the eigenvalue 1.
The normalised importance score vector for the web in
Example 1 is
$$x = \left[\tfrac{12}{31} \;\; \tfrac{4}{31} \;\; \tfrac{9}{31} \;\; \tfrac{6}{31}\right]^T = [0.387 \;\; 0.129 \;\; 0.290 \;\; 0.194]^T \text{ (approx.)}$$
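The construction of the link matrix and the eigenvector claim can be checked with exact rational arithmetic. The sketch below assumes the link structure of Example 1 given by the backlink sets above; the helper names `link_matrix` and `mat_vec` are illustrative.

```python
from fractions import Fraction

# Outgoing links of Example 1 (page -> pages it links to).
outlinks = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}

def link_matrix(outlinks):
    """Entry (k, j) is 1/n_j if Page j links to Page k, else 0."""
    pages = sorted(outlinks)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for k in targets:
            A[index[k]][index[j]] = Fraction(1, len(targets))
    return A

def mat_vec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = link_matrix(outlinks)
v = [Fraction(12), Fraction(4), Fraction(9), Fraction(6)]
Av = mat_vec(A, v)               # equals v, so v is an eigenvector for eigenvalue 1
x = [vi / sum(v) for vi in v]    # normalised scores 12/31, 4/31, 9/31, 6/31
```

Using `Fraction` avoids rounding, so the equality `Av == v` is an exact check rather than a numerical approximation.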

Limitations of
third unsuccessful approach
Third unsuccessful approach has two severe limitations:
Problem of dangling nodes: If there are dangling nodes in the
web, one cannot assign importance scores to any page.
Problem of disconnected web: If the web is disconnected, one
cannot assign unique importance scores to all the pages in the
web.

Dangling nodes
Deﬁnition
A dangling node is a page with no outgoing links.

Dangling nodes
Example 2 : Web with dangling node
(Page 4 is a dangling node)

Dangling nodes
$$\begin{aligned}
x_1 &= x_3 \\
x_2 &= \frac{x_1}{3} \\
x_3 &= \frac{x_1}{3} + \frac{x_2}{2} \\
x_4 &= \frac{x_1}{3} + \frac{x_2}{2}
\end{aligned}$$

Dangling nodes
Link matrix for the web in Example 2:
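$$A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix}, \qquad x = [x_1 \; x_2 \; x_3 \; x_4]^T$$
$$Ax = x$$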

Dangling nodes
Importance scores in Example 2 : Values
A nonzero solution x would be an eigenvector of A with eigenvalue 1.
But 1 is not an eigenvalue of A.
So there is no eigenvector with eigenvalue 1 for the matrix A.
The definition (third approach) does not assign importance
scores to the pages in Example 2.

Dangling nodes
Mathematics
Definition
A square matrix is called a column-stochastic matrix if all its
entries are nonnegative and the entries in each column sum to 1.
Theorem
Every column-stochastic matrix has 1 as an eigenvalue.
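A short justification of the theorem, given here as a sketch: let $A$ be an $n \times n$ column-stochastic matrix and let $e = [1 \; 1 \; \cdots \; 1]^T$. Because each column of $A$ sums to 1,

$$A^T e = e,$$

so 1 is an eigenvalue of $A^T$. A matrix and its transpose have the same eigenvalues, since $\det(A - \lambda I) = \det(A^T - \lambda I)$, and therefore 1 is also an eigenvalue of $A$.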

Dangling nodes
Mathematics
Theorem
The link matrix for a web with no dangling nodes is
column-stochastic.
Theorem
The link matrix for a web with no dangling nodes has 1 as an
eigenvalue.
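The difference between Example 1 and Example 2 can be seen by checking column sums directly. The following is a minimal sketch; the function name `is_column_stochastic` is an illustrative choice, and the two matrices are the link matrices of Examples 1 and 2.

```python
from fractions import Fraction as F

def is_column_stochastic(A):
    """True if all entries are nonnegative and every column sums to 1."""
    n = len(A)
    nonneg = all(entry >= 0 for row in A for entry in row)
    sums_ok = all(sum(A[i][j] for i in range(n)) == 1 for j in range(n))
    return nonneg and sums_ok

# Link matrix of Example 1 (no dangling nodes).
A1 = [[0, 0, 1, F(1, 2)],
      [F(1, 3), 0, 0, 0],
      [F(1, 3), F(1, 2), 0, F(1, 2)],
      [F(1, 3), F(1, 2), 0, 0]]

# Link matrix of Example 2 (Page 4 is dangling, so column 4 is zero).
A2 = [[0, 0, 1, 0],
      [F(1, 3), 0, 0, 0],
      [F(1, 3), F(1, 2), 0, 0],
      [F(1, 3), F(1, 2), 0, 0]]

ex1_ok = is_column_stochastic(A1)
ex2_ok = is_column_stochastic(A2)
```

A dangling node contributes a zero column, which is why the theorem on column-stochastic matrices no longer applies to Example 2.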

Disconnected webs
Definition
A web W is disconnected if W can be partitioned into two
nonempty subwebs W1 and W2 such that there is no outgoing link
from any page in W1 to any page in W2 and vice versa.

Disconnected webs
Example 3 : A web with two disconnected subwebs
W1 (Pages 1, 2) and W2 (Pages 3, 4, 5)

Disconnected webs
$$\begin{aligned}
x_1 &= x_2 \\
x_2 &= x_1 \\
x_3 &= x_4 + \frac{x_5}{2} \\
x_4 &= x_3 + \frac{x_5}{2} \\
x_5 &= 0
\end{aligned}$$

Disconnected webs
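$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \tfrac12 \\ 0 & 0 & 1 & 0 & \tfrac12 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad x = [x_1 \; x_2 \; x_3 \; x_4 \; x_5]^T$$
$$Ax = x$$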

Disconnected webs
Importance scores in Example 3 : Values
Two linearly independent eigenvectors with eigenvalue 1:
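$$x = \left[\tfrac12 \;\; \tfrac12 \;\; 0 \;\; 0 \;\; 0\right]^T, \qquad x = \left[0 \;\; 0 \;\; \tfrac12 \;\; \tfrac12 \;\; 0\right]^T$$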
These are linearly independent, normalised, importance score
vectors in Example 3 .

Disconnected webs
The third approach does not produce a unique importance score
for every page in a disconnected web.
In the third approach:
Web is disconnected =⇒ Importance scores are not unique
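The loss of uniqueness can be made concrete: any normalised blend of the two eigenvectors of Example 3 is again a valid score vector. The sketch below checks this with exact fractions; `blend` and `mat_vec` are illustrative helper names.

```python
from fractions import Fraction as F

# Link matrix of Example 3.
A = [[0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 1, F(1, 2)],
     [0, 0, 1, 0, F(1, 2)],
     [0, 0, 0, 0, 0]]

def mat_vec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

v1 = [F(1, 2), F(1, 2), 0, 0, 0]
v2 = [0, 0, F(1, 2), F(1, 2), 0]

def blend(t):
    """Normalised combination t*v1 + (1 - t)*v2 for 0 <= t <= 1."""
    return [t * a + (1 - t) * b for a, b in zip(v1, v2)]

candidates = [blend(F(k, 4)) for k in range(5)]
all_fixed = all(mat_vec(A, x) == x for x in candidates)
```

Each candidate sums to 1 and satisfies Ax = x, yet the candidates rank the two subwebs differently, so no single ranking is singled out.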

Google’s approach

Google matrix: Definition
Consider a web with n pages.
Let A be the link matrix of the web.
Let S be an n × n matrix with all entries equal to $\tfrac1n$.
Let m be such that 0 ≤ m ≤ 1.
Definition
The Google matrix of the web is
$$M = (1 - m)A + mS.$$
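The definition translates directly into code. This is a minimal sketch using plain Python lists and floating-point arithmetic; the function name `google_matrix` is illustrative, and the matrix A is the link matrix of Example 1.

```python
def google_matrix(A, m):
    """Return M = (1 - m) * A + m * S, where S has every entry 1/n."""
    n = len(A)
    return [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

# Link matrix of Example 1.
A = [[0, 0, 1, 1/2],
     [1/3, 0, 0, 0],
     [1/3, 1/2, 0, 1/2],
     [1/3, 1/2, 0, 0]]

M = google_matrix(A, m=0.15)
column_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]
```

For 0 < m ≤ 1 every entry of M is at least m/n, so M is positive, and when A is column-stochastic the column sums of M are again 1.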

Google matrix: Damping factor
Definition
The constant 1 − m in the definition of the Google matrix is called
the damping factor of the Google matrix. (The creators of
Google’s search algorithm chose 0.85 as the damping factor.)

Google’s approach: Importance score
Definition
Let M be the Google matrix of a web having n pages. Let xk be
the importance score of Page k in the web and let
x = [x1 x2 · · · xn]T . Then a solution of the matrix equation
Mx = x
is called the importance score vector of the web.

Google’s approach: Importance score
Definition (alternate)
Let M be the Google matrix of a web having n pages. Let xk be
the importance score of Page k in the web and let
x = [x1 x2 · · · xn]T . Then an eigenvector of the matrix M
having eigenvalue 1 is called the importance score vector of the
web.

Google’s approach: Example 1
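Google matrix: Example 1.
m = 0.15
$$\begin{aligned}
M &= (1 - m)A + mS \\
&= (1 - 0.15)\begin{bmatrix} 0 & 0 & 1 & \tfrac12 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & \tfrac12 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix} + 0.15\begin{bmatrix} \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \end{bmatrix} \\
&= \begin{bmatrix} 0.03750 & 0.03750 & 0.88750 & 0.46250 \\ 0.3208\overline{3} & 0.03750 & 0.03750 & 0.03750 \\ 0.3208\overline{3} & 0.46250 & 0.03750 & 0.46250 \\ 0.3208\overline{3} & 0.46250 & 0.03750 & 0.03750 \end{bmatrix}
\end{aligned}$$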

The importance scores are solutions of the matrix equation
Mx = x,
which are the eigenvectors of M having the eigenvalue 1.
M is column stochastic.
M has 1 as an eigenvalue.
M has an eigenvector having eigenvalue 1.
The web in Example 1 has an importance score vector as per
Google’s approach.
Is the importance score vector unique?

The eigenvector of M (in Example 1) having eigenvalue 1 is
$$x = \left[\tfrac{106613}{58520} \;\; \tfrac{40}{57} \;\; \tfrac{57}{40} \;\; 1\right]^T.$$
The normalised importance score vector is (approximately)
$$x = [0.368 \;\; 0.142 \;\; 0.288 \;\; 0.202]^T.$$
The importance scores of the web pages are
x1 = 0.368, x2 = 0.142, x3 = 0.288, x4 = 0.202.
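The stated eigenvector can be verified exactly rather than to three decimals. The sketch below rebuilds the Google matrix of Example 1 with rational entries (m = 15/100) and checks Mx = x; the variable names are illustrative.

```python
from fractions import Fraction as F

m = F(15, 100)
A = [[0, 0, 1, F(1, 2)],
     [F(1, 3), 0, 0, 0],
     [F(1, 3), F(1, 2), 0, F(1, 2)],
     [F(1, 3), F(1, 2), 0, 0]]
n = len(A)

# Google matrix M = (1 - m) A + m S with every entry of S equal to 1/n.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

x = [F(106613, 58520), F(40, 57), F(57, 40), F(1)]
Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

# Divide by the sum of the components to get the normalised scores.
normalised = [float(xi / sum(x)) for xi in x]
```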

Google’s approach: Example 3

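Google matrix of web in Example 3.
$$\begin{aligned}
M &= (1 - 0.15)\begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \tfrac12 \\ 0 & 0 & 1 & 0 & \tfrac12 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + 0.15\begin{bmatrix} \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 \\ \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 \\ \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 \\ \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 \\ \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 & \tfrac15 \end{bmatrix} \\
&= \begin{bmatrix} 0.030 & 0.880 & 0.030 & 0.030 & 0.030 \\ 0.880 & 0.030 & 0.030 & 0.030 & 0.030 \\ 0.030 & 0.030 & 0.030 & 0.880 & 0.455 \\ 0.030 & 0.030 & 0.880 & 0.030 & 0.455 \\ 0.030 & 0.030 & 0.030 & 0.030 & 0.030 \end{bmatrix}
\end{aligned}$$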

M (in Example 3) is column stochastic.
M (in Example 3) has 1 as an eigenvalue.
The eigenvector of M (in Example 3) having eigenvalue 1 is
x = [0.200 0.200 0.285 0.285 0.030].
The importance scores of the web pages (in Example 3) are
x1 = 0.200, x2 = 0.200, x3 = 0.285, x4 = 0.285, x5 = 0.030.
The scores are all positive.
The scores are unique even though the web has disconnected
subwebs.
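The scores quoted for Example 3 can be checked the same way. This sketch, with illustrative variable names, confirms with exact fractions that the stated vector is a fixed point of the Google matrix and is normalised.

```python
from fractions import Fraction as F

m = F(15, 100)
A = [[0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 1, F(1, 2)],
     [0, 0, 1, 0, F(1, 2)],
     [0, 0, 0, 0, 0]]
n = len(A)

# Google matrix M = (1 - m) A + m S with every entry of S equal to 1/n.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

x = [F(200, 1000), F(200, 1000), F(285, 1000), F(285, 1000), F(30, 1000)]
Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
```

Page 5 has no backlinks, so its score comes entirely from the term mS and equals m/n = 0.03.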

Google’s approach: Mathematics
Definition
A matrix P is said to be positive if all elements of P are positive.

Theorem
If a square matrix P is positive and column-stochastic, then any
eigenvector of P with eigenvalue 1 has either all positive or all
negative components.
Theorem
If a square matrix P is positive and column-stochastic, then the
eigenspace of P corresponding to the eigenvalue 1 has dimension 1.

Properties of Google matrix
Let M be the Google matrix of a web without dangling nodes.
M is positive.
M is column stochastic.
1 is an eigenvalue of M.
The eigenspace of M corresponding to the eigenvalue 1 has
dimension 1.
Continued in next slide

Properties of Google matrix (continued)
M has an eigenvector corresponding to the eigenvalue 1 with
all positive components.
M has a unique eigenvector x = [x1 x2 . . . xn]
corresponding to the eigenvalue 1 such that
xi > 0 for i = 1, 2, . . . , n.
x1 + x2 + · · · + xn = 1.

Computational scheme in
Google’s approach

Computational scheme
Notations:
Let W be a web with n pages and no dangling nodes.
Let A be the link matrix of the web W .
Let 1 − m be the damping factor.
Let u be the n-component column vector with all entries
equal to $\tfrac1n$.
Let $x^{(0)}$ be some n-component column vector with positive
components and $\|x^{(0)}\|_1 = 1$.
Let q be the normalised importance score vector of the web
W .

Computational scheme
The scheme:
Generate the sequence $x^{(1)}, x^{(2)}, \ldots$ of column vectors using the
following iteration scheme:
$$x^{(r+1)} = (1 - m)Ax^{(r)} + mu.$$
Then
$$q = \lim_{r \to \infty} x^{(r)}.$$
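The scheme above can be sketched as a short loop. This is a minimal floating-point illustration, not a production implementation; the function name `pagerank_iteration` and the choice of 100 steps are assumptions made for the example, and A is the link matrix of Example 1.

```python
def pagerank_iteration(A, m, x0, steps):
    """Iterate x <- (1 - m) * A x + m * u, with u = [1/n, ..., 1/n]."""
    n = len(A)
    x = list(x0)
    history = [x]
    for _ in range(steps):
        x = [(1 - m) * sum(A[i][j] * x[j] for j in range(n)) + m / n
             for i in range(n)]
        history.append(x)
    return history

# Link matrix of Example 1.
A = [[0, 0, 1, 1/2],
     [1/3, 0, 0, 0],
     [1/3, 1/2, 0, 1/2],
     [1/3, 1/2, 0, 0]]

history = pagerank_iteration(A, m=0.15, x0=[0.25] * 4, steps=100)
x1, q = history[1], history[-1]   # first iterate and final approximation
```

Each step needs only a product with A, which for a real web is very sparse, plus the addition of a constant vector; the dense matrix M is never formed.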

Computational scheme: Example
Compute the importance score vector of web in Example 1 .
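Notations:
n = 4
$$A = \begin{bmatrix} 0 & 0 & 1 & \tfrac12 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & \tfrac12 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix}$$
m = 0.15
$$u = \left[\tfrac14 \;\; \tfrac14 \;\; \tfrac14 \;\; \tfrac14\right]^T.$$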

We choose $x^{(0)} = \left[\tfrac14 \;\; \tfrac14 \;\; \tfrac14 \;\; \tfrac14\right]^T$.
In the next two slides we show the computations of x(1) and
x(2).

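$$\begin{aligned}
x^{(1)} &= (1 - m)Ax^{(0)} + mu \\
&= (1 - 0.15)\begin{bmatrix} 0 & 0 & 1 & \tfrac12 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & \tfrac12 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix}\begin{bmatrix} \tfrac14 \\ \tfrac14 \\ \tfrac14 \\ \tfrac14 \end{bmatrix} + 0.15\begin{bmatrix} \tfrac14 \\ \tfrac14 \\ \tfrac14 \\ \tfrac14 \end{bmatrix} \\
&= \begin{bmatrix} 0.3562 \\ 0.1083 \\ 0.3208 \\ 0.2146 \end{bmatrix}
\end{aligned}$$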

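$$\begin{aligned}
x^{(2)} &= (1 - m)Ax^{(1)} + mu \\
&= (1 - 0.15)\begin{bmatrix} 0 & 0 & 1 & \tfrac12 \\ \tfrac13 & 0 & 0 & 0 \\ \tfrac13 & \tfrac12 & 0 & \tfrac12 \\ \tfrac13 & \tfrac12 & 0 & 0 \end{bmatrix}\begin{bmatrix} 0.3562 \\ 0.1083 \\ 0.3208 \\ 0.2146 \end{bmatrix} + 0.15\begin{bmatrix} \tfrac14 \\ \tfrac14 \\ \tfrac14 \\ \tfrac14 \end{bmatrix} \\
&= \begin{bmatrix} 0.4014 \\ 0.1384 \\ 0.2757 \\ 0.1845 \end{bmatrix}
\end{aligned}$$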

The values of x(3), x(4), etc. are tabulated in the next slide. Note
that x(11) and x(12) agree to four decimal places, so further
iterations will not change the scores at this precision.

 r    x1(r)    x2(r)    x3(r)    x4(r)
 0   0.2500   0.2500   0.2500   0.2500
 1   0.3562   0.1083   0.3208   0.2146
 2   0.4014   0.1384   0.2757   0.1845
 3   0.3502   0.1512   0.2884   0.2101
 4   0.3720   0.1367   0.2903   0.2010
 5   0.3698   0.1429   0.2864   0.2010
 6   0.3664   0.1422   0.2884   0.2030
 7   0.3689   0.1413   0.2880   0.2018
 8   0.3681   0.1420   0.2878   0.2021
 9   0.3680   0.1418   0.2880   0.2021
10   0.3682   0.1418   0.2879   0.2020
11   0.3681   0.1418   0.2880   0.2021
12   0.3681   0.1418   0.2880   0.2021

The importance scores of various pages in Example 1 are as given
below:
x1 = 0.3681, x2 = 0.1418, x3 = 0.2880, x4 = 0.2021.

Computational scheme: Mathematics
Power method to find an eigenvector of a matrix G.
Start with an initial guess (initial approximation) $x^{(0)}$.
Generate successive approximations $x^{(r)}$ by the iteration
scheme
$$x^{(r)} = Gx^{(r-1)},$$
or equivalently,
$$x^{(r)} = G^r x^{(0)}.$$
For large r, the vector $x^{(r)}$ is a good approximation to an
eigenvector of G.
The power method produces successive approximations to the
eigenvector corresponding to the eigenvalue of G that is largest in
absolute value.

Modified power method to find an eigenvector of a
matrix G.
Let $x^{(r)} = G^r x^{(0)}$, for r = 1, 2, . . . .
$x^{(r)}$ may diverge to infinity or may decay to the zero vector.
A better iteration scheme is
$$x^{(r)} = \frac{Gx^{(r-1)}}{\|Gx^{(r-1)}\|},$$
where $\| \cdot \|$ is some vector norm.
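The normalised iteration can be sketched as follows, assuming the 1-norm as the vector norm. The function name `power_method` and the 2 × 2 test matrix are illustrative choices and do not come from the slides.

```python
def power_method(G, x0, steps):
    """Repeat x <- G x / ||G x||_1 and return the final iterate."""
    x = list(x0)
    for _ in range(steps):
        y = [sum(g * xi for g, xi in zip(row, x)) for row in G]
        norm = sum(abs(v) for v in y)
        if norm == 0:
            raise ValueError("G x is the zero vector; choose another x0")
        x = [v / norm for v in y]
    return x

# Illustrative matrix (not from the slides) with eigenvalues 2 and 1.
G = [[2.0, 0.0],
     [0.0, 1.0]]
v = power_method(G, x0=[1.0, 1.0], steps=60)
```

Without the division by the norm the first component would grow like 2^r; with it, the iterates stay on the unit 1-norm sphere and approach the eigenvector [1, 0] belonging to the dominant eigenvalue 2.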

Power method applied to Google matrix
We apply the power method to compute the importance score
vector of a web.
In general, the power method converges to the importance
score eigenvector only if 1 is the largest eigenvalue of the
Google matrix.
However, we can prove directly that the iteration converges
to the importance score eigenvector, without first showing
that 1 is the greatest eigenvalue of the Google matrix.
See next few slides ...

Power method applied to Google matrix
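Let M be the Google matrix of a web. We have
$$M = (1 - m)A + mS.$$
Let $x^{(r)}$ be a normalised column vector with positive components.
Every entry of S is $\tfrac1n$ and the components of $x^{(r)}$ sum to 1, so
$Sx^{(r)} = u$, the vector with all entries equal to $\tfrac1n$. Hence
$$\begin{aligned}
x^{(r+1)} &= Mx^{(r)} \\
&= ((1 - m)A + mS)x^{(r)} \\
&= (1 - m)Ax^{(r)} + mSx^{(r)} \\
&= (1 - m)Ax^{(r)} + mu.
\end{aligned}$$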

Definition
The 1-norm of a vector v is
$$\|v\|_1 = |v_1| + |v_2| + \cdots + |v_n|.$$

Theorem
Let P be a positive column-stochastic n × n real matrix and let V
be the subspace of $\mathbb{R}^n$ consisting of vectors v such that $\sum_j v_j = 0$.
Then:
1 $Pv \in V$ for any $v \in V$.
2 $\|Pv\|_1 \le c\,\|v\|_1$ for any $v \in V$, where
$$c = \max_{1 \le j \le n} \left| 1 - 2 \min_{1 \le i \le n} P_{ij} \right| < 1.$$
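For a concrete feel for the constant c, the sketch below evaluates it for the Google matrix of Example 1 and checks both parts of the theorem on one vector of V. The helper names and the test vector v are illustrative choices.

```python
def contraction_constant(P):
    """c = max over columns j of |1 - 2 * min_i P[i][j]|."""
    n = len(P)
    return max(abs(1 - 2 * min(P[i][j] for i in range(n))) for j in range(n))

def one_norm(v):
    """Sum of absolute values of the components."""
    return sum(abs(t) for t in v)

# Google matrix of Example 1 with m = 0.15.
A = [[0, 0, 1, 1/2],
     [1/3, 0, 0, 0],
     [1/3, 1/2, 0, 1/2],
     [1/3, 1/2, 0, 0]]
M = [[0.85 * a + 0.15 / 4 for a in row] for row in A]

c = contraction_constant(M)
v = [1.0, -1.0, 0.0, 0.0]            # components sum to zero, so v is in V
Mv = [sum(p * t for p, t in zip(row, v)) for row in M]
```

Every column of this M has smallest entry 0.0375, so c = 1 − 2(0.0375) = 0.925. The bound is a worst-case rate; the table of iterates for Example 1 settles faster than this bound alone guarantees.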

Theorem
Every positive column-stochastic matrix P has a unique vector q
with positive components such that Pq = q with $\|q\|_1 = 1$. The
vector q can be computed as
$$q = \lim_{r \to \infty} P^r x_0$$
for any initial guess $x_0$ with positive components such that
$\|x_0\|_1 = 1$.

References
Kurt Bryan and Tanya Leise, “The $25,000,000,000
eigenvector: The linear algebra behind Google”, SIAM
Review, Vol. 48, No. 3, pp. 569–581 (2006).
Amy N. Langville and Carl D. Meyer, “Deeper Inside
PageRank”, 2004.
Hwai-Hui Fu, Dennis K.J. Lin and Hsien-Tang Tsai,
“Damping factor in Google page ranking”, Appl. Stochastic
Models Bus. Ind., 2006; 22:431–444.
Christiane Rousseau and Yvan Saint-Aubin, Mathematics and
Technology (Chapter 9), Springer Undergraduate Texts in
Mathematics and Technology, 2008.
Monica Bianchini, Marco Gori, and Franco Scarselli, “Inside
PageRank”, ACM Transactions on Internet Technology, Vol.
5, No. 1, February 2005, pp. 92–128.
Sergey Brin and Lawrence Page, “The Anatomy of a
Large-Scale Hypertextual Web Search Engine”, In Proceedings
of the 7th World Wide Web Conference (WWW7), 1998.
