Ph.D. Defense: Models and Algorithms for PageRank sensitivity

Models and Algorithms for
PageRank Sensitivity

David F. Gleich
Stanford University

Ph.D. Oral Defense
Institute for Computational
and Mathematical Engineering

May 26, 2009

Gleich (Stanford) Ph.D. Defense 1 / 41

Outline
PageRank intro

Sensitivity

Random sensitivity

Inner-Outer

Summary


Five years!
2004 | 2009
Firefox 1.0 | Firefox 3.5
Wikipedia? | Wikipedia! YouTube! Hulu!
Facebook? | Facebook! flickr! Twitter!
Gmail? | Gmail! Google Maps!
Yahoo! | Yahoo?
3.0 GHz | 3.0 GHz × 4
Google | Google


PageRank intro

A cartoon websearch primer

1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit measures to human evaluations
5. Produce rankings
6. Continually update


PageRank by Google
The places we find the surfer most often are important pages.

[Figure: six-node example graph]

The Model
1. follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we’ll assume everywhere is equally likely


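Some PageRank details

$$P = \begin{bmatrix}
1/6 & 1/2 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 1/3 & 0 & 0 \\
1/6 & 1/2 & 0 & 1/3 & 0 & 0 \\
1/6 & 0 & 1/2 & 0 & 0 & 0 \\
1/6 & 0 & 1/2 & 1/3 & 0 & 1 \\
1/6 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}, \qquad P_{ij} \ge 0, \qquad e^T P = e^T$$

“jump” → $v = [\tfrac{1}{n} \; \dots \; \tfrac{1}{n}]^T$, $v_i \ge 0$, $e^T v = 1$

Markov chain: $(\alpha P + (1 - \alpha) v e^T) x = x$; unique $x$ ⇒ $x_j \ge 0$, $e^T x = 1$.
Linear system: $(I - \alpha P) x = (1 - \alpha) v$
Small detail: dangling nodes patched back to v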
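
The fixed-point form of the Markov chain equation can be tried directly on the six-node example. The following is a minimal Python sketch, not code from the talk: the helper names `matvec` and `pagerank_power`, the tolerance, and the choice α = 0.85 are illustrative assumptions.

```python
# Column-stochastic matrix of the six-node example. Column 1 belongs to the
# dangling node and is already patched to the uniform jump vector v.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
n = len(P)
v = [1.0 / n] * n  # uniform jump vector


def matvec(A, x):
    """Dense matrix-vector product A @ x on plain lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def pagerank_power(P, v, alpha, tol=1e-12, max_iter=10_000):
    """Iterate x <- alpha*P*x + (1-alpha)*v until the 1-norm change is below tol."""
    x = list(v)
    for _ in range(max_iter):
        y = matvec(P, x)
        x_new = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("power iteration did not converge")


alpha = 0.85
x = pagerank_power(P, v, alpha)

# Residual of the linear system (I - alpha*P) x = (1 - alpha) v.
Px = matvec(P, x)
residual = sum(abs(xi - alpha * pxi - (1 - alpha) * vi)
               for xi, pxi, vi in zip(x, Px, v))
```

The same vector satisfies both the Markov chain equation and the linear system, so `residual` should be near machine precision and the entries of `x` should be nonnegative and sum to 1.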

Other uses for PageRank
What else people use PageRank to do

GeneRank
ProteinRank
IsoRank
Clustering (graph partitioning)
Sports ranking
Teaching

[Figure: GeneRank example with rows labeled by gene identifiers (e.g. NM_003748, Contig32125_RC); horizontal axis from 10 to 70]

GeneRank: use $(I - \alpha G D^{-1}) x = w$ to find “nearby” important genes.

Morrison et al. GeneRank, 2005.

My other projects

Prior
Parallel Krylov Methods (PageRank)
Gleich, Zhukov, and Berkhin, Yahoo! Research Labs Technical Report, YRL-2004-038; Gleich and Zhukov, SuperComputing poster, 2005.
Does existing software work for computing PageRank on a cluster?

Approximate Personal PageRank
Gleich and Polito, Internet Math. 3(3):257-294, 2007.
Can you build a web search engine on your PC?

Ongoing
Parameterized Matrix Problems (with Paul Constantine)
A(s)x(s) = b(s)
Come back here for his defense on Monday, June 1st at 1:30pm!

Network Alignment (with Mohsen Bayati, Margot Gerritsen, Amin Saberi, and Ying Wang)

My Software
Packages: MatlabBGL, libbvg, gaimc, vismatrix, parameterized matrix package (with Paul)
Publications: Random α PageRank, Inner-Outer PageRank

Sensitivity

Which sensitivity?

Sensitivity to the links : examined and understood

Sensitivity to the jump : examined, understood, and useful

Sensitivity to α : less well understood


PageRank on Wikipedia
α = 0.50 | α = 0.85 | α = 0.99
United States | United States | C:Contents
C:Living people | C:Main topic classif. | C:Main topic classif.
France | C:Contents | C:Fundamental
Germany | C:Living people | United States
England | C:Ctgs. by country | C:Wikipedia admin.
United Kingdom | United Kingdom | P:List of portals
Canada | C:Fundamental | P:Contents/Portals
Japan | C:Ctgs. by topic | C:Portals
Poland | C:Wikipedia admin. | C:Society
Australia | France | C:Ctgs. by topic

Note: Top 10 articles on Wikipedia with highest PageRank


The PageRank function
Look at the PageRank vector as a function of α,
$(I - \alpha P) x(\alpha) = (1 - \alpha) v$,
and examine its derivative.

My Contributions
Gleich, Glynn, Golub, Greif, Dagstuhl proceedings, 2007.
Compute the derivative with just simple PageRank solves.
Empirically evaluated the derivative as a rank change predictor.

Others
Golub and Greif, 2004; Boldi et al., 2005; Berkhin, 2005; Langville and Meyer, 2006.
PageRank becomes more sensitive as α → 1.
The PageRank vector at α = 1 is well defined.

α matters!
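
Differentiating $(I - \alpha P) x(\alpha) = (1 - \alpha) v$ with respect to α gives $(I - \alpha P) x'(\alpha) = P x(\alpha) - v$, a system with the same matrix as PageRank. The sketch below is one way to realize that idea in Python and may differ from the construction in the cited paper. The helper names (`solve_shifted`, `pagerank`, `pagerank_derivative`) and tolerances are assumptions made for illustration.

```python
# Six-node example matrix (column stochastic) and uniform jump vector.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
n = len(P)
v = [1.0 / n] * n


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve_shifted(P, b, alpha, tol=1e-13, max_iter=100_000):
    """Solve (I - alpha*P) z = b with the fixed-point iteration z <- alpha*P*z + b."""
    z = list(b)
    for _ in range(max_iter):
        Pz = matvec(P, z)
        z_new = [alpha * p + bi for p, bi in zip(Pz, b)]
        if sum(abs(a - c) for a, c in zip(z_new, z)) < tol:
            return z_new
        z = z_new
    raise RuntimeError("iteration did not converge")


def pagerank(P, v, alpha):
    """x(alpha) from (I - alpha*P) x = (1 - alpha) v."""
    return solve_shifted(P, [(1 - alpha) * vi for vi in v], alpha)


def pagerank_derivative(P, v, alpha):
    """x'(alpha) from (I - alpha*P) x' = P x(alpha) - v."""
    x = pagerank(P, v, alpha)
    rhs = [p - vi for p, vi in zip(matvec(P, x), v)]
    return solve_shifted(P, rhs, alpha)


dx = pagerank_derivative(P, v, 0.85)
```

Because the entries of x(α) sum to 1 for every α, the entries of `dx` should sum to approximately 0, and a central finite difference of `pagerank` gives an independent check.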

Random sensitivity

What is alpha?
Author α
Brin and Page (1998) 0.85
Najork et al. (2007) 0.85
Litvak et al. (2006) 0.5
Experiment (slide 20) 0.375
Algorithms (...) ≥ 0.85

For you, α is clear
Google wants PageRank for everyone


Multiple surfers
Each person picks α from distribution A

...

↓ ↓
x(E [A]) E [x(A)]

x(E[A]) ≠ E[x(A)] in general, because x(α) is a nonlinear function of α


Random alpha PageRank
RAPr

Model PageRank as the random variables

x(A)

and look at
E [x(A)] and Std [x(A)] .

Gleich and Constantine, Workshop on Algorithms on the Web Graph, 2007

What is A?
[Figure: example densities on [0, 1] for Beta(0,0,0.6,0.9), Beta(2,16,0,1), Beta(1,1,0.1,0.9), and Beta(−0.5,−0.5,0.2,0.7)]

A ∼ Beta(a, b, l, r)


Alpha is
[Figure: histogram of observed α values with a density fit, Beta(1.5,0.5); mean 0.375, mode 0.25. Horizontal axis: α from 0 to 1. Vertical axis: density from 0 to 2.]

Data provided by Abraham Flaxman and Asela Gunawardana at Microsoft.

Example
[Figure: six-node example graph with one plot for each of x1 through x6; horizontal axis from 0 to 0.5]


What changes?
$x(A)$, $A \sim \mathrm{Beta}(a, b, l, r)$ with $0 \le l < r \le 1$

1. $E[x_i(A)] \ge 0$ and $\|E[x(A)]\|_1 = 1$;
thus $E[x(A)]$ is a probability distribution.

2. $E[x(A)] = \sum_{\ell=0}^{\infty} E[A^{\ell} - A^{\ell+1}] \, P^{\ell} v$;
thus we can interpret $E[x(A)]$ in length-ℓ paths.

3. for page $i$ with no in-links, $x_i(A) = (1 - A) v_i$;
thus $E[x_i(A)] = x_i(E[A])$ and $\mathrm{Std}[x_i(A)] = v_i \, \mathrm{Std}[A]$
But is this one useful?
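
Properties 2 and 3 both follow from the Neumann series of the PageRank linear system. The short derivation below is a sketch that assumes P is column stochastic, α < 1, and that expectation and summation may be exchanged.

```latex
\begin{align*}
x(\alpha) &= (1-\alpha)(I - \alpha P)^{-1} v
           = (1-\alpha)\sum_{\ell=0}^{\infty} \alpha^{\ell} P^{\ell} v
           = \sum_{\ell=0}^{\infty} \left(\alpha^{\ell} - \alpha^{\ell+1}\right) P^{\ell} v, \\
E[x(A)]   &= \sum_{\ell=0}^{\infty} E\!\left[A^{\ell} - A^{\ell+1}\right] P^{\ell} v, \\
x_i(\alpha) &= \alpha \,(P x(\alpha))_i + (1-\alpha) v_i = (1-\alpha) v_i
           \quad \text{when row } i \text{ of } P \text{ is zero (no in-links).}
\end{align*}
```

The last line is linear in α, which is why the expectation passes through x_i for such pages and the standard deviation is simply scaled by v_i.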


RAPr on Wikipedia
E[x(A)] | Std[x(A)]
United States | United States
C:Living people | C:Living people
France | C:Main topic classif.
United Kingdom | C:Contents
Germany | C:Ctgs. by country
England | United Kingdom
Canada | France
Japan | C:Fundamental
Poland | England
Australia | C:Ctgs. by topic


Std vs. PageRank
Does it tell us more than just PageRank?
uk2006: 77M nodes and 2B edges

Intersection similarity:
$$\mathrm{isim}(k) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{2j} \left| \mathrm{Diff}[Y(1{:}j), Z(1{:}j)] \right|$$

[Figure: intersection similarity isim(k) versus k (log scale, 10^0 to 10^6). The vertical axis runs from 0 (identical) to 1 (disjoint). Curves: Std[x(A1)] vs. x(0.85), Std[x(A2)] vs. x(0.5), Std[x(A3)] vs. x(0.85).]

Kendall’s τ
τ(x(E1), S1) = +0.3
τ(x(E2), S2) = −0.5
τ(x(0.85), S3) = −0.2

A1 ∼ Beta(2, 16, [0, 1]), A2 ∼ Beta(1, 1, [0, 1]), A3 ∼ Beta(0.5, 1.5, [0, 1])
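
The intersection similarity formula can be computed incrementally over the two rankings. This is a minimal Python sketch that assumes Diff denotes the symmetric set difference and that each ranking lists distinct items; the function name `isim` mirrors the slide, while the toy rankings are made up for illustration.

```python
def isim(Y, Z, k):
    """Intersection similarity of the top-k prefixes of rankings Y and Z.

    Averages |symmetric difference of Y[:j] and Z[:j]| / (2j) over j = 1..k,
    so 0 means identical prefixes and 1 means disjoint prefixes.
    """
    if k < 1 or k > min(len(Y), len(Z)):
        raise ValueError("k must be between 1 and the shorter ranking length")
    top_y, top_z = set(), set()
    total = 0.0
    for j in range(1, k + 1):
        top_y.add(Y[j - 1])
        top_z.add(Z[j - 1])
        total += len(top_y ^ top_z) / (2 * j)
    return total / k


# Toy rankings (hypothetical page labels).
same = isim(["a", "b", "c"], ["a", "b", "c"], 3)
disjoint = isim(["a", "b", "c"], ["x", "y", "z"], 3)
swapped = isim(["a", "b", "c"], ["b", "a", "c"], 3)
```

In the swapped case only the length-1 prefixes differ, so the contributions are 1, 0, and 0, and the average is 1/3.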

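Computation
1. Monte Carlo
$$E[x(A)] \approx \frac{1}{N} \sum_{i=1}^{N} x(\alpha_i), \qquad \alpha_i \sim A$$

2. path damping
$$E[x(A)] \approx \sum_{\ell=0}^{N} E[A^{\ell} - A^{\ell+1}] \, P^{\ell} v$$

3. quadrature
$$E[x(A)] = \int_{l}^{r} x(\alpha) \, d\rho(\alpha) \approx \sum_{i=1}^{N} x(\zeta_i) \, \omega_i$$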
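
The three estimators can be compared on the six-node example. This Python sketch is illustrative only: it assumes A is uniform on [l, r] = [0.1, 0.9] (chosen because its moments have a closed form, not one of the Beta distributions from the slides), uses a fixed 5-point Gauss-Legendre rule, and all function names are hypothetical.

```python
import random

P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
n = len(P)
v = [1.0 / n] * n
l, r = 0.1, 0.9  # assumption: A ~ Uniform(l, r)


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def pagerank(alpha, tol=1e-12):
    """Power iteration for x(alpha); converges because alpha <= r < 1."""
    x = list(v)
    while True:
        y = matvec(P, x)
        x_new = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new


def moment(m):
    """E[A^m] for A ~ Uniform(l, r)."""
    return (r ** (m + 1) - l ** (m + 1)) / ((m + 1) * (r - l))


def monte_carlo(N, seed=0):
    """Average N PageRank vectors at sampled values of alpha."""
    rng = random.Random(seed)
    acc = [0.0] * n
    for _ in range(N):
        x = pagerank(rng.uniform(l, r))
        acc = [a + xi for a, xi in zip(acc, x)]
    return [a / N for a in acc]


def path_damping(N):
    """Truncated series sum_{ell=0}^{N} E[A^ell - A^(ell+1)] P^ell v."""
    acc = [0.0] * n
    w = list(v)  # holds P^ell v
    for ell in range(N + 1):
        c = moment(ell) - moment(ell + 1)
        acc = [a + c * wi for a, wi in zip(acc, w)]
        w = matvec(P, w)
    return acc


# 5-point Gauss-Legendre nodes and weights on [-1, 1].
GL_NODES = [-0.9061798459386640, -0.5384693101056831, 0.0,
            0.5384693101056831, 0.9061798459386640]
GL_WEIGHTS = [0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
              0.4786286704993665, 0.2369268850561891]


def quadrature():
    """Gauss-Legendre estimate of E[x(A)] under the uniform density on [l, r]."""
    mid, half = (l + r) / 2, (r - l) / 2
    acc = [0.0] * n
    for z, wt in zip(GL_NODES, GL_WEIGHTS):
        x = pagerank(mid + half * z)
        acc = [a + 0.5 * wt * xi for a, xi in zip(acc, x)]
    return acc


mc = monte_carlo(500)
pd = path_damping(200)
gq = quadrature()
```

On this small problem, path damping and quadrature should agree closely with far fewer PageRank solves than Monte Carlo, which is only accurate to roughly 1/√N.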


Time
cnr2000: 325k nodes and 3M edges

[Figure: convergence versus time for Monte Carlo, Path Damping, and Quadrature. Horizontal axis: Time (sec), log scale from 10^-2 to 10^4. Vertical axis: log scale from 10^-15 to 10^0.]



Convergence theory
Method | Conv. | Work Required | What is N?
Monte Carlo | $1/\sqrt{N}$ | N PageRank systems | number of samples from A
Path Damping (without Std[x(A)]) | $r^{N+2} / N^{1+a}$ | N + 1 matrix vector products | terms of Neumann series
Gaussian Quadrature | $r^{2N}$ | N PageRank systems | number of quadrature points

a and r are parameters from Beta(a, b, l, r)


Webspam application
Hosts of uk-2006 are labeled as spam, not-spam, other

Method | P | R | f | FP | FN
Baseline | 0.694 | 0.558 | 0.618 | 0.034 | 0.442
Beta(0.5,1.5) | 0.695 | 0.561 | 0.621 | 0.034 | 0.439
Beta(1,1) | 0.698 | 0.562 | 0.622 | 0.033 | 0.438
Beta(2,16) | 0.699 | 0.562 | 0.623 | 0.033 | 0.438

P = precision, R = recall, f = F-measure, FP = false positive rate, FN = false negative rate.

Note: Bagged (10) J48 decision tree classifier in Weka, mean of 50 repetitions from
10-fold cross-validation of 4948 non-spam and 674 spam hosts (5622 total).
Becchetti et al. Link analysis for Web spam detection, 2008.

Inner-Outer

Motivation
Why another PageRank algorithm?

For the RAPr codes, we need (solvers range from fancy to simple)
1. reliable code
2. fast code over a range of α’s
→ Use Matlab’s “\”
3. code for big problems
→ Use a Gauss-Seidel or custom Richardson method
4. code with only matvec products
→ Use the inner-outer iteration
5. code with only 2 vectors of memory
→ Use the power method


Inner-Outer
Note: PageRank is easier when α is smaller
Thus: Solve PageRank with itself using β < α!

Outer: $(I - \beta P) x^{(k+1)} = (\alpha - \beta) P x^{(k)} + (1 - \alpha) v \equiv f^{(k)}$

Inner: $y^{(j+1)} = \beta P y^{(j)} + (\alpha - \beta) P x^{(k)} + (1 - \alpha) v$

A new parameter? What is β? 0.5
How many inner iterations? Until a residual of $10^{-2}$

Gray, Greif, Lau, 2007.

Inner-Outer algorithm
Input: P, v, α, τ, (β = 0.5, η = 10^-2)
Output: x
x ← v
y ← Px
while ‖αy + (1 − α)v − x‖_1 ≥ τ
    f ← (α − β)y + (1 − α)v
    repeat
        x ← f + βy
        y ← Px
    until ‖f + βy − x‖_1 < η
end while
x ← αy + (1 − α)v

if 0 ≤ β ≤ α, convergence with any η
uses only three vectors of memory
with β = 0.5, η = 10^-2, often faster than the power method (or just a titch slower)

Note: the inner loop checks its condition after doing one iteration.
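
The pseudocode translates almost line for line into Python. The sketch below runs it on the six-node example from the introduction. It is a dense, list-based illustration rather than the authors' implementation, and the multiplication counter and helper names are additions for illustration.

```python
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
n = len(P)
v = [1.0 / n] * n


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm1(x):
    return sum(abs(xi) for xi in x)


def inner_outer(P, v, alpha, tau=1e-10, beta=0.5, eta=1e-2):
    """Inner-outer iteration; returns (x, number of products with P)."""
    x = list(v)
    y = matvec(P, x)
    mults = 1
    # Outer loop: stop when the PageRank residual of x drops below tau.
    while norm1([alpha * yi + (1 - alpha) * vi - xi
                 for yi, vi, xi in zip(y, v, x)]) >= tau:
        f = [(alpha - beta) * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        # Inner loop (repeat ... until): the test runs after each pass.
        while True:
            x = [fi + beta * yi for fi, yi in zip(f, y)]
            y = matvec(P, x)
            mults += 1
            if norm1([fi + beta * yi - xi
                      for fi, yi, xi in zip(f, y, x)]) < eta:
                break
    x = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
    return x, mults


alpha = 0.85
x, mults = inner_outer(P, v, alpha)

# Independent check: residual of (I - alpha*P) x = (1 - alpha) v.
Px = matvec(P, x)
residual = norm1([xi - alpha * pxi - (1 - alpha) * vi
                  for xi, pxi, vi in zip(x, Px, v)])
```

Once the outer residual is small, the inner loop exits after a single pass and each outer step reduces to a power-method step, which is why the loop still reaches a tolerance τ far below η.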


Performance
[Figure: residual (log scale, 10^0 down to 10^-7) versus number of multiplications for the power method (power) and the inner-outer iteration (inout) on wb-edu. Left panel: α = 0.85, up to 80 multiplications. Right panel: α = 0.99, up to 1200 multiplications. Each panel has an inset showing the first iterations.]

τ = 10^-7, β = 0.5, η = 10^-2;
wb-edu graph (9.8M nodes, 57.M edges)


Extensions

1. A large scale shared-memory parallel version on
compressed web graphs
2. A Gauss-Seidel variant
3. A BiCG-STAB preconditioner
4. A conjecture about the performance of the iteration
5. Showed the algorithm converges for “any” β, η

Gleich, Gray, Greif, Lau, submitted.

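Convergence Result
Sketch of convergence result
1. error after j steps of the inner iteration
$$f^{(j)} = \left( \alpha \beta^{j-1} P^{j} + \frac{\alpha - \beta}{\beta} \sum_{\ell=1}^{j-1} \beta^{\ell} P^{\ell} \right) f^{(0)}$$

2. upper bound error by
$$\| f^{(j)} \|_1 \le \frac{(\alpha - \beta) + (1 - \alpha) \beta^{j}}{1 - \beta} \, \| f^{(0)} \|_1 .$$

3. notice
$$\| f^{(j)} \|_1 \le \alpha \, \| f^{(0)} \|_1 , \qquad j \ge 1$$
4. hence, convergence as long as β ≤ α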
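
The step from 1 to 2 can be filled in with the triangle inequality and a geometric sum. This is a sketch that assumes the 1-norm, so that $\|P\|_1 = 1$ for column-stochastic P, and 0 < β ≤ α so that every coefficient is nonnegative.

```latex
\begin{align*}
\| f^{(j)} \|_1
  &\le \left( \alpha \beta^{j-1} + \frac{\alpha - \beta}{\beta} \sum_{\ell=1}^{j-1} \beta^{\ell} \right) \| f^{(0)} \|_1 \\
  &= \left( \alpha \beta^{j-1} + \frac{(\alpha - \beta)\,(1 - \beta^{j-1})}{1 - \beta} \right) \| f^{(0)} \|_1 \\
  &= \frac{(\alpha - \beta) + (1 - \alpha)\, \beta^{j}}{1 - \beta} \, \| f^{(0)} \|_1 .
\end{align*}
```

At j = 1 the factor equals α, and it decreases as j grows, which gives the bound in step 3.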


Summary

Conclusions

α matters
sensitivity is useful
everything is just PageRank


Contributions
1. Derivative
Gleich, Glynn, Golub, Greif, 2007.
New technique to compute the derivative using just PageRank

2. RAPr
Constantine and Gleich, 2007; Constantine, Gleich, and Iaccarino, submitted.
New PageRank model and sensitivity measure
Range of algorithms and algorithmic analysis
Empirically helpful for spam identification
Robust software

3. Inner-Outer
Gleich, Gray, Greif, Lau, submitted.
Improved convergence analysis
Gauss-Seidel and preconditioning variants
Shared-memory parallel implementation
Robust software


Thanks!

Michael Saunders (My Advisor)
Hector Garcia-Molina
Chen Greif
Art Owen
Amin Saberi


Margot Gerritsen Debbie Heimowitz
Peter Glynn Jason Azicri
Walter Murray Steven Fan
Reid Andersen Paul Constantine
Pavel Berkhin Michael Atkinson
Kevin Lang Jeremy Kozdon
Amy Langville Esteban Arcaute
Matthew Rasmussen
Sebastiano Vigna
Adam Guetz
Will Fong
Leonid Zhukov Andrew Bradley
Indira Choudhury
Seth Tornborg
Nick Henderson
Chris Maes

THANK YOU

Brian Tempero Nicole Taheri
Prisilla Williams Ying Wang
Deb Michael Nick West
Mayita Romero Kaustuv's Rum
Les Fletcher Saeco Coffee Machine
Hugh Fletcher Napa Valley
Lindsey Fletcher Matlab
Jane Fletcher superlu
