1. Importance of Web Pages
PageRank
Topic-Specific PageRank
Hubs and Authorities
Combatting Link Spam
1
2. PageRank
Intuition: solve the recursive equation:
“a page is important if important pages
link to it.”
In high-falutin’ terms: importance =
the principal eigenvector of the
transition matrix of the Web.
A few fixups needed.
2
3. Transition Matrix of the Web
Enumerate pages.
Page i corresponds to row and column i.
M[i, j] = 1/n if page j links to n
pages, including page i; 0 if j does not
link to i.
M[i, j] is the probability we’ll next be at
page i if we are now at page j.
3
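A minimal sketch of building such a matrix, assuming Python with NumPy
(the function name and adjacency-list format are illustrative, not from the slides):

import numpy as np

def transition_matrix(out_links, n_pages):
    """out_links[j] = list of pages that page j links to."""
    M = np.zeros((n_pages, n_pages))
    for j, targets in out_links.items():
        for i in targets:
            M[i, j] = 1.0 / len(targets)   # column j sums to 1 unless j is a dead end
    return M

# The three-page Web of slide 7: 0 = Yahoo, 1 = Amazon, 2 = M'soft.
links = {0: [0, 1], 1: [0, 2], 2: [1]}
print(transition_matrix(links, 3))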
5. Random Walks on the Web
Suppose v is a vector whose i th
component is the probability that a
random walker is at page i at a certain
time.
If each walker follows a random out-link
from the page it is on, the probability
distribution for walkers is then given by
the vector Mv.
5
6. Random Walks – (2)
Starting from any vector v, the limit
M (M (…M (M v ) …)) is the long-term
distribution of walkers.
Intuition: pages are important in
proportion to how likely a walker is to
be there.
The math: limiting distribution =
principal eigenvector of M = PageRank.
6
7. Example: The Web in 1839
[Figure: the link graph among Yahoo, Amazon, and M’soft.]
       y    a    m
  y  1/2  1/2    0
  a  1/2    0    1
  m    0  1/2    0
7
8. Solving The Equations
Because there are no constant terms,
the equations v = M v do not have a
unique solution.
In Web-sized examples, we cannot
solve by Gaussian elimination anyway;
we need to use relaxation (= iterative
solution).
Can work if you start with a fixed v.
8
9. Simulating a Random Walk
Start with the vector v = [1, 1,…, 1]
representing the idea that each Web
page is given one unit of importance.
Repeatedly apply the matrix M to v,
allowing the importance to flow like a
random walk.
About 50 iterations is sufficient to
estimate the limiting solution.
9
10. Example: Iterating Equations
Equations v = M v (note: “=” is really “assignment”):
y = y/2 + a/2
a = y/2 + m
m = a/2
Iterating from [1, 1, 1]:
  y:  1    1    5/4   9/8   ...  6/5
  a:  1   3/2    1   11/8   ...  6/5
  m:  1   1/2   3/4   1/2   ...  3/5
10
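A small Python/NumPy sketch of this iteration on the slide-7 matrix
(illustrative, not part of the slides):

import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # y
              [0.5, 0.0, 1.0],    # a
              [0.0, 0.5, 0.0]])   # m
v = np.ones(3)                    # one unit of importance per page
for _ in range(50):               # ~50 iterations, as the slides suggest
    v = M @ v
print(v)                          # approaches [6/5, 6/5, 3/5]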
16. Real-World Problems
Some pages are “dead ends” (have no
links out).
Such a page causes importance to leak out.
Other groups of pages are spider traps
(all out-links are within the group).
Eventually spider traps absorb all importance.
16
24. M’soft Becomes Spider Trap
[Figure: the same graph, but M’soft now links only to itself.]
       y    a    m
  y  1/2  1/2    0
  a  1/2    0    0
  m    0  1/2    1
24
25. Example: Effect of Spider Trap
Equations v = M v :
y = y/2 + a/2
a = y/2
m = a/2 + m
Iterating from [1, 1, 1]:
  y:  1    1    3/4   5/8   ...  0
  a:  1   1/2   1/2   3/8   ...  0
  m:  1   3/2   7/4    2    ...  3
25
30. PageRank Solution to Traps, Etc.
“Tax” each page a fixed percentage at
each iteration.
Add a fixed constant to all pages.
Models a random walk with a fixed
probability of leaving the system, and a
fixed number of new walkers injected
into the system at each step.
30
31. Example: Microsoft is a Spider
Trap; 20% Tax
Equations v = 0.8(M v) + 0.2:
y = 0.8(y/2 + a/2) + 0.2
a = 0.8(y/2) + 0.2
m = 0.8(a/2 + m) + 0.2
Iterating from [1, 1, 1]:
  y:  1   1.00   0.84   0.776   ...   7/11
  a:  1   0.60   0.60   0.536   ...   5/11
  m:  1   1.40   1.56   1.688   ...  21/11
31
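A sketch of the taxed iteration v := 0.8(Mv) + 0.2 on the spider-trap graph
(Python/NumPy, illustrative):

import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # y
              [0.5, 0.0, 0.0],    # a
              [0.0, 0.5, 1.0]])   # m: M'soft links only to itself
v = np.ones(3)
for _ in range(50):
    v = 0.8 * (M @ v) + 0.2       # 20% tax; a fixed 0.2 added to every page
print(v)                          # approaches [7/11, 5/11, 21/11]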
32. General Case
In this example, because there are no
dead-ends, the total importance
remains at 3.
In examples with dead-ends, some
importance leaks out, but total remains
finite.
32
33. Solving the Equations
Because there are constant terms, we
can expect to solve small examples by
Gaussian elimination.
Web-sized examples still need to be
solved by relaxation.
33
34. Finding a Good Starting Vector
1. Newton-like prediction of where
components of the principal
eigenvector are heading.
2. Take advantage of locality in the Web.
Each technique can reduce the
number of iterations by 50%.
Important – PageRank takes time!
34
35. Predicting Component Values
Three consecutive values for the
importance of a page suggests where
the limit might be.
[Figure: three consecutive values 1.0, 0.7, 0.6 for a page’s importance,
with 0.55 as the guess for the next round.]
35
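One plausible form of such a prediction is Aitken’s delta-squared extrapolation;
the sketch below is an assumption, not necessarily the slides’ exact method:

def predict_limit(v1, v2, v3):
    """Extrapolate assuming the differences shrink roughly geometrically."""
    d1, d2 = v2 - v1, v3 - v2
    if d2 == d1:                  # no curvature information; fall back to the last value
        return v3
    return v3 - d2 * d2 / (d2 - d1)

print(predict_limit(1.0, 0.7, 0.6))   # about 0.55, as in the figure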
36. Exploiting Substructure
Pages from particular domains, hosts,
or directories, like stanford.edu or
infolab.stanford.edu/~ullman
tend to have many internal links.
Initialize PageRank by first ranking pages
within each local cluster, then ranking
the clusters themselves.
36
37. Strategy
Compute local PageRanks (in parallel?).
Use local weights to establish weights on
edges between clusters.
Compute PageRank on graph of clusters.
Initial rank of a page is the product of its
local rank and the rank of its cluster.
“Clusters” are appropriately sized regions
with common domain or lower-level detail.
37
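A rough Python/NumPy sketch of this strategy; the function names and data
layout are assumptions, and step 2 (deriving intercluster weights) is taken
as given here:

import numpy as np

def pagerank(M, beta=0.8, iters=50):
    """Taxed PageRank on a column-stochastic matrix M."""
    n = M.shape[0]
    v = np.ones(n) / n
    for _ in range(iters):
        v = beta * (M @ v) + (1 - beta) / n
    return v

def initial_vector(cluster_matrices, cluster_graph):
    """cluster_graph: transition matrix over clusters, assumed built from
    the intercluster weights of step 2."""
    local = [pagerank(M) for M in cluster_matrices]   # step 1: local ranks
    cluster_rank = pagerank(cluster_graph)            # step 3: rank the clusters
    # step 4: initial rank of a page = its local rank times its cluster's rank
    return np.concatenate([local[c] * cluster_rank[c]
                           for c in range(len(cluster_matrices))])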
38. In Pictures
[Figure: local ranks within each cluster, weights on intercluster edges,
ranks of the clusters, and the resulting initial eigenvector.]
38
39. Topic-Specific PageRank
Goal: Evaluate Web pages not just
according to their popularity, but by
how close they are to a particular topic,
e.g. “sports” or “history.”
Allows search queries to be answered
based on interests of the user.
Example: Query Maccabi wants different
pages depending on whether you are
interested in sports or history.
39
40. Teleport Sets
Assume each walker has a small
probability of “teleporting” at any tick.
Teleport can go to:
1. Any page with equal probability.
As in the “taxation” scheme.
2. A topic-specific set of “relevant” pages
(teleport set ).
For topic-specific PageRank.
40
41. Example: Topic = Software
Only Microsoft is in the teleport set.
Assume 20% “tax.”
I.e., probability of a teleport is 20%.
41
42. Only Microsoft in Teleport Set
[Figure: the Yahoo/Amazon/M’soft graph; M’soft, the only page in the
teleport set, is drawn as Dr. Who’s phone booth.]
42
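A sketch of the iteration with this teleport set (Python/NumPy; using the
slide-7 link graph and normalizing the total to 1 are assumptions here):

import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # y
              [0.5, 0.0, 1.0],    # a
              [0.0, 0.5, 0.0]])   # m
teleport = np.array([0.0, 0.0, 1.0])   # every teleport lands on M'soft
v = np.ones(3) / 3
for _ in range(50):
    v = 0.8 * (M @ v) + 0.2 * teleport
print(v)                                # rank shifts toward M'soft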
49. Picking the Teleport Set
1. Choose the pages belonging to the
topic in Open Directory.
2. “Learn” from examples the typical
words in pages belonging to the topic;
use pages heavy in those words as the
teleport set.
49
50. Application: Link Spam
Spam farmers today create networks of
millions of pages designed to focus
PageRank on a few undeserving pages.
To minimize their influence, use a
teleport set consisting of trusted pages
only.
Example: home pages of universities.
50
51. Hubs and Authorities
Mutually recursive definition:
A hub links to many authorities;
An authority is linked to by many hubs.
Authorities turn out to be places where
information can be found.
Example: course home pages.
Hubs tell where the authorities are.
Example: CS Dept. course-listing page.
51
52. Transition Matrix A
H&A uses a matrix A[i, j] = 1 if page i
links to page j, 0 if not.
Aᵀ, the transpose of A, is similar to the
PageRank matrix M, but Aᵀ has 1’s
where M has fractions.
52
54. Using Matrix A for H&A
Powers of A and Aᵀ have elements of
exponential size, so we need scale factors.
Let h and a be vectors measuring the
“hubbiness” and authority of each page.
Equations: h = λAa; a = µAᵀh.
Hubbiness = scaled sum of authorities of
successor pages (out-links).
Authority = scaled sum of hubbiness of
predecessor pages (in-links).
54
55. Consequences of Basic Equations
From h = λAa; a = µAᵀh we can
derive:
h = λµAAᵀh
a = λµAᵀA a
Compute h and a by iteration,
assuming initially each page has one
unit of hubbiness and one unit of
authority.
Pick an appropriate value of λµ.
55
57. Solving H&A in Practice
Iterate as for PageRank; don’t try to
solve equations.
But keep components within bounds.
Example: scale to keep the largest
component of the vector at 1.
Trick: start with h = [1,1,…,1]; multiply
by Aᵀ to get first a; scale, then multiply
by A to get next h,…
57
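A minimal Python/NumPy sketch of this procedure (the example graph is made up):

import numpy as np

def hits(A, iters=50):
    """A[i, j] = 1 if page i links to page j."""
    h = np.ones(A.shape[0])
    for _ in range(iters):
        a = A.T @ h
        a = a / a.max()     # scale so the largest component is 1
        h = A @ a
        h = h / h.max()
    return h, a

# Tiny example: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
print(hits(A))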
58. Solving H&A – (2)
You may be tempted to compute AAᵀ
and AᵀA first, then iterate these
matrices as for PageRank.
Bad, because these matrices are not
nearly as sparse as A and Aᵀ.
58
59. H&A Versus PageRank
If you talk to someone from IBM, they
may tell you “IBM invented PageRank.”
What they mean is that H&A was invented
by Jon Kleinberg when he was at IBM.
But these are not the same.
H&A does not appear to be a substitute
for PageRank.
But may be used by Ask.com.
59
60. Spam on the Web
Search has become the default
gateway to the web.
Very high premium to appear on the
first page of search results.
60
61. What is Web Spam?
Spamming = any action whose purpose
is to boost a web page’s position in
search engine results, without providing
additional value.
Spam = Web pages used for spamming.
Approximately 10-15% of Web pages
are spam.
61
62. Web-Spam Taxonomy
Boosting techniques :
Techniques for increasing the probability a
Web page will be a highly ranked answer
to a search query.
Hiding techniques :
Techniques to hide the use of boosting
from humans and Web crawlers.
62
63. Hiding techniques
Content hiding.
Use same color for text and page background.
Cloaking.
Return different page to crawlers and browsers.
Redirection.
Redirects are followed by browsers but not
crawlers.
63
64. Boosting Techniques
Term spamming :
Manipulating the text of Web pages in
order to appear relevant to queries.
• Why? You can run ads that are relevant to the
query.
Link spamming :
Creating link structures that boost
PageRank.
64
65. Term Spamming – (1)
Repetition of one or a few specific
terms, e.g., “free,” “cheap,” “Viagra.”
Dumping of a large number of
unrelated terms.
E.g., copy entire dictionaries.
65
66. Term Spamming – (2)
Weaving :
Copy legitimate pages and insert spam
terms at random positions.
Phrase Stitching :
Glue together sentences and phrases
from different sources.
• E.g., use the top-ranked pages on the topic
you want to look like.
66
67. The Google Solution to Term
Spamming
In addition to PageRank, the original
Google engine had another innovation:
it trusted what people said about you in
preference to what you said about
yourself.
Give more weight to words that appear
in or near anchor text than to words
that appear in the page itself.
67
68. The Google Solution – (2)
Today, the Google formula for
matching terms to documents involves
over 250 factors.
E.g., does the word appear in a header?
As closely guarded as the formula for
Coke.
68
69. Link Spam
Three kinds of Web pages from a
spammer’s point of view:
1. Own pages.
• Completely controlled by spammer.
2. Accessible pages.
• E.g., Web-log comment pages: spammer can
post links to his pages.
3. Inaccessible pages.
69
70. Spam Farms – (1)
Spammer’s goal:
Maximize the PageRank of target page t.
Technique:
1. Get as many links from accessible pages
as possible to target page t.
2. Construct “link farm” to get PageRank
multiplier effect.
70
71. Spam Farms – (2)
[Figure: the Web divided into inaccessible pages, accessible pages, and the
spammer’s own pages; accessible pages link to target page t, and t
exchanges links with “farm” pages 1, 2, …, M.]
Goal: boost PageRank of page t.
One of the most common and effective
organizations for a spam farm.
71
72. Analysis – (1)
[Figure: the same spam-farm structure as the previous slide.]
Suppose the rank contributed by accessible pages = x,
and the PageRank of target page t = y.
Taxation rate = 1-β.
Rank of each “farm” page = βy/M + (1-β)/N, where
βy/M is its share of the rank of t and (1-β)/N is its
share of the “tax” (N = size of the Web).
72
73. Analysis – (2)
[Figure: the same spam-farm structure as before.]
y = x + βM[βy/M + (1-β)/N] + (1-β)/N
The bracketed term is the PageRank of each “farm” page;
the final (1-β)/N is t’s own tax share, which is very small
and can be ignored.
y = x + β²y + β(1-β)M/N
y = x/(1-β²) + cM/N, where c = β/(1+β).
73
74. Analysis – (3)
[Figure: the same spam-farm structure as before.]
y = x/(1-β²) + cM/N, where c = β/(1+β).
For β = 0.85, 1/(1-β²) = 3.6.
Multiplier effect for “acquired” PageRank.
By making M large, we can make y as
large as we want.
74
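A quick numeric check of the formula (the values of x, M, and N are made up
for illustration):

beta = 0.85
amplifier = 1 / (1 - beta**2)      # ~3.6: multiplier on acquired rank x
c = beta / (1 + beta)              # ~0.46: coefficient on M/N
x, M_pages, N = 1e-3, 10_000, 1e9  # illustrative values, not from the slides
y = x * amplifier + c * M_pages / N
print(amplifier, c, y)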
75. Detecting Link-Spam
Topic-specific PageRank, with a set of
“trusted” pages as the teleport set, is
called TrustRank.
Spam Mass =
(PageRank – TrustRank)/PageRank.
High spam mass means most of your
PageRank comes from untrusted sources –
the page may be link spam.
75
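A small sketch of the spam-mass computation (the values and the 0.8
threshold are made-up illustrations):

def spam_mass(pagerank, trustrank):
    """Fraction of each page's PageRank not accounted for by TrustRank."""
    return {p: (pagerank[p] - trustrank[p]) / pagerank[p] for p in pagerank}

pr = {"t": 0.008, "u": 0.004}      # made-up PageRank values
tr = {"t": 0.0005, "u": 0.0035}    # made-up TrustRank values
print([p for p, m in spam_mass(pr, tr).items() if m > 0.8])   # ['t'] looks like spam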
76. Picking the Trusted Set
Two conflicting considerations:
Human has to inspect each seed page, so
seed set must be as small as possible.
Must ensure every “good page” gets
adequate TrustRank, so all good pages
should be reachable from the trusted set
by short paths.
76
77. Approaches to Picking the
Trusted Set
1. Pick the top k pages by PageRank.
It is almost impossible to get a spam
page to the very top of the PageRank
order.
2. Pick the home pages of universities.
Domains like .edu are controlled.
77