This document discusses the PageRank and HITS algorithms for ranking Web pages. It gives an overview of how PageRank computes a prestige score for each page through link analysis, and describes its strength (it is hard to spam) and its weakness (it does not consider relevance to the query topic). It also explains how HITS computes authority and hub scores for pages from their in-links and out-links, and how authorities and hubs mutually reinforce each other. HITS, however, is more susceptible to spam and topic drift than PageRank.
3. PageRank 7.3 Introduction
HITS was presented by Jon Kleinberg in January 1998 at
the Ninth Annual ACM-SIAM Symposium on Discrete
Algorithms.
PageRank was presented by Sergey Brin and Larry Page
at the Seventh International World Wide Web Conference
(WWW7) in April 1998.
Based on the algorithm, they built the search engine
Google.
4. PageRank 7.3.1 PageRank Algorithm
PageRank (PR) is a static ranking of Web pages.
PageRank is based on the measure of prestige in social
networks: the PageRank value of each page can be
regarded as its prestige.
5. PageRank 7.3.1 PageRank Algorithm
Concepts:
In-links of page i: These are the hyperlinks that point to
page i from other pages. Usually, hyperlinks from the
same site are not considered.
Out-links of page i: These are the hyperlinks that point
out to other pages from page i. Usually, links to pages of
the same site are not considered.
[Figure: in-links and out-links of page i]
6. PageRank 7.3.1 PageRank Algorithm
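PageRank treats the Web as a directed graph G = (V, E), where V is the set of pages and E is the set of hyperlinks.
PageRank score of page i:
P(i) = \sum_{(j,i) \in E} P(j) / O_j
※ O_j is the number of
out-links of page j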
7. PageRank 7.3.1 PageRank Algorithm
The PageRank score equation alone does not quite suffice.
(The surfer's moves happen at random.)
The model is therefore based on a Markov chain:
※ A_ij(1) is the probability of going
from page i to page j in one transition
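As a small illustration of these one-step probabilities, the sketch below builds the transition matrix A for a made-up three-page graph. It assumes the surfer picks each out-link of page i with equal probability 1/O_i; the graph, the function name, and the 0-based page numbering are hypothetical and not taken from the slides.

```python
def transition_matrix(n, edges):
    """Return A, where A[i][j] is the probability of moving from page i to page j in one step.

    Assumes no duplicate edges and that each out-link of a page is equally likely.
    """
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1.0 / out_degree[i]
    return A


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = transition_matrix(3, edges)
```

Every row sums to 1 here because each page has at least one out-link. A page with no out-links would leave an all-zero row, which is one reason the simple model needs to be extended.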
10. PageRank 7.3.1 PageRank Algorithm
The random surfer has two options:
1. With probability d (the damping factor), he randomly chooses an out-link to follow.
2. With probability 1-d, he jumps to a random page without following a link.
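The random-surfer model can be sketched as a simple power iteration. This is a minimal sketch under stated assumptions, not the exact formula from the slides: it uses the normalized form in which the random jump contributes (1-d)/n to every page, takes d = 0.85 as a commonly used value, and spreads the score of a page with no out-links evenly over all pages. The function name and the example graph are hypothetical.

```python
def pagerank(n, edges, d=0.85, iterations=100):
    """Power iteration for PageRank on pages 0..n-1 (sketch, assumptions in comments)."""
    out_links = [[] for _ in range(n)]
    for i, j in edges:
        out_links[i].append(j)

    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        # Option 2: with probability 1-d the surfer jumps to a random page.
        new_scores = [(1.0 - d) / n] * n
        for i in range(n):
            if out_links[i]:
                # Option 1: with probability d the surfer follows a random out-link.
                share = d * scores[i] / len(out_links[i])
                for j in out_links[i]:
                    new_scores[j] += share
            else:
                # Assumption: a page without out-links spreads its score evenly.
                for j in range(n):
                    new_scores[j] += d * scores[i] / n
        scores = new_scores
    return scores


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
scores = pagerank(3, edges)
```

In this toy graph page 2 ends up with the highest score because it receives links from both other pages, and the scores sum to 1 because of the normalization chosen above.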
12. PageRank 7.3.2 Strengths and Weaknesses
1. The main advantage of PageRank is its ability to fight spam.
Since it is not easy for a Web page owner to get in-links to
his/her page from other important pages, it is not easy
to influence PageRank.
Nevertheless, there are reported ways to influence PageRank.
Recognizing and fighting spam is an important issue in
Web search.
13. PageRank 7.3.2 Strengths and Weaknesses
2. Another major advantage of PageRank is that it is a global
measure and is query independent.
At query time, only a lookup is needed to find the value,
which is then integrated with other strategies to rank the pages.
It is thus very efficient at query time.
14. PageRank 7.3.2 Strengths and Weaknesses
1. The main criticism of PageRank is also its query-independent
nature. It cannot distinguish between pages that are
authoritative in general and pages that are authoritative on
the query topic.
15. PageRank 7.3.3 Timed PageRank and Recency Search
The Web is a dynamic environment. It changes constantly.
Quality pages in the past may not be quality pages now or
in the future.
Many outdated pages and links are not deleted. This causes
problems for Web search because such outdated pages
may still be ranked high.
Thus, search has a temporal dimension.
16. PageRank 7.3.3 Timed PageRank and Recency Search
A time-sensitive ranking algorithm called TS-Rank.
On page i, the surfer can take one of two actions:
1. With probability f(ti), he randomly chooses an out-going
link to follow.
2. With probability 1-f(ti), he jumps to a random page
without following a link.
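The two actions can be sketched by replacing the constant damping factor with a per-page value f(ti). This is only an illustration, not the published TS-Rank formula: the slides do not define f, so the code assumes that ti is the age of page i in days and that f decays exponentially with a hypothetical one-year half-life. All names and numbers below are made up for the example.

```python
import math


def follow_probability(age_days, half_life_days=365.0):
    """Hypothetical f(t): probability of following a link, halving every half_life_days."""
    return math.exp(-math.log(2.0) * age_days / half_life_days)


def time_sensitive_rank(n, edges, ages, iterations=100):
    """Random-surfer iteration where page i uses f(ages[i]) instead of a constant d."""
    out_links = [[] for _ in range(n)]
    for i, j in edges:
        out_links[i].append(j)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [0.0] * n
        for i in range(n):
            # Assumption: a page without out-links always triggers a random jump.
            f = follow_probability(ages[i]) if out_links[i] else 0.0
            # Action 2: with probability 1 - f(ti), jump to a random page.
            jump = (1.0 - f) * scores[i] / n
            for j in range(n):
                new_scores[j] += jump
            # Action 1: with probability f(ti), follow a random out-going link.
            for j in out_links[i]:
                new_scores[j] += f * scores[i] / len(out_links[i])
        scores = new_scores
    return scores


# Hypothetical graph: fresh page 0 links to page 2, stale page 1 links to page 3.
edges = [(0, 2), (1, 3), (2, 0), (3, 1)]
ages = [0.0, 3650.0, 365.0, 365.0]
ts_scores = time_sensitive_rank(4, edges, ages)
```

Under these assumptions, page 2 outranks page 3 even though each has exactly one in-link, because the link to page 2 comes from a recently updated page.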
18. HITS 7.4
Introduction
HITS Algorithm
Finding Other Eigenvectors
Relationships with Co-Citation and
Bibliographic Coupling
Strengths and Weaknesses of HITS
19. HITS 7.4 Introduction
HITS stands for Hypertext Induced Topic Search
Idea:
Given a search query, HITS first expands the list of relevant
pages returned by a search engine and then produces two
rankings of the expanded set of pages: an authority ranking
and a hub ranking.
Authority:
a page with many in-links.
A good authority is a page pointed to by many good hubs.
Hub:
a page with many out-links.
A good hub is a page that points to many good authorities.
20. HITS 7.4 Introduction
[Figure: hubs Hub1, Hub2, …, HubN, each listing links (http1, http2, http3, …), all pointing to one Authority page. A good authority is a page pointed to by many good hubs.]
21. HITS 7.4 Introduction
[Figure: one Hub page listing links (http1, http2, http3, …) that point to Authority 1, Authority 2, …, Authority N. A good hub is a page that points to many good authorities.]
Authorities and hubs have a mutual reinforcement relationship.
22. HITS 7.4.1 HITS Algorithm
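HITS uses a directed graph G = (V, E), where V is the set of pages and E is the set of links.
For each page i, compute the authority score a(i) and the hub score h(i).
The mutually reinforcing relationship of the two scores is
represented as follows:
a(i) = \sum_{(j,i) \in E} h(j)
h(i) = \sum_{(i,j) \in E} a(j)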
23. HITS 7.4.1 HITS Algorithm
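Writing them in matrix form, let L be the adjacency matrix of G, with L_ij = 1 if (i, j) ∈ E and 0 otherwise.
a = (a(1), a(2), …, a(n))^T is the column vector of authority scores.
h = (h(1), h(2), …, h(n))^T is the column vector of hub scores.
a = L^T h
h = L a
Substituting one into the other gives
a = L^T L a
h = L L^T h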
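The mutual reinforcement can be sketched as an iteration that alternates the two updates: each authority score becomes the sum of the hub scores of the pages linking to it, and each hub score becomes the sum of the authority scores of the pages it links to. This is a minimal sketch, assuming unit-length (L2) normalization after every round and a fixed number of iterations; the function names and the example graph are hypothetical.

```python
import math


def normalize(values):
    """Scale a vector to unit L2 length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else values


def hits(n, edges, iterations=50):
    """Return (authority, hub) score lists for pages 0..n-1."""
    authority = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # a(j) = sum of h(i) over all links i -> j
        new_authority = [0.0] * n
        for i, j in edges:
            new_authority[j] += hub[i]
        # h(i) = sum of a(j) over all links i -> j, using the updated authorities
        new_hub = [0.0] * n
        for i, j in edges:
            new_hub[i] += new_authority[j]
        authority = normalize(new_authority)
        hub = normalize(new_hub)
    return authority, hub


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
edges = [(0, 2), (0, 3), (1, 2)]
authority, hub = hits(4, edges)
```

Here page 2 is the best authority (two hubs point to it) and page 0 is the best hub (it points to both authorities). Pages 0 and 1 get authority score 0 because nothing links to them.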
26. HITS 7.4.2 Finding Other Eigenvectors
Besides the principal eigenvectors, the other eigenvectors can
reveal further densely linked collections of hubs and authorities.
Each such collection could potentially be relevant to the
query topic, but the collections could be well separated from
one another in the graph G for a variety of reasons.
For example,
1. The query string may represent a topic that arises as
a term in multiple communities, e.g. “classification”.
2. The query string may refer to a highly polarized issue,
involving groups that are not likely to link to one another,
e.g. “abortion”.
27. HITS 7.4.3 Relationships with Co-Citation and
Bibliographic Coupling
An authority page is like an influential research paper
(publication) which is cited by many subsequent papers.
A hub page is like a survey paper which cites many other
papers (including those influential papers).
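One way to make the analogy concrete, assuming L is the adjacency matrix with L[i][j] = 1 when page i links to page j: entry (i, j) of L^T L counts the pages that link to both i and j (co-citation), and entry (i, j) of L L^T counts the pages that both i and j link to (bibliographic coupling). The sketch below computes both for a made-up four-page graph; the function names and the data are illustrative only.

```python
def cocitation(L):
    """C[i][j] = number of pages that link to both page i and page j (entries of L^T L)."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bibliographic_coupling(L):
    """B[i][j] = number of pages that both page i and page j link to (entries of L L^T)."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical graph: pages 0 and 1 each link to pages 2 and 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
C = cocitation(L)
B = bibliographic_coupling(L)
```

In this example pages 2 and 3 are co-cited twice (C[2][3] is 2), and pages 0 and 1 share two references (B[0][1] is 2).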
28. HITS 7.4.4 Strengths and Weaknesses of HITS
The main strength of HITS is its ability to rank pages
according to the query topic, which may provide
more relevant authority and hub pages.
However, HITS has several disadvantages:
1. HITS does not have the anti-spam capability of PageRank.
2. HITS suffers from topic drift, because people put hyperlinks
on their pages for all kinds of reasons, including favors, spamming…
3. Query-time evaluation is also a major drawback.
Expanding the result set and performing the eigenvector
computation are both time-consuming operations.
outdated: out of date, no longer updated
temporal: relating to time
For a completely new page in a Web site, which has few or no in-links, we can use the average TS-Rank value of the site's existing pages, which
represents the reputation of the site.
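This fallback amounts to a plain average. A minimal sketch, assuming the TS-Rank values of the site's existing pages are already available as a list; the function name and the scores are hypothetical.

```python
def initial_ts_rank(site_scores):
    """Average TS-Rank of a site's existing pages, or None if the site has no scored pages."""
    if not site_scores:
        return None
    return sum(site_scores) / len(site_scores)


# Hypothetical TS-Rank values of three existing pages on the same site.
site_scores = [0.02, 0.04, 0.06]
new_page_score = initial_ts_rank(site_scores)
```

Returning None for a site without any scored pages is a design choice of this sketch; the slides do not say what to do in that case.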