This document surveys algorithms for ranking webpages, from early term-frequency relevance ranking to link-based algorithms such as InDegree, HITS, PageRank, SALSA, and Hilltop. Early approaches ranked pages on textual relevance alone; link-based algorithms instead treat hyperlinks as endorsements and combine link analysis with relevance to produce more broadly useful results. The document also covers topic drift, search-engine spamming techniques, and the difficulties search engines face with non-textual content.
2. • The Problem of Ranking
     • Objectives, Challenges
   • Early Assumptions & Approaches
   • Link-Based Ranking Algorithms
     • InDegree Algorithm
     • Hubs and Authorities: HITS
     • PageRank
     • SALSA
     • Hilltop
   • Search Engine Spamming
   • Problems with Non-textual Content
3. • “Cornell”
     • Did the searcher want information about the university?
     • The university’s hockey team?
     • The Lab of Ornithology run by the university?
     • Cornell College in Iowa?
     • The Nobel-Prize-winning physicist Eric Cornell?
   The same ranking of search results can’t be right for everyone.
4. • Objectives:
     • To categorize webpages
     • To find pages related to given pages
     • To find duplicated websites
     • To calculate the ‘quality’ of a web link
     • To get the most ‘relevant’ web links for a given query
     • To model human judgments indirectly
     • …
   • Challenges:
     • Searching by itself is a hard problem for computers to solve in any setting
     • The scale and complexity of the Web
     • The problems of synonymy and polysemy
     • The dynamic, constantly changing nature of Web content
     • …
5. • Back in the 1990s, web search was based purely on the number of occurrences of a word in a document.
   • Ranking depended only on the relevance of a document to the query. Simply retrieving the relevant documents wasn’t sufficient, since the number of relevant documents could run into the millions.
6. • Links are assumed to be endorsements, though not every link is one:
     • Disagreement (a page may link to content it criticizes)
     • Self-citation
     • Links to an already popular document
   • Hyperlinks carry information about human judgments of a site.
   • The more incoming links a site has, the more highly it is judged.
   • The Web is not a random network.
   – Bray, Tim. “Measuring the Web.” Computer Networks and ISDN Systems 28.7 (1996): 993–1005.
   – Marchiori, Massimo. “The Quest for Correct Information on the Web: Hyper Search Engines.” Computer Networks and ISDN Systems 29.8 (1997): 1225–1235.
7. • Hyperlinks are not placed at random; they provide valuable information for:
     • Link-based ranking
     • Structure analysis
     • Detection of communities
     • Spam detection
     • …
9. • This approach can be seen as the basis of every link-analysis ranking algorithm.
   • The link-recommendation assumption: by linking to another page, the author recommends it.
     • So a page with many incoming links has been highly recommended.
   • Ranking is based on in-link counts alone; incoming links are not weighted by the authority of the page they come from.
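A minimal sketch of the InDegree heuristic (the edge-list representation and function name are illustrative, not from the original):

```python
from collections import defaultdict

def indegree_rank(edges):
    """Rank pages by raw in-link count, the InDegree heuristic:
    every incoming link counts as one unweighted recommendation."""
    in_links = defaultdict(int)
    for src, dst in edges:          # each (source, target) hyperlink
        in_links[dst] += 1
    return sorted(in_links, key=in_links.get, reverse=True)

# Page "c" has two incoming links, "b" has one, so "c" ranks first.
print(indegree_rank([("a", "c"), ("b", "c"), ("a", "b")]))  # ['c', 'b']
```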
12. • The basic idea is that relevant pages (“authorities”) are linked to by many other pages (“hubs”).
    • The algorithm later became part of the Ask search engine.
    Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5):604–632, 1999. A preliminary version appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, January 1998.
13. • HITS was designed by studying how humans approach a search task, rather than how a machine matches a query against a collection of documents and returns the hits.
    • For example, for the query “top automobile makers in the world”, the best answers are the pages that well-regarded hub pages point to, not necessarily the pages containing those exact words.
14. • Rules:
      • A good hub points to many good authorities.
      • A good authority is pointed to by many good hubs.
      • Authorities and hubs have a mutually reinforcing relationship.
16. • Objective: build a query-focused subgraph S_q such that
      • (i) S_q is relatively small,
      • (ii) S_q is rich in relevant pages,
      • (iii) S_q contains most (or many) of the strongest authorities.
    • Solution:
      • Generate a root set R_q from a text-based search engine.
      • Expand the root set with pages that link to, or are linked from, root-set pages (a sketch of this step follows).
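A sketch of the expansion step, assuming hypothetical helpers for the text search engine and the link database (`search`, `out_links`, and `in_links` are placeholders, and the size caps follow Kleinberg’s suggested defaults):

```python
def build_base_set(query, search, out_links, in_links, root_size=200, max_in=50):
    """Build the query-focused subgraph S_q for HITS.

    search(query)    -> ranked list of pages (a text-based engine)
    out_links(page)  -> pages this page links to
    in_links(page)   -> pages linking to this page
    """
    root = set(search(query)[:root_size])       # root set R_q: top text-search hits
    base = set(root)
    for page in root:
        base.update(out_links(page))            # everything R_q points to
        base.update(in_links(page)[:max_in])    # cap in-links so S_q stays small
    return base                                 # base set S_q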
17.
18. • Let the authority score of page i be x(i) and the hub score of page i be y(i).
    • Mutually reinforcing relationship, over the edge set E of the subgraph:
      • I step: x(i) = Σ_{(j,i) ∈ E} y(j) (a page’s authority is the sum of the hub scores of the pages linking to it)
      • O step: y(i) = Σ_{(i,j) ∈ E} x(j) (a page’s hub score is the sum of the authority scores of the pages it links to)
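A compact power-iteration sketch of these two steps (the function name, fixed iteration count, and L2 normalization are illustrative choices):

```python
def hits(edges, iterations=50):
    """Run the I and O steps repeatedly, normalizing after each pass
    so the scores converge instead of growing without bound.

    edges: list of (source, target) hyperlinks within the base set S_q.
    Returns (authority, hub) score dictionaries.
    """
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # I step: authority = sum of hub scores of pages linking in
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # O step: hub = sum of authority scores of pages linked to
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        # Normalize both score vectors to unit length
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub
```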
25. Drawbacks of HITS:
    1. The neighborhood graph must be built “on the fly” for each query.
    2. It suffers from topic drift: the expanded base set can be taken over by a broader or neighboring topic.
    3. It cannot detect advertisement links, which are not genuine endorsements.
    4. It can easily be spammed: an author controls the out-links, and thus the hub score, of their own pages.
    5. Query-time evaluation is slow.
27. • Proposed by Sergey Brin and Lawrence Page.
    • Uses a recursive scheme similar to Kleinberg’s HITS algorithm.
    • But PageRank produces a ranking that is independent of the user’s query.
    Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. 7th International World Wide Web Conference, pages 107–117, 1998.
28. • A page is important if it is pointed to by other important pages.
29. • The PageRank of page p_i is given by:
      PR(p_i) = Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)
    • where M(p_i) is the set of pages linking to p_i,
    • and L(p_j) is the number of outbound links on page p_j.
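A tiny worked instance of this formula (the four-link graph is illustrative, not from the original): take pages A, B, C with links A→B, A→C, B→C, and C→A. Then

  PR(A) = PR(C)/1,  PR(B) = PR(A)/2,  PR(C) = PR(A)/2 + PR(B)/1

and, normalizing the scores to sum to 1, the fixed point is PR(A) = 0.4, PR(B) = 0.2, PR(C) = 0.4: A and C reinforce each other through the cycle, while B receives only half of A’s rank.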
32. • The algorithm is robust against spam,
      • since it is not easy for a webpage owner to obtain in-links to their page from other important pages.
    • PageRank is a global measure and is query-independent.
33. • It favors older pages,
      • since new ones will not yet have many in-links.
    • PageRank can be inflated through “link farms”,
      • although search engines actively try to detect such schemes at indexing time.
34. • Rank sinks: occur when pages in a network get caught in infinite link cycles that accumulate rank.
    • Spider traps: occur when a group of pages has no links pointing outside the group.
    • Dangling links: occur when a page contains a hyperlink pointing to a page with no outgoing links.
    • Dead ends: pages with no outgoing links.
36. • Damping factor d: with probability 1 − d the random surfer “teleports” to a uniformly chosen page instead of following a link, which escapes the sinks and traps of slide 34. The damped formula (a code sketch follows) is:
      PR(p_i) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)
      • where N is the total number of pages
      • typically d ≈ 0.85
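A minimal sketch of this damped update (the function name, the fixed iteration count, and the uniform redistribution of dangling-page rank, one common fix for the dead ends of slide 34, are assumptions rather than part of the original formulation):

```python
def pagerank(edges, d=0.85, iterations=50):
    """Power-iterate the damped PageRank formula over an edge list.

    edges: list of (source, target) hyperlinks.
    Rank held by dangling pages (dead ends) is spread uniformly so the
    scores keep summing to 1.
    """
    nodes = {n for edge in edges for n in edge}
    N = len(nodes)
    out_count = {n: 0 for n in nodes}
    for src, _ in edges:
        out_count[src] += 1                      # L(p_j): outbound link counts
    pr = {n: 1.0 / N for n in nodes}
    for _ in range(iterations):
        dangling = sum(pr[n] for n in nodes if out_count[n] == 0)
        new = {n: (1 - d) / N + d * dangling / N for n in nodes}
        for src, dst in edges:
            new[dst] += d * pr[src] / out_count[src]
        pr = new
    return pr

# The cycle a -> b -> c -> a yields equal ranks of 1/3 each.
print(pagerank([("a", "b"), ("b", "c"), ("c", "a")]))
```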
37. PageRank vs. HITS:

    PageRank                                   | HITS
    -------------------------------------------|---------------------------------------------
    Computed for all webpages, stored prior    | Performed on the subgraph generated by
    to the query                               | each query
    Computes authorities only                  | Computes both authorities and hubs
    Fast to compute                            | Easy to compute, but real-time execution
                                               | is hard
    No need for additional normalization       | Normalization is needed
38. Criteria       | HITS                             | PageRank
    ---------------|----------------------------------|----------------------------------
    Complexity     | O(kN²)                           | O(N)
    Result quality | Lower than the PageRank          | Medium
                   | algorithm                        |
    Relevancy      | Higher, since the algorithm uses | Lower, since the algorithm ranks
                   | hyperlinks to give good results  | pages at indexing time,
                   | and also considers page content  | independently of the query
    Neighborhood   | Applied to the local             | Applied to the entire Web
                   | neighborhood of pages            |
                   | surrounding the results of a     |
                   | query                            |

    Grover, Nidhi, and Ritika Wason. “Comparative Analysis of PageRank and HITS Algorithms.” International Journal of Engineering Research and Technology 1.8 (October 2012). ESRSA Publications.
39. • Keyword stuffing: overloading a website with relevant keywords.
    • Text hiding: placing relevant content on the website that only search engines can see.
    • Doorway pages: pages heavily optimized for certain keywords whose only purpose is to redirect to the real website.
    • Link farms: websites optimized for certain keywords that contain little more than a huge number of links to other websites.
40. • Flash: rarely processed by search engines.
    • Java applets: normally not processed.
    • Videos and images: not directly processable by search engines.
    • Other rich-media formats (e.g., Silverlight): typically not processed by search engines.