RANKING IN WEBSERVICES FOR CONTENT SEARCH OPTIMIZED
IN SPIDER CRAWLING COMB PAGE RANKING USING MACHINE
LEARNING
BY,
NAME : S.THARABAI AMIE, M.Tech - CSE
PAGE RANKING USING MACHINE LEARNING
ABSTRACT
Google is often regarded as the first commercially successful search engine, and its ranking was
originally based on the PageRank algorithm. Though PageRank has historically been thought to
perform quite well, there has been little academic evidence to support this claim. Basic page
ranking is driven by simple signals such as the number of ‘hits’, font formats, colours and font
sizes, which only loosely measure the quality of the content; in this form, ranking is little more
than keyword matching. Because this initial form of page ranking goes no further than these
simple measures, it allows many incorrect, spamming and malicious sites to be listed for a search.
Users rely on search engines not only to return pages related to their search query, but also to
separate the good from the bad, and to order results so that the best pages are suggested first.

The original PageRank has been analysed in two main variants, Query-Dependent Ranking and
Topic-Sensitive Ranking. The basic idea behind PageRank is simple: a link from one Web page to
another is an endorsement, and the numbers of outgoing links and back links create a crawling-like
traversal of the Web; hence it is called a spider-crawling Web search algorithm. Within page
ranking using machine learning, static ranking takes the highest priority. It relies on static measures
that contribute to accuracy, using a set of features such as PageRank percentage, popularity of Web
pages, anchor text, page values and domain names. Static ranking orders the URLs in the search
engine’s index, which is traversed from high-quality to low-quality rank. RankNet is another
machine learning algorithm for page ranking; it uses a regression-style algorithm in which a set of
feature vectors is used to rank the pages, and the learned mapping over these vectors ranks the
Web pages.

Page ranking with machine learning is applied in areas such as information retrieval,
recommendation systems, drug discovery and bioinformatics. The last two areas use content search
with cognitive analysis; the former two use instance ranking, label ranking, subset ranking and
rank aggregation. This report focuses on static ranking, in which the output URLs are obtained
from a study of 20,000 pages; 150 websites are assigned aggregate values of page factors, and the
output ranking is plotted on a decision tree.
1. INTRODUCTION
There are two giant umbrella categories within Machine Learning which are used for
Ranking web pages. They are supervised learning and supervised learning. Briefly
unsupervised learning is like data mining – it looks through some data and see what you can
pull out of it. Actually all our websites contents are opened for search engine’s search. From
the qualified contents of the page it compares lakhs of pages and extracts the most
recommended urls for the search engine’s listing. For a query based search the information is
not readily available before we perform the search. Unsupervised learning uses K – means
algorithm. Static Ranking and Machine learning algorithms are key features for the Page
Ranking. Machine learning algorithms such as Static Ranking and RankNet uses the features
based analyses to answer the search. These all unsupervised learning methods are explained
in this report and how Page Ranking is done based on Static Ranking.
1.1 KEYWORDS
Machine Learning, Static Ranking, RankNet, PageRank.
1.2 SUPERVISED LEARNING
Supervised learning starts with “training data”. Supervised learning is what we do as
children as we learn about the word ‘apple’: through this process you come to understand
what an apple is. Not every apple is exactly the same; had you only ever seen one apple in
your life, you might assume that every apple is identical. By seeing many examples, you
recognize the overall features of an apple. You’re now able to create this category, or label,
of objects in your mind.
This is called “generalization” and is a very important concept in supervised learning algorithms.
When we build certain types of ML algorithms we therefore need to be aware of this idea of
generalization. Our algorithms should be able to generalize but not over-generalize (easier said
than done!). Many supervised learning problems are “classification” problems. KNN is one type
of many different classification algorithms.
This is a great time to introduce another important aspect of ML: Its features. Features are what
you break an object down into as you’re processing it for ML. If you’re looking to determine
whether an object is an apple or an orange, for instance, you may want to look at the following
features: shape, size, color, waxiness, surface texture, taste etc. It also turns out that sometimes
an individual feature ends up not being helpful. Apples and oranges are roughly the same size,
so that feature can probably be ignored and save you some computational overhead. In this case,
size doesn’t really add new information to the mix. Knowing what features to look for is an
important skill when designing ML algorithms.
1.3 Machine Learning for Static Ranking
Since the publication of Brin and Page’s paper on PageRank, many in the Web community have
depended on PageRank for the static (query-independent) ordering of Web pages. We show that
we can significantly outperform PageRank using features that are independent of the link
structure of the Web. We gain a further boost in accuracy by using data on the frequency at which
users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these
and other static features based on anchor text and domain characteristics. The resulting model
achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for
random).
Over the past decade, the Web has grown exponentially in size. Unfortunately, this growth has
not been isolated to good-quality pages. The number of incorrect, spamming, and malicious (e.g.,
phishing) sites has also grown rapidly. The sheer number of both good and bad pages on the Web
has led to an increasing reliance on search engines for the discovery of useful information. Users
rely on search engines not only to return pages related to their search query, but also to separate
the good from the bad, and order results so that the best pages are suggested first. To date, most
work on Web page ranking has focused on improving the ordering of the results returned to the
user (query-dependent ranking, or dynamic ranking). However, having a good query-
independent ranking (static ranking) is also crucially important for a search engine. A good static
ranking algorithm provides numerous benefits:
• Relevance: The static rank of a page provides a general indicator of the overall quality of the
page. This is a useful input to the dynamic ranking algorithm.
• Efficiency: Typically, the search engine’s index is ordered by static rank. By traversing the
index from high-quality to low-quality pages, the dynamic ranker may abort the search when it
determines that no later page will have as high a dynamic rank as those already found. The
more accurate the static rank, the better this early-stopping ability, and hence the quicker the
search engine may respond to queries.
• Crawl Priority: The Web grows and changes as quickly as search engines can crawl it. Search
engines need a way to prioritize their crawl—to determine which pages to re-crawl, how
frequently, and how often to seek out new pages. Among other factors, the static rank of a page is
used to determine this prioritization. A better static rank thus provides the engine with a higher
quality, more up-to-date index.
Google is often regarded as the first commercially successful search engine. Their ranking was
originally based on the PageRank algorithm. Due to this (and possibly due to Google’s
promotion of PageRank to the public), PageRank is widely regarded as the best method for the
static ranking of Web pages.
Though PageRank has historically been thought to perform quite well, there has yet been little
academic evidence to support this claim. Even worse, there has recently been work showing that
PageRank may not perform any better than other simple measures on certain tasks. Upstill et al.
have found that for the task of finding home pages, the number of pages linking to a page and the
type of URL were as, or more, effective than PageRank. They found similar results for the task
of finding high quality companies. PageRank has also been used in systems for TREC’s “very
large collection” and “Web track” competitions, but with much less success than had been
expected. Finally, Amento et al found that simple features, such as the number of pages on a site,
performed as well as PageRank. Despite these, the general belief remains among many, both
academic and in the public, that PageRank is an essential factor for a good static rank. Failing
this, it is still assumed that using the link structure is crucial, in the form of the number of inlinks
or the amount of anchor text. In this paper, we show there are a number of simple url- or page-
based features that significantly outperform PageRank (for the purposes of statically ranking
Web pages) despite ignoring the structure of the Web. We combine these and other static features
using machine learning to achieve a ranking system that is significantly better than PageRank (in
pairwise agreement with human labels).
A machine learning approach for static ranking has other advantages besides the quality of the
ranking. Because the measure consists of many features, it is harder for malicious users to
manipulate it (i.e., to raise their page’s static rank to an undeserved level through questionable
techniques, also known as Web spamming). This is particularly true if the feature set is not
known.
In contrast, a single measure like PageRank can be easier to manipulate because spammers need
only concentrate on one goal: how to cause more pages to point to their page. With an algorithm
that learns, a feature that becomes unusable due to spammer manipulation will simply be reduced
or removed from the final computation of rank. This flexibility allows a ranking system to
rapidly react to new spamming techniques. A machine learning approach to static ranking is also
able to take advantage of any advances in the machine learning field. For example, recent work
on adversarial classification suggests that it may be possible to explicitly model the Web page
spammer’s (the adversary) actions, adjusting the ranking model in advance of the spammer’s
attempts to circumvent it. Another example is the elimination of outliers in constructing the
model, which helps reduce the effect that unique sites may have on the overall quality of the
static rank. By moving static ranking to a machine learning framework, we not only gain in
accuracy, but also gain in the ability to react to spammer’s actions, to rapidly add new features to
the ranking algorithm, and to leverage advances in the rapidly growing field of machine learning.
Finally, we believe there will be significant advantages to using this technique for other domains,
such as searching a local hard drive or a corporation’s intranet. These are domains where the link
structure is particularly weak (or non-existent), but there are other domain-specific features that
could be just as powerful.
For example, the author of an intranet page and his/her position in the organization (e.g., CEO,
manager, or developer) could provide significant clues as to the importance of that page. A
machine learning approach thus allows rapid development of a good static algorithm in new
domains. This paper’s contribution is a systematic study of static features, including PageRank,
for the purposes of (statically) ranking Web pages. Previous studies on PageRank typically used
subsets of the Web that are significantly smaller (e.g., the TREC VLC2 corpus, used by many,
contains only 19 million pages). Also, the performance of PageRank and other static features has
typically been evaluated in the context of a complete system for dynamic ranking, or for other
tasks such as question answering. In contrast, we explore the use of PageRank and other features
for the direct task of statically ranking Web pages. We first briefly describe the PageRank
algorithm. In Section 3 we introduce RankNet, the machine learning technique used to combine
static features into a final ranking. Section 4 describes the static features. The heart of the paper
is in Section 5, which presents our experiments and results. We conclude with a discussion of
related and future work.
2. BASIC PAGE RANKING
The basic idea behind PageRank is simple: a link from a Web page to another can be seen as an
endorsement of that page. In general, links are made by people. As such, they are indicative of
the quality of the pages to which they point – when creating a page, an author presumably
chooses to link to pages deemed to be of good quality. We can take advantage of this linkage
information to order Web pages according to their perceived quality.
Imagine a Web surfer who jumps from Web page to Web page, choosing with uniform
probability which link to follow at each step. In order to reduce the effect of dead-ends or endless
cycles the surfer will occasionally jump to a random page with some small probability α, or
when on a page with no out-links. If averaged over a sufficient number of steps, the probability
the surfer is on page j at some point in time is given by the formula:
    P(j) = (1 − α)/N + α Σ_{i ∈ B_j} P(i)/|F_i|        (1)
where F_i is the set of pages that page i links to, and B_j is the set of pages that link to page j. The
PageRank score for node j is defined as this probability: PR(j)=P(j). Because equation (1) is
recursive, it must be iteratively evaluated until P(j) converges (typically, the initial distribution
for P(j) is uniform). The intuition is, because a random surfer would end up at the page more
frequently, it is likely a better page. An alternative view for equation (1) is that each page is
assigned a quality, P(j). A page “gives” an equal share of its quality to each page it points to.
PageRank is computationally expensive. Our collection of 5 billion pages contains
approximately 370 billion links. Computing PageRank requires iterating over these billions of
links multiple times (until convergence). It requires large amounts of memory (or very smart
caching schemes that slow the computation down even further), and if spread across multiple
machines, requires significant communication between them. Much work has been done on
optimizing the PageRank computation, but it remains expensive at this scale.
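To make equation (1) concrete, the following is a minimal JavaScript sketch of PageRank computed
by power iteration over a tiny illustrative link graph; the three-page graph, the iteration count and
the variable names are assumptions for illustration only, not the production computation described
above.

/* Minimal PageRank power-iteration sketch for a tiny graph.
   outlinks[i] lists the pages that page i links to. */
var outlinks = [[1, 2], [2], [0]];   // illustrative three-page graph with no dead-ends
var N = outlinks.length;
var alpha = 0.85;                    // same damping value used in Section 5
var P = [];
for (var i = 0; i < N; i++) { P[i] = 1 / N; }   // uniform initial distribution

for (var iter = 0; iter < 50; iter++) {          // iterate until (approximately) converged
  var next = [];
  for (var j = 0; j < N; j++) { next[j] = (1 - alpha) / N; }
  for (var i = 0; i < N; i++) {
    var share = alpha * P[i] / outlinks[i].length;   // each page gives an equal share to its outlinks
    for (var k = 0; k < outlinks[i].length; k++) {
      next[outlinks[i][k]] += share;
    }
  }
  P = next;
}
console.log(P);   // P[j] approximates the PageRank score PR(j)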
3. RANK NET
Much work in machine learning has been done on the problems of classification and regression.
Let X={xi} be a collection of feature vectors (typically, a feature is any real valued number), and
Y={yi} be a collection of associated classes, where yi is the class of the object described by
feature vector xi. The classification problem is to learn a function f that maps yi=f(xi), for all i.
When yi is real-valued as well, this is called regression. Static ranking can be seen as a
regression problem. If we let xi represent features of page i, and yi be a value (say, the rank) for
each page, we could learn a regression function that mapped each page’s features to their rank.
However, this over-constrains the problem we wish to solve. All we really care about is the
order of the pages, not the actual value assigned to them. Recent work on this ranking problem
directly attempts to optimize the ordering of the objects, rather than the value assigned to them.
For these, let Z = {⟨i, j⟩} be a collection of pairs of items, where item i should be assigned a higher
value than item j. The goal of the ranking problem, then, is to learn a function f such that

    ∀ ⟨i, j⟩ ∈ Z : f(x_i) > f(x_j)
Note that, as with learning a regression function, the result of this process is a function (f) that
maps feature vectors to real values. This function can still be applied anywhere that a regression-
learned function could be applied. The only difference is the technique used to learn the function.
By directly optimizing the ordering of objects, these methods are able to learn a function that
does a better job of ranking than do regression techniques. We used RankNet, one of the
aforementioned techniques for learning ranking functions, to learn our static rank function.
RankNet is a straightforward modification to the standard neural network back-prop algorithm.
As with back-prop, RankNet attempts to minimize the value of a cost function by adjusting each
weight in the network according to the gradient of the cost function with respect to that weight.
The difference is that, while a typical neural network cost function is based on the difference
between the network output and the desired output, the RankNet cost function is based on the
difference between a pair of network outputs. That is, for each pair of feature vectors ⟨i, j⟩ in the
training set, RankNet computes the network outputs o_i and o_j. Since vector i is supposed to be
ranked higher than vector j, the larger o_j − o_i is, the larger the cost.
RankNet also allows the pairs in Z to be weighted with a confidence (posed as the probability
that the pair satisfies the ordering induced by the ranking function). In this paper, we used a
probability of one for all pairs. In the next section, we will discuss the features used in our
feature vectors, xi.
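As a concrete illustration of the pairwise idea, here is a minimal JavaScript sketch of a RankNet-style
pairwise cross-entropy cost for a single pair of network outputs o_i and o_j, assuming (as in this
report) a target probability of one for every pair; the network itself and the back-propagation update
are omitted.

/* Pairwise cross-entropy cost for one pair where item i should rank above item j. */
function pairwiseCost(o_i, o_j) {
  var diff = o_i - o_j;                       // positive when the pair is ordered correctly
  // With a target probability of 1, the cost reduces to log(1 + exp(-(o_i - o_j))).
  return Math.log(1 + Math.exp(-diff));
}

console.log(pairwiseCost(2.0, 1.0));   // small cost: the pair is ordered correctly
console.log(pairwiseCost(1.0, 2.0));   // larger cost: o_j exceeds o_i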
4. FEATURES
To apply RankNet (or other machine learning techniques) to the ranking problem, we needed to
extract a set of features from each page. We divided our feature set into four, mutually exclusive,
categories: page-level (Page), domain-level (Domain), anchor text and inlinks (Anchor), and
popularity (Popularity). We also optionally used the PageRank of a page as a feature. Below, we
describe each of these feature categories in more detail.
4.1 PageRank
We computed PageRank on a Web graph of 5 billion crawled pages (and 20 billion known URLs
linked to by these pages). This represents a significant portion of the Web, and is approximately
the same number of pages as are used by Google, Yahoo, and MSN for their search engines.
A related popularity signal, internal to search engines, is the record of which results users clicked
on. Such data was used by the search engine “Direct Hit”, and has recently been explored for
dynamic ranking purposes. An advantage of the toolbar visitation data (described in Section 5.1)
over click data is that it contains information about URL visits that are not just the result of a
search. The raw popularity is processed into a number of features, such as the number of times a
page was viewed and the number of times any page in the domain was viewed.
4.2 Anchor text and inlinks
These features are based on the information associated with links to the page in question. It
includes features such as the total amount of text in links pointing to the page (“anchor text”), the
number of unique words in that text, etc.
4.3 Page
This category consists of features which may be determined by looking at the page (and its URL)
alone. We used only eight, simple features such as the number of words in the body, the
frequency of the most common term, etc.
4.4 Domain
This category contains features that are computed as averages across all pages in the domain. For
example, the average number of outlinks on any page and the average PageRank. Many of these
features have been used by others for ranking Web pages, particularly the anchor and page
features.
As mentioned, the evaluation is typically for dynamic ranking, and we wish to evaluate the use
of them for static ranking. Also, to our knowledge, this is the first study on the use of actual page
visitation popularity for static ranking. The closest similar work is on using click-through
behavior (that is, which search engine results the users click on) to affect dynamic ranking.
Because we use a wide variety of features to come up with a static ranking, we refer to this as
fRank (for feature-based ranking). fRank uses RankNet and the set of features described in this
section to learn a ranking function for Web pages. Unless otherwise specified, fRank was trained
with all of the features.
5. EXPERIMENTS
In this section, we will demonstrate that we can outperform PageRank by applying machine
learning to a straightforward set of features. Before the results, we first discuss the data, the
performance metric, and the training method. Because PageRank is a graph-based algorithm, it is
important that it be run on as large a subset of the Web as possible; most previous studies on
PageRank used subsets of the Web that are significantly smaller (e.g. the TREC VLC2 corpus,
used by many, contains only 19 million pages). We computed PageRank using the standard value
of 0.85 for α.
5.1 Data
One feature we used is the actual popularity of a Web page, measured as the number of times
that it has been visited by users over some period of time. We have access to such data from users
who have installed the MSN toolbar and have opted to provide it to MSN. The data is aggregated
into a count, for each Web page, of the number of users who viewed that page. Though
popularity data is generally unavailable, there are two other sources for it. The first is from proxy
logs. For example, a university that requires its students to use a proxy has a record of all the
pages they have visited while on campus.
Unfortunately, proxy data is quite biased and relatively small.

In order to evaluate the quality of a static ranking, we needed a “gold standard” defining the
correct ordering for a set of pages. For
this, we employed a dataset which contains human judgments for 28000 queries. For each query,
a number of results are manually assigned a rating, from 0 to 4, by human judges. The rating is
meant to be a measure of how relevant the result is for the query, where 0 means “poor” and 4
means “excellent”. There are approximately 500k judgments in all, or an average of 18 ratings
per query.
The queries are selected by randomly choosing queries from among those issued to the MSN
search engine. The probability that a query is selected is proportional to its frequency among all
of the queries. As a result, common queries are more likely to be judged than uncommon queries.
As an example of how diverse the queries are, the first four queries in the training set are “chef
schools”, “chicagoland speedway”, “eagles fan club”, and “Turkish culture”. The documents
selected for judging are those that we expected would, on average, be reasonably relevant (for
example, the top ten documents returned by MSN’s search engine).
This provides significantly more information than randomly selecting documents on the Web, the
vast majority of which would be irrelevant to a given query. Because of this process, the judged
pages tend to be of higher quality than the average page on the Web, and tend to be pages that
will be returned for common search queries. This bias is good when evaluating the quality of
static ranking for the purposes of index ordering and returning relevant documents.
This is because the most important portion of the index to be well-ordered and relevant is the
portion that is frequently returned for search queries. Because of this bias, however, the results in
this paper are not applicable to crawl prioritization. In order to obtain experimental results on
crawl prioritization, we would need ratings on a random sample of Web pages. To convert the
data from query-dependent to query-independent, we simply removed the query, taking the
maximum over judgments for a URL that appears in more than one query. The reasoning behind
this is that a page that is relevant for some query and irrelevant for another is probably a decent
page and should have a high static rank. Because we evaluated the pages on queries that occur
frequently, our data indicates the correct index ordering, and assigns high value to pages that are
likely to be relevant to a common query.
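A minimal sketch of this query-to-static conversion, assuming the judgments are available as
simple (query, url, rating) records; the record format, URLs and variable names are illustrative
assumptions only.

/* Collapse query-dependent judgments to one query-independent label per URL
   by taking the maximum rating over every query in which the URL appears. */
var judgments = [
  {query: 'chef schools', url: 'example.com/a', rating: 3},
  {query: 'eagles fan club', url: 'example.com/a', rating: 0},
  {query: 'chef schools', url: 'example.com/b', rating: 4}
];   // illustrative records only

var staticLabel = {};
for (var i = 0; i < judgments.length; i++) {
  var j = judgments[i];
  if (!(j.url in staticLabel) || j.rating > staticLabel[j.url]) {
    staticLabel[j.url] = j.rating;
  }
}
console.log(staticLabel);   // { 'example.com/a': 3, 'example.com/b': 4 }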
We randomly assigned queries to a training, validation, or test set, such that they contained 84%,
8%, and 8% of the queries, respectively. Each set contains all of the ratings for a given query,
and no query appears in more than one set. The training set was used to train fRank. The
validation set was used to select the model that had the highest performance. The test set was
used for the final results. This data gives us a query-independent ordering of pages. The goal for
a static ranking algorithm will be to reproduce this ordering as closely as possible. In the next
section, we describe the measure we used to evaluate this.
5.2 Measure
We chose to use pairwise accuracy to evaluate the quality of a static ranking. The pairwise
accuracy is the fraction of time that the ranking algorithm and human judges agree on the
ordering of a pair of Web pages. If S(x) is the static ranking assigned to page x, and H(x) is the
human judgment of relevance for x, then consider the following sets:
    Hp = {⟨x, y⟩ : H(x) > H(y)}
and
    Sp = {⟨x, y⟩ : S(x) > S(y)}
The pairwise accuracy is the portion of Hp that is also contained in Sp:

    pairwise accuracy = |Hp ∩ Sp| / |Hp|
This measure was chosen for two reasons. First, the discrete human judgments provide only a
partial ordering over Web pages, making it difficult to apply a measure such as the Spearman
rank order correlation coefficient (in the pairwise accuracy measure, a pair of documents with the
same human judgment does not affect the score). Second, the pairwise accuracy has an intuitive
meaning: it is the fraction of pairs of documents that, when the humans claim one is better than
the other, the static rank algorithm orders them correctly.
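A minimal sketch of the pairwise-accuracy computation; the parallel arrays of human judgments
and static scores below are illustrative assumptions only.

/* Pairwise accuracy: the fraction of human-ordered pairs that the static ranking orders the same way. */
function pairwiseAccuracy(H, S) {      // H: human judgments, S: static scores (parallel arrays)
  var agree = 0, total = 0;
  for (var x = 0; x < H.length; x++) {
    for (var y = 0; y < H.length; y++) {
      if (H[x] > H[y]) {               // the pair is in Hp; ties do not affect the score
        total++;
        if (S[x] > S[y]) { agree++; }  // the pair is also in Sp
      }
    }
  }
  return agree / total;
}

console.log(pairwiseAccuracy([4, 0, 2], [0.9, 0.1, 0.5]));   // 1.0: every human-ordered pair agrees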
5.3 Method
We trained fRank (a RankNet based neural network) using the following parameters. We used a
fully connected 2 layer network. The hidden layer had 10 hidden nodes. The input weights to this
layer were all initialized to be zero. The output “layer” (just a single node) weights were
initialized using a uniform random distribution in the range [-0.1, 0.1]. We used tanh as the
transfer function from the inputs to the hidden layer, and a linear function from the hidden layer
to the output. The cost function is the pairwise cross entropy cost function. The features in the
training set were normalized to have zero mean and unit standard deviation. The same linear
transformation was then applied to the features in the validation and test sets. For training, we
presented the network with 5 million pairings of pages, where one page had a higher rating than
the other. The pairings were chosen uniformly at random (with replacement) from all possible
pairings.
When forming the pairs, we ignored the magnitude of the difference between the ratings (the
rating spread) for the two URLs. Hence, the weight for each pair was constant (one), and the
probability of a pair being selected was independent of its rating spread.
We trained the network for 30 epochs. On each epoch, the training pairs were randomly shuffled.
The initial training rate was 0.001. At each epoch, we checked the error on the training set. If the
error had increased, then we decreased the training rate, under the hypothesis that the network
had probably overshot.
The training rate at each epoch was thus set to:
    training rate = κ / (ε + 1)
where κ is the initial rate (0.001), and ε is the number of times the training set error has
increased. After each epoch, we measured the performance of the neural network on the
validation set, using 1 million pairs (chosen randomly with replacement). The network with the
highest pairwise accuracy on the validation set was selected, and then tested on the test set. We
report the pairwise accuracy on the test set, calculated using all possible pairs. These parameters
were determined and fixed before the static rank experiments in this paper. In particular, the
choice of initial training rate, number of epochs, and training rate decay function were taken
directly from Burges et al.
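A minimal sketch of this training-rate schedule; the epoch loop is schematic, and
measureTrainingError is a hypothetical placeholder standing in for the real error measurement
(the pair sampling and network updates are not shown).

/* Training-rate schedule: rate = kappa / (epsilon + 1), where epsilon counts
   how many times the training-set error has increased. */
function measureTrainingError() { return Math.random(); }   // placeholder only

var kappa = 0.001;             // initial training rate
var epsilon = 0;               // number of times the training error has gone up
var previousError = Infinity;

for (var epoch = 0; epoch < 30; epoch++) {
  var rate = kappa / (epsilon + 1);
  // ... shuffle the training pairs and update the network using this rate (omitted) ...
  var error = measureTrainingError();
  if (error > previousError) { epsilon++; }   // the network probably overshot; decay the rate
  previousError = error;
}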
Though we had the option of preprocessing any of the features before they were input to the
neural network, we refrained from doing so on most of them. The only exception was the
popularity features. As with most Web phenomena, we found that the distribution of site
popularity is Zipfian. To reduce the dynamic range, and hopefully make the feature more useful,
we presented the network with both the unpreprocessed, as well as the logarithm, of the
popularity features (As with the others, the logarithmic feature values were also normalized to
have zero mean and unit standard deviation). Applying fRank to a document is computationally
efficient, taking time that is only linear in the number of input features; it is thus within a
constant factor of other simple machine learning methods such as naive Bayes. In our
experiments, computing the fRank for all five billion Web pages was approximately 100 times
faster than computing the PageRank for the same set.
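A minimal sketch of the preprocessing described above for a popularity feature: the raw counts are
illustrative, the log(v + 1) form is an assumption to keep zero counts defined, and in practice the
validation and test sets would be normalized with the training-set mean and standard deviation.

/* Present both the raw and the log popularity counts, each normalized to zero mean and unit variance. */
function zScore(values) {
  var mean = values.reduce(function(a, b) { return a + b; }, 0) / values.length;
  var variance = values.reduce(function(a, b) { return a + (b - mean) * (b - mean); }, 0) / values.length;
  var std = Math.sqrt(variance);
  return values.map(function(v) { return (v - mean) / std; });
}

var popularity = [3, 120, 9800, 45];   // illustrative toolbar visit counts (Zipfian in practice)
var logPopularity = popularity.map(function(v) { return Math.log(v + 1); });   // compress the dynamic range

console.log(zScore(popularity));       // normalized raw counts
console.log(zScore(logPopularity));    // normalized log counts; both are presented to the network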
5.4 Results
As Table 1 shows, fRank significantly outperforms PageRank for the purposes of static ranking.
With a pairwise accuracy of 67.4%, fRank more than doubles the accuracy of PageRank (relative
to the baseline of 50%, which is the accuracy that would be achieved by a random ordering of
Web pages). Note that one of fRank’s input features is the PageRank of the page, so we would
expect it to perform no worse than PageRank. The significant increase in accuracy implies that
the other features (anchor, popularity, etc.) do in fact contain useful information regarding the
overall quality of a page.
Table 1: Basic Results
Technique Accuracy (%)
-----------------------------------------------
None (Baseline) 50.00
PageRank 56.70
fRank 67.43
------------------------------------------------
There are a number of decisions that go into the computation of PageRank, such as how to deal
with pages that have no outlinks, the choice of α, numeric precision, convergence threshold, etc.
We were able to obtain a computation of PageRank from a completely independent
implementation (provided by Marc Najork) that varied somewhat in these parameters. It
achieved a pairwise accuracy of 56.52%, nearly identical to that obtained by our implementation.
We thus concluded that the quality of the PageRank is not sensitive to these minor variations in
algorithm, nor was PageRank’s low accuracy due to problems with our implementation of it.
We also wanted to find how well each feature set performed. To answer this, for each feature set,
we trained and tested fRank using only that set of features. The results are shown in Table 2. As
can be seen, every single feature set individually outperformed PageRank on this test. Perhaps
the most interesting result is that the Page-level features had the highest performance out of all
the feature sets. This is surprising because these are features that do not depend on the overall
graph structure of the Web, nor even on what pages point to a given page. This is contrary to the
common belief that the Web graph structure is the key to finding a good static ranking of Web
pages.
Table 2: Results for individual feature sets.
Feature Set Accuracy (%)
----------------------------------------------
PageRank 56.70
Popularity 60.82
Anchor 59.09
Page 63.93
Domain 59.03
----------------------------------------------
All Features 67.43
Because we are using a two-layer neural network, the features in the learned network can interact
with each other in interesting, nonlinear ways. This means that a particular feature that appears to
have little value in isolation could actually be very important when used in combination with
other features. To measure the final contribution of a feature set, in the context of all the other
features, we performed an ablation study. That is, for each set of features, we trained a network
to contain all of the features except that set. We then compared the performance of the resulting
network to the performance of the network with all of the features. Table 3 shows the results of
this experiment, where the “decrease in accuracy” is the difference in pairwise accuracy between
the network trained with all of the features, and the network missing the given feature set.
Table 3: Ablation study.
Shown is the decrease in accuracy when we train a network that has all but the given set of
features. The last line shows the effect of removing the anchor, PageRank, and domain
features, hence a model containing no network or link-based information whatsoever.
Feature Set                    Decrease in Accuracy
------------------------------------------------
PageRank                             0.18
Popularity                           0.78
Anchor                               0.47
Page                                 5.42
Domain                               0.10
Anchor, PageRank & Domain            0.60
-------------------------------------------------
The results of the ablation study are consistent with the individual feature set study. Both show
that the most important feature set is the Page-level feature set, and the second most important is
the popularity feature set. Finally, we wished to see how the performance of fRank improved as
we added features; we wanted to find at what point adding more feature sets became relatively
useless.
Beginning with no features, we greedily added the feature set that improved performance the
most. The results are shown in Table 4. For example, the fourth line of the table shows that fRank
using the page, popularity, and anchor features outperformed any network that used the page,
popularity, and some other feature set, and that the performance of this network was 67.25%.
Table 4: fRank performance as feature sets are added. At each row, the feature set that gave the
greatest increase in accuracy was added to the list of features (i.e., we conducted a greedy search
over feature sets).
Feature Set Accuracy (%)
---------------------------------------
None 50.00
+Page 63.93
+Popularity 66.83
+Anchor 67.25
+PageRank 67.31
+Domain 67.43
Table 5: Top ten URLs for PageRank vs. fRank
PageRank fRank
----------------------------------------------------------------------------------
google.com google.com
apple.com/quicktime/download yahoo.com
amazon.com americanexpress.com
yahoo.com hp.com
microsoft.com/windows/ie target.com
apple.com/quicktime bestbuy.com
mapquest.com dell.com
ebay.com autotrader.com
mozilla.org/products/firefox dogpile.com
ftc.gov bankofamerica.com
Finally, we present a qualitative comparison of PageRank vs. fRank. In Table 5 are the top ten
URLs returned for PageRank and for fRank. PageRank’s results are heavily weighted towards
technology sites. It contains two QuickTime URLs (Apple’s video playback software), as well as
Internet Explorer and Firefox URLs (both of which are Web browsers). fRank, on the other
hand, contains more consumer-oriented sites such as American Express, Target, Dell, etc.
PageRank’s bias toward technology can be explained through two processes. First, there are
many pages with “buttons” at the bottom suggesting that the site is optimized for Internet
Explorer, or that the visitor needs QuickTime.
These generally link back to, in these examples, the Internet Explorer and QuickTime download
sites. Consequently, PageRank ranks those pages highly. Though these pages are important, they
are not as important as it may seem by looking at the link structure alone. One fix for this is to
add information about the link to the PageRank computation, such as the size of the text, whether
it was at the bottom of the page, etc.
The other bias comes from the fact that the population of Web site authors is different than the
population of Web users. Web authors tend to be technologically-oriented, and thus their linking
behavior reflects those interests. fRank, by knowing the actual visitation popularity of a site (the
popularity feature set), is able to eliminate some of that bias. It has the ability to depend more on
where actual Web users visit rather than where the Web site authors have linked.
The results confirm that fRank outperforms PageRank in pairwise accuracy. The two most
important feature sets are the page and popularity features. This is surprising, as the page features
consisted only of a few (8) simple features. Further experiments found that, of the page features,
those based on the text of the page (as opposed to the URL) performed the best. In the next
section, we explore the popularity feature in more detail.
5.5 Popularity Data
As mentioned in section 4, our popularity data came from MSN toolbar users. For privacy
reasons, we had access only to an aggregate count of, for each URL, how many times it was
visited by any toolbar user. This limited the possible features we could derive from this data.
For possible extensions, see section 6.3, future work. For each URL in our train and test sets, we
provided a feature to fRank which was how many times it had been visited by a toolbar user.
However, this feature was quite noisy and sparse, particularly for URLs with query parameters
(e.g., http://search.msn.com/results.aspx?q=machine+learning&form=QBHP).
One solution was to provide an additional feature which was the number of times any URL at the
given domain was visited by a toolbar user. Adding this feature dramatically improved the
performance of fRank. We took this one step further and used the built-in hierarchical structure
of URLs to construct many levels of backoff between the full URL and the domain. We did this
by using the set of features shown in Table 6.
Table 6: URL functions used to compute the Popularity feature set.
Function      Example
------------------------------------------------------------------------------------
Exact URL     cnn.com/2005/tech/wikipedia.html?v=mobile
No Params     cnn.com/2005/tech/wikipedia.html
Page          wikipedia.html
URL-1         cnn.com/2005/tech
URL-2         cnn.com/2005
...
Domain        cnn.com
Domain+1      cnn.com/2005
...
------------------------------------------------------------------------------------
Each URL was assigned one feature for each function shown in the table. The value of the
feature was the count of the number of times a toolbar user visited a URL, where the function
applied to that URL matches the function applied to the URL in question. For example, a user’s
visit to cnn.com/2005/sports.html would increment the Domain and Domain+1 features for the
URL cnn.com/2005/tech/wikipedia.html.
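A minimal sketch of how such backoff keys could be derived from a URL, roughly following the
functions of Table 6; the helper name and the exact set of keys produced are illustrative
assumptions only.

/* Derive popularity backoff keys for a URL (cf. Table 6). */
function backoffKeys(url) {
  var noParams = url.split('?')[0];                 // "No Params"
  var parts = noParams.split('/');
  var keys = {
    exact: url,                                     // "Exact URL"
    noParams: noParams,
    page: parts[parts.length - 1],                  // "Page"
    domain: parts[0]                                // "Domain"
  };
  if (parts.length > 2) {
    keys.url1 = parts.slice(0, parts.length - 1).join('/');   // "URL-1"
  }
  if (parts.length > 1) {
    keys.domainPlus1 = parts.slice(0, 2).join('/');           // "Domain+1"
  }
  return keys;
}

console.log(backoffKeys('cnn.com/2005/tech/wikipedia.html?v=mobile'));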
As seen in Table 7, adding the domain counts significantly improved the quality of the popularity
feature, and adding the numerous backoff functions listed in Table 6 improved the accuracy even
further.
Table 7: Effect of adding backoff to the popularity feature set
Features Accuracy (%)
------------------------------------------------------------------------------
URL count 58.15
URL and Domain counts 59.31
All backoff functions (Table 6) 60.82
-------------------------------------------------------------------------------
Backing off to subsets of the URL is one technique for dealing with the sparsity of data. It is also
informative to see how the performance of fRank depends on the amount of popularity data that
we have collected. In Figure 1 we show the performance of fRank trained with only the
popularity feature set vs. the amount of data we have for the popularity feature set. Each day, we
receive additional popularity data, and as can be seen in the plot, this increases the performance
of fRank.
Figure 1: Relation between the amount of popularity data and the performance of the popularity
feature set. Note the x-axis is a logarithmic scale (scatter plot).
The relation is logarithmic: doubling the amount of popularity data provides a constant
improvement in pairwise accuracy. In summary, we have found that the popularity features
provide a useful boost to the overall fRank accuracy. Gathering more popularity data, as well as
employing simple backoff strategies, improve this boost even further.
5.6 Summary of Results
The experiments provide a number of conclusions. First, fRank performs significantly better than
PageRank, even without any information about the Web graph. Second, the page level and
popularity features were the most significant contributors to pairwise accuracy. Third, by
collecting more popularity data, we can continue to improve fRank’s performance. The
popularity data provides two benefits to fRank. First, we see that qualitatively, fRank’s ordering
of Web pages has a more favorable bias than PageRank’s. fRank’s ordering seems to correspond
to what Web users, rather than Web page authors, prefer. Second, the popularity data is more
timely than PageRank’s link information. The toolbar provides information about which Web
pages people find interesting right now, whereas links are added to pages more slowly, as authors
find the time and interest.
6. RELATED AND FUTURE WORK
6.1 Improvements to PageRank
Since the original PageRank paper, there has been work on improving it. Much of that work
centers on speeding up and parallelizing the computation. One recognized problem with
PageRank is that of topic drift: A page about “dogs” will have high PageRank if it is linked to by
many pages that themselves have high rank, regardless of their topic. In contrast, a search engine
user looking for good pages about dogs would likely prefer to find pages that are pointed to by
many pages that are themselves about dogs. Hence, a link that is “on topic” should have higher
weight than a link that is not.
Richardson and Domingos’s Query Dependent PageRank and Haveliwala’s Topic-Sensitive
PageRank are two approaches that tackle this problem. Other variations to PageRank include
differently weighting links for inter- vs. intra-domain links, adding a backwards step to the
random surfer to simulate the “back” button on most browsers and modifying the jump
probability (α). See Langville and Meyer for a good survey of these and other modifications to
PageRank.
6.2 Other related work
PageRank is not the only link analysis algorithm used for ranking Web pages. The most well-known
alternative is HITS, which is used by the Teoma search engine. HITS produces a list of hubs
and authorities, where hubs are pages that point to many authority pages, and authorities are
pages that are pointed to by many hubs. Previous work has shown HITS to perform comparably
to PageRank.
One field of interest is that of static index pruning (see e.g., Carmel et al.). Static index pruning
methods reduce the size of the search engine’s index by removing documents that are unlikely to
be returned by a search query. The pruning is typically done based on the frequency of query
terms. Similarly, Pandey and Olston suggest crawling pages frequently if they are likely to
incorrectly appear (or not appear) as a result of a search.
Similar methods could be incorporated into the static rank (e.g., how many frequent queries
contain words found on this page). Others have investigated the effect that PageRank has on the
Web at large. They argue that pages with high PageRank are more likely to be found by Web
users, thus more likely to be linked to, and thus more likely to maintain a higher PageRank than
other pages. The same may occur for the popularity data. If we increase the ranking for popular
pages, they are more likely to be clicked on, thus further increasing their popularity. Cho et al.
argue that a more appropriate measure of Web page quality would depend on not only the current
link structure of the Web, but also on the change in that link structure. The same technique may
be applicable to popularity data: the change in popularity of a page may be more informative
than the absolute popularity.
One interesting related work is that of Ivory and Hearst. Their goal was to build a model of Web
sites that are considered high quality from the perspective of “content, structure and navigation,
visual design, functionality, interactivity, and overall experience”. They used over 100 page level
features, as well as features encompassing the performance and structure of the site. This let
them qualitatively describe the qualities of a page that make it appear attractive (e.g., rare use of
italics, at least 9 point font, ...), and (in later work) to build a system that assists novice Web page
authors in creating quality pages by evaluating it according to these features. The primary
differences between this work and ours are the goal (discovering what constitutes a good Web
page vs. ordering Web pages for the purposes of Web search), the size of the study (they used a
dataset of less than 6000 pages vs. our set of 468,000), and our comparison with PageRank.
Nevertheless, their work provides insights to additional useful static features that we could
incorporate into fRank in the future. Recent work on incorporating novel features into dynamic
ranking includes that by Joachims et al. , who investigate the use of implicit feedback from users,
in the form of which search engine results are clicked on. Craswell et al. present a method for
determining the best transformation to apply to query independent features (such as those used in
this paper) for the purposes of improving dynamic ranking. Other work, such as Boyan et al. and
Bartell et al. apply machine learning for the purposes of improving the overall relevance of a
search engine (i.e., the dynamic ranking). They do not apply their techniques to the problem of
static ranking.
6.3 Future work
There are many ways in which we would like to extend this work. First, fRank uses only a small
number of features. We believe we could achieve even more significant results with more
features. In particular the existence, or lack thereof, of certain words could prove very significant
(for instance, “under construction” probably signifies a low quality page). Other features could
include the number of images on a page, size of those images, number of layout elements (tables,
divs, and spans), use of style sheets, conforming to W3C standards (like XHTML 1.0 Strict),
background color of a page, etc.
Many pages are generated dynamically, the contents of which may depend on parameters in the
URL, the time of day, the user visiting the site, or other variables. For such pages, it may be
useful to apply techniques from prior work to form a static approximation of the page for the purposes of
extracting features. The resulting grammar describing the page could itself be a source of
additional features describing the complexity of the page, such as how many non-terminal nodes
it has, the depth of the grammar tree, etc.
fRank allows one to specify a confidence in each pairing of documents. In the future, we will
experiment with probabilities that depend on the difference in human judgments between the two
items in the pair. For example, a pair of documents where one was rated 4 and the other 0 should
have a higher confidence than a pair of documents rated 3 and 2. The experiments in this paper
are biased toward pages that have higher than average quality. Also, fRank with all of the
features can only be applied to pages that have already been crawled.
Thus, fRank is primarily useful for index ordering and improving relevance, not for directing the
crawl. We would like to investigate a machine learning approach for crawl prioritization as well.
It may be that a combination of methods is best: for example, using PageRank to select the best 5
billion of the 20 billion pages on the Web, then using fRank to order the index and affect search
relevancy. Another interesting direction for exploration is to incorporate fRank and page-level
features directly into the PageRank computation itself. Work on biasing the PageRank jump
vector, and transition matrix, have demonstrated the feasibility and advantages of such an
approach.
There is reason to believe that a direct application of this approach, using the fRank of a page for its
“relevance”, could lead to an improved overall static rank. Finally, the popularity data can be
used in other interesting ways. The general surfing and searching habits of Web users vary by
time of day. Activity in the morning, daytime, and evening are often quite different (e.g., reading
the news, solving problems, and accessing entertainment, respectively).
We can gain insight into these differences by using the popularity data, divided into segments of
the day. When a query is issued, we would then use the popularity data matching the time of
query in order to do the ranking of Web pages. We also plan to explore popularity features that
use more than just the counts of how often a page was visited.
For example, how long users tended to dwell on a page, did they leave the page by clicking a
link or by hitting the back button, etc. Fox et al. did a study that showed that features such as this
can be valuable for the purposes of dynamic ranking. Finally, the popularity data could be used
as the label rather than as a feature. Using fRank in this way to predict the popularity of a page
may be useful for the tasks of relevance, efficiency, and crawl priority. There is also significantly
more popularity data than human labeled data, potentially enabling more complex machine
learning methods, and significantly more features.
7. RANKING METHODS IN MACHINE LEARNING
7. 1 Recommendation Systems
7. 2 Information Retrieval
7. 3 Drug Discovery
Problem: there are millions of structures in a chemical library. How do we identify the most
promising ones?
7. 4 Bioinformatics
Human genetics is now at a critical juncture. The molecular methods used
successfully to identify the genes underlying rare Mendelian syndromes are failing to find
the numerous genes causing more common, familial, non-Mendelian diseases. With the
human genome sequence nearing completion, new opportunities are being presented for
unravelling the complex genetic basis of non-Mendelian disorders based on large-scale
genome-wide studies.
8. Types of Ranking Problems
➢ Instance Ranking
➢ Label Ranking
➢ Subset Ranking
➢ Rank Aggregation
8. 1 Instance Ranking
8. 2 Label Ranking
8. 3 Subset Ranking
8. 4 Rank Aggregation
9. Theory & Algorithms
1. Bipartite Ranking
2. k-partite Ranking
3. Ranking with Real-Valued Labels
4. General Instance Ranking
Applications
➢ Applications to Bioinformatics
➢ Applications to Drug Discovery
➢ Subset Ranking and Applications to Information Retrieval
9. 1 Bipartite Ranking
9. 2 k-partite Ranking
9.3 Ranking with Real-Valued Labels
9.4 General Instance Ranking
10. Page Ranking Reports
• Domain Age Report
• Title tag Report
• Keyword Density Report
• Keyword Meta tag Report
• Description Meta tag Report
• Body Text Report
• In page Link Report
• Link Popularity Report
• Outbound Link Report
• IMG ALT Attribute Report
• Server Speed
• Page Rank Report
11. What “k-nearest-neighbor” Means
I think the best way to teach the kNN algorithm is to simply define what the phrase “k-nearest-
neighbor” actually means: if the 3 (or 5, or 10, or ‘k’) nearest neighbors to the mystery point
mostly belong to one class, then the mystery point probably belongs to that class too. That is
what k-nearest-neighbor means.
Here’s the (simplified) procedure:
• Put all the data you have (including the mystery point) on a graph.
• Measure the distances between the mystery point and every other point.
• Pick a number. Three is usually good for small data sets.
• Figure out what the three closest points to the mystery point are.
• The majority class among the three closest points is the answer.
11. 1 The Code
Let’s start building this thing. There are a few fine points that will come out as we’re
implementing the algorithm, so please read the following carefully. If you skim, you’ll miss out
on another important concept!
I’ll build objects of two different classes for this algorithm: the “Node”, and a “NodeList”. A
Node represents a single data point from the set, whether it’s a pre-labeled (training) point, or an
unknown point. The NodeList manages all the nodes and also does some canvas drawing to
graph them all.
var Node = function(object) {
for (var key in object)
{
this[key] = object[key];
}
};
A more generalized kNN implementation could work with arbitrary features rather than the pre-
defined ones used here; I’ll leave that as an exercise for you!
Similarly, the NodeList constructor is simple:
var NodeList = function(k) {
this.nodes = [];
this.k = k;
};
The NodeList constructor takes the “k” from k-nearest-neighbor as its sole argument. Please fork
the JSFiddle code and experiment with different values of k. Don’t forget to try the number 1 as
well!
Not shown is a simple NodeList.prototype.add(node) function — that just takes a node and
pushes it onto the this.nodes array. At this point, we could dive right in to calculating distances,
but I’d like to take a quick diversion.
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>k-nearest-neighbor-Ajax-coding </title>
<script type="text/javascript" src="/js/lib/dummy.js"></script>
<link rel="stylesheet" type="text/css" href="/css/normalize.css">
<link rel="stylesheet" type="text/css" href="/css/result-light.css">
<style type="text/css">
#canvas {
width:400px;
height:400px;
border:1px solid black;
display:block;
margin:20px auto;
}
</style>
<script type="text/javascript">//<![CDATA[
window.onload=function(){
/*
* Expected keys in object:
* health, hits, type
*/
var Node = function(object) {
for (var key in object)
{
this[key] = object[key];
}
};
Node.prototype.measureDistances = function(hits_range_obj, health_range_obj) {
var health_range = health_range_obj.max - health_range_obj.min;
var hits_range = hits_range_obj.max - hits_range_obj.min;
for (var i in this.neighbors)
{
/* Just shortcut syntax */
var neighbor = this.neighbors[i];
var delta_health = neighbor.health - this.health;
delta_health = (delta_health ) / health_range;
var delta_hits = neighbor.hits - this.hits;
delta_hits = (delta_hits ) / hits_range;
neighbor.distance = Math.sqrt( delta_health*delta_health + delta_hits*delta_hits );
}
};
Node.prototype.sortByDistance = function() {
this.neighbors.sort(function (a, b) {
return a.distance - b.distance;
});
};
Node.prototype.guessType = function(k) {
var types = {};
for (var i in this.neighbors.slice(0, k))
{
var neighbor = this.neighbors[i];
if ( ! types[neighbor.type] )
{
types[neighbor.type] = 0;
}
types[neighbor.type] += 1;
}
var guess = {type: false, count: 0};
for (var type in types)
{
if (types[type] > guess.count)
{
guess.type = type;
guess.count = types[type];
}
}
this.guess = guess;
return types;
};
var NodeList = function(k) {
this.nodes = [];
this.k = k;
};
NodeList.prototype.add = function(node) {
this.nodes.push(node);
};
NodeList.prototype.determineUnknown = function() {
this.calculateRanges();
/*
* Loop through our nodes and look for unknown types.
*/
for (var i in this.nodes)
{
if ( ! this.nodes[i].type)
{
/*
* If the node is an unknown type, clone the nodes list and then measure distances.
*/
/* Clone nodes */
this.nodes[i].neighbors = [];
for (var j in this.nodes)
{
if ( ! this.nodes[j].type)
continue;
this.nodes[i].neighbors.push( new Node(this.nodes[j]) );
}
/* Measure distances */
this.nodes[i].measureDistances(this.hitss, this.health);
/* Sort by distance */
this.nodes[i].sortByDistance();
/* Guess type */
console.log(this.nodes[i].guessType(this.k));
}
}
};
NodeList.prototype.calculateRanges = function() {
this.hitss = {min: 1000000, max: 0};
this.health = {min: 1000000, max: 0};
for (var i in this.nodes)
{
if (this.nodes[i].health < this.health.min)
{
this.health.min = this.nodes[i].health;
}
if (this.nodes[i].health > this.health.max)
{
this.health.max = this.nodes[i].health;
}
if (this.nodes[i].hits < this.hitss.min)
{
this.hitss.min = this.nodes[i].hits;
}
if (this.nodes[i].hits > this.hitss.max)
{
this.hitss.max = this.nodes[i].hits;
}
}
};
NodeList.prototype.draw = function(canvas_id) {
var health_range = this.health.max - this.health.min;
var hitss_range = this.hitss.max - this.hitss.min;
var canvas = document.getElementById(canvas_id);
var ctx = canvas.getContext("2d");
var width = 400;
var height = 400;
ctx.clearRect(0,0,width, height);
for (var i in this.nodes)
{
ctx.save();
switch (this.nodes[i].type)
{
case 'School':
ctx.fillStyle = 'red';
break;
case 'public':
ctx.fillStyle = 'green';
break;
case 'health':
ctx.fillStyle = 'blue';
break;
default:
ctx.fillStyle = '#666666';
}
var padding = 40;
var x_shift_pct = (width - padding) / width;
var y_shift_pct = (height - padding) / height;
var x = (this.nodes[i].health - this.health.min) * (width / health_range) * x_shift_pct +
(padding / 2);
var y = (this.nodes[i].hits - this.hitss.min) * (height / hitss_range) * y_shift_pct +
(padding / 2);
y = Math.abs(y - height);
ctx.translate(x, y);
ctx.beginPath();
ctx.arc(0, 0, 5, 0, Math.PI*2, true);
ctx.fill();
ctx.closePath();
/*
* Is this an unknown node? If so, draw the radius of influence
*/
if ( ! this.nodes[i].type )
{
switch (this.nodes[i].guess.type)
{
case 'School':
ctx.strokeStyle = 'red';
break;
case 'public':
ctx.strokeStyle = 'green';
break;
case 'health':
ctx.strokeStyle = 'blue';
break;
default:
ctx.strokeStyle = '#666666';
}
var radius = this.nodes[i].neighbors[this.k - 1].distance * width;
radius *= x_shift_pct;
ctx.beginPath();
ctx.arc(0, 0, radius, 0, Math.PI*2, true);
ctx.stroke();
ctx.closePath();
}
ctx.restore();
}
};
var nodes;
var data = [
/* The first field (the report's "url string" value) is used as the 'health' feature expected by Node */
{health: 1, hits: 350, type: 'School'},
{health: 2, hits: 300, type: 'School'},
{health: 3, hits: 300, type: 'School'},
{health: 4, hits: 250, type: 'School'},
{health: 4, hits: 500, type: 'School'},
{health: 4, hits: 400, type: 'School'},
{health: 5, hits: 450, type: 'School'},
{health: 7, hits: 850, type: 'health'},
{health: 7, hits: 900, type: 'health'},
{health: 7, hits: 1200, type: 'health'},
{health: 8, hits: 1500, type: 'health'},
{health: 9, hits: 1300, type: 'health'},
{health: 8, hits: 1240, type: 'health'},
{health: 10, hits: 1700, type: 'health'},
{health: 9, hits: 1000, type: 'health'},
{health: 1, hits: 800, type: 'public'},
{health: 3, hits: 900, type: 'public'},
{health: 2, hits: 700, type: 'public'},
{health: 1, hits: 900, type: 'public'},
{health: 2, hits: 1150, type: 'public'},
{health: 1, hits: 1000, type: 'public'},
{health: 2, hits: 1200, type: 'public'},
{health: 1, hits: 1300, type: 'public'},
];
var run = function() {
nodes = new NodeList(3);
for (var i in data)
{
nodes.add( new Node(data[i]) );
}
var random_health = Math.round( Math.random() * 10 );
var random_hits = Math.round( Math.random() * 2000 );
nodes.add( new Node({health: random_health, hits: random_hits, type: false}) );
nodes.determineUnknown();
nodes.draw("canvas");
};
setInterval(run, 500);
run();
}//]]>
</script>
</head>
<body text="blue">
<canvas id="canvas" width="400" height="400"></canvas>
<center><p><img src="C:\Users\pradeep\Desktop\rankcanvas.png" width="500"
height="400" border="1"></p></center>
<center><h3 >
<Search>The Search String is "School of Public Health"</Search></br>
<keyword1>Keyword-School is indicated by Red dots</keyword1></br>
<keyword2>Keyword-public is indicated by Green dots</keyword2></br>
<keyword3>keyword-Health is indicated by Blue dots</keyword3></br>
<keyword4>Other matches are indicated by Black dots</keyword4></br>
<match> The circle groups the number of keyword matches</match></br> </h3>
</center>
</body>
</html>
11.2 Graphing the Keyword Search "SCHOOL OF PUBLIC HEALTH" with Canvas
Graphing our results is pretty straightforward, with a few exceptions:
• We’ll color code school = red, public = green, health = blue.
55
12. MACHINE LEARNED URLS IN EDUCATION DOMAIN STUDY
URL RANK
www.emergingedtech.com/ 276
www.bestindiansites.com 269
www.inspirenignite.com 269
http://www.oyeits.com/ 268
www.greenpageclassifieds.com/ 265
www.advertisefree.in 264
www.linkout.in 264
www.jec.in 263
www.collegesintamilnadu.com 263
www.sheppardsoftware.com 263
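Conceptually, a static ranker only needs to order URLs by their learned score before the index is traversed from high quality to low quality. Below is a minimal sketch of that ordering step, using a few rows from the table above as illustrative input; the orderByStaticRank helper is hypothetical and not part of the report's code.
// Hedged sketch: ordering URLs by a machine-learned static score, highest first.
var rankedUrls = [
  { url: 'www.emergingedtech.com/', score: 276 },
  { url: 'www.bestindiansites.com', score: 269 },
  { url: 'www.inspirenignite.com',  score: 269 }
];
function orderByStaticRank(pages) {
  // Sort a copy so the original list is left untouched
  return pages.slice().sort(function (a, b) {
    return b.score - a.score; // descending: high-quality pages first
  });
}
orderByStaticRank(rankedUrls).forEach(function (page, i) {
  console.log((i + 1) + '. ' + page.url + ' (' + page.score + ')');
});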
56
12.1 DECISION TREE INDUCTION
Table of Contents                  Page Values
1. Analyzed pages                  4.25
2. Page Optimization               3.4
3. URL Optimization                3
4. Website Analytics               3
5. Ranking Monitor                 2.7
6. Keywords                        2.7
7. Link Manager                    2.2
8. Search Engine Compatibility     2.1
9. Website Audit                   1.9
10. Link Building                  1.86
11. Website Submission             1.8
12. Site Optimization              1.7
13. Client reports                 1.7
14. Social Optimization            1.4
15. Inbox Monitor                  0.8
16. Competitors spy                0.8
57
12.2 DECISION TREE BASED ON PAGE VALUES
[Figure: decision tree induced from the page values listed in Section 12.1]
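The report does not list the induced tree's split points, but the idea can be sketched as a chain of threshold tests on an aggregate page value. The aggregation and the thresholds (3.0, 2.0, 1.0) and bucket labels below are illustrative assumptions, not values taken from the study.
// Hedged sketch of decision-tree-style bucketing on an aggregate page value.
function aggregatePageValue(factorScores) {
  // factorScores: e.g. { analyzedPages: 4.25, pageOptimization: 3.4, ... }
  var total = 0;
  for (var key in factorScores) {
    total += factorScores[key];
  }
  return total / Object.keys(factorScores).length; // simple average of the factor values
}
function rankBucket(pageValue) {
  // Thresholds are assumptions for illustration only
  if (pageValue >= 3.0) return 'high rank';
  if (pageValue >= 2.0) return 'medium rank';
  if (pageValue >= 1.0) return 'low rank';
  return 'not recommended';
}
var example = { analyzedPages: 4.25, pageOptimization: 3.4, urlOptimization: 3 };
console.log(rankBucket(aggregatePageValue(example))); // "high rank"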
58
13. LITERATURE REVIEW
• Al-Masri & Mahmoud proposed a solution by introducing the Web Service Relevancy
Function (WsRF), which is used to measure the relevancy ranking of a specific Web service
using parameters and the preferences of the requester.
• Zheng et al. proposed a Web service recommender system (WSRec), which incorporates
user-contributed Web service information gathering with a hybrid collaborative filtering
algorithm.
14. CONCLUSION
A good static ranking is an important component for today’s search engines and information
retrieval systems. We have demonstrated that PageRank does not provide a very good static
ranking; there are many simple features that individually outperform PageRank. By combining
many static features, fRank achieves a ranking that has a significantly higher pairwise accuracy
than PageRank alone. A qualitative evaluation of the top documents shows that fRank is less
technology-biased than PageRank; by using popularity data, it is biased toward pages that Web
users, rather than Web authors, visit.
The machine learning component of fRank gives it the additional benefit of being more robust
against spammers, and allows it to leverage further developments in the machine learning
community in areas such as adversarial classification. We have only begun to explore the
options, and believe that significant strides can be made in the area of static ranking by further
experimentation with additional features, other machine learning techniques, and additional
sources of data.
59
15. ACKNOWLEDGEMENT
Thank you to Marc Najork for providing us with additional PageRank computations and to Timo
Burkard for assistance with the popularity data. Many thanks to Chris Burges for providing code
and significant support in training and using RankNet. Also, we thank Susan Dumais and Nick
Craswell for their edits and suggestions.
60
16. BIBLIOGRAPHY
➢ http://drona.csa.iisc.ernet.in/~shivani/Events/SDM-10-Tutorial/sdm10-tutorial.pdf
➢ www.burakkanber.com/blog
➢ http://www.stat.uchicago.edu/~lekheng/meetings/mathofranking/slides/agarwal.pdf
APPENDIX A
1. DATA SET
61
Table of Contents                  Page Values
1. Analyzed pages                  4.25
2. Page Optimization               3.4
3. URL Optimization                3
4. Website Analytics               3
5. Ranking Monitor                 2.7
6. Keywords                        2.7
7. Link Manager                    2.2
8. Search Engine Compatibility     2.1
9. Website Audit                   1.9
10. Link Building                  1.86
11. Website Submission             1.8
12. Site Optimization              1.7
13. Client reports                 1.7
14. Social Optimization            1.4
15. Inbox Monitor                  0.8
16. Competitors spy                0.8
62
More Related Content

What's hot

Search engine rampage
Search engine rampageSearch engine rampage
Search engine rampage
Confidential
 
Search engine optimization
Search engine optimizationSearch engine optimization
Search engine optimization
muddywater121
 

What's hot (20)

SEO
SEOSEO
SEO
 
SEO - Emarketing by Surya Mishra
SEO - Emarketing by Surya MishraSEO - Emarketing by Surya Mishra
SEO - Emarketing by Surya Mishra
 
Search Marketing
Search MarketingSearch Marketing
Search Marketing
 
Seo
SeoSeo
Seo
 
Search engine rampage
Search engine rampageSearch engine rampage
Search engine rampage
 
Search engine optimization
Search engine optimizationSearch engine optimization
Search engine optimization
 
Search Engine Manifesto
Search Engine  ManifestoSearch Engine  Manifesto
Search Engine Manifesto
 
Seo Manual
Seo ManualSeo Manual
Seo Manual
 
Emarketing
EmarketingEmarketing
Emarketing
 
Serp analysis guide by letruongan.com
Serp analysis guide by letruongan.comSerp analysis guide by letruongan.com
Serp analysis guide by letruongan.com
 
Emarketing [repaired]
Emarketing [repaired]Emarketing [repaired]
Emarketing [repaired]
 
Search Engine Optimization - Optimize Organic Search
Search Engine Optimization - Optimize Organic SearchSearch Engine Optimization - Optimize Organic Search
Search Engine Optimization - Optimize Organic Search
 
Seo & Internet Marketing
Seo & Internet MarketingSeo & Internet Marketing
Seo & Internet Marketing
 
Emarketing
EmarketingEmarketing
Emarketing
 
Seo
SeoSeo
Seo
 
Seo ppt
Seo pptSeo ppt
Seo ppt
 
prestiva_SEO_knowledge
prestiva_SEO_knowledgeprestiva_SEO_knowledge
prestiva_SEO_knowledge
 
Seo Siloing
Seo SiloingSeo Siloing
Seo Siloing
 
Search Engine Manifesto
Search Engine ManifestoSearch Engine Manifesto
Search Engine Manifesto
 
Search engine manifesto
Search engine manifestoSearch engine manifesto
Search engine manifesto
 

Similar to Page Ranking using Decision tree induction

EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUESEXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
Journal For Research
 

Similar to Page Ranking using Decision tree induction (20)

Search Engine Optimization.pptx
Search Engine Optimization.pptxSearch Engine Optimization.pptx
Search Engine Optimization.pptx
 
how to learn Search Engine Optimization for free
how to learn Search Engine Optimization for freehow to learn Search Engine Optimization for free
how to learn Search Engine Optimization for free
 
Google's page rank; a decision support system in itself.
Google's page rank; a decision support system in itself.Google's page rank; a decision support system in itself.
Google's page rank; a decision support system in itself.
 
Comparative study of different ranking algorithms adopted by search engine
Comparative study of  different ranking algorithms adopted by search engineComparative study of  different ranking algorithms adopted by search engine
Comparative study of different ranking algorithms adopted by search engine
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Introduction of Search Engine & working process.pdf
Introduction of Search Engine & working process.pdfIntroduction of Search Engine & working process.pdf
Introduction of Search Engine & working process.pdf
 
page ranking web crawling
page ranking web crawlingpage ranking web crawling
page ranking web crawling
 
PAGE RANKING
PAGE RANKING PAGE RANKING
PAGE RANKING
 
Search Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEOSearch Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEO
 
beginners-guide.pdf
beginners-guide.pdfbeginners-guide.pdf
beginners-guide.pdf
 
Page Rank
Page RankPage Rank
Page Rank
 
EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUESEXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
EXPLORATORY REVIEW OF SEARCH ENGINE OPTIMIZATION TECHNIQUES
 
Search engine optimization (seo) overview
Search engine optimization (seo) overviewSearch engine optimization (seo) overview
Search engine optimization (seo) overview
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
 
A Study on SEO Monitoring System Based on Corporate Website Development
A Study on SEO Monitoring System Based on Corporate Website DevelopmentA Study on SEO Monitoring System Based on Corporate Website Development
A Study on SEO Monitoring System Based on Corporate Website Development
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
 
What is seo
What is seoWhat is seo
What is seo
 
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVALCONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
 
Simple explain and Summary about SEO
Simple explain and Summary about SEOSimple explain and Summary about SEO
Simple explain and Summary about SEO
 
PageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportPageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey report
 

Recently uploaded

VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
Diya Sharma
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
imonikaupta
 

Recently uploaded (20)

Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls Dubai
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
 
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Real Escorts in Al Nahda +971524965298 Dubai Escorts Service
Real Escorts in Al Nahda +971524965298 Dubai Escorts ServiceReal Escorts in Al Nahda +971524965298 Dubai Escorts Service
Real Escorts in Al Nahda +971524965298 Dubai Escorts Service
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 

Page Ranking using Decision tree induction

  • 1. RANKING IN WEBSERVICES FOR CONTENT SEARCH OPTIMIZED IN SPIDER CRAWLING COMB PAGE RANKING USING MACHINE LEARNING BY, NAME : S.THARABAI AMIE, M.Tech - CSE 1
  • 2. PAGE RANKING USING MACHINE LEARNING ABSTRACT Google is often regarded as the first commercially successful search engine. Their ranking was originally based on the PageRank algorithm. Though PageRank has historically been thought to perform quite well, there has yet been little academic evidence to support this claim. Page Ranking basically done by number of ‘hits’ occurred, font formats, colors, size of the fonts normally measures the quality of the content. PageRank is mere the keyword search. This initial Page Ranking doesn’t perform any other than these simple measures. It causes number of incorrect, spamming and malicious sites to be listed for our search. Users rely on search engines no t only to return pages related to their search query, but also to separate good from the bad, and order results so that the best pages are suggested first. Original Page Ranking is analysed on two ways as Query Depending Ranking or Topic – Sensitive Ranking. Basic idea behind PageRank is simple: a link from a Web page to another, number of outgoing links, number of back links and it creates a crawling like search. Hence it is called Spider Crawling web search algorithm. There is a category of Page Ranking using Machine Learning, in which Static Ranking takes highest priority. It takes static measures that are played in accuracy. It measures set of features such as Page rank percentage, Popularity of web pages, Anchor Text, Page values and Domain names. Static ranking order the URLs in the search engines. It traverse the index from high – quality to low - quality rank. RankNet is another machine learning algorithm for Page Ranking. It uses regressing algorithm in which set of feature vectors used to Rank the Pages. Mapping between these vectors rank the web pages. Page Ranking using Machine Learning is done using methods such as Information Retrieval, Recommendation Systems, Drug Discovery and Bioinformatics. The last two methods uses the content search using cognitive analysis. The former two methods uses Instance Ranking, Label Ranking, Subset Ranking and Rank aggregations. This report focuses on the Static Ranking in which output urls are obtained from the study on 20000 pages. 150 websites are assigned with aggregate values of page factors. The output Ranking are plotted on the Decision Tree. 2
  • 3. 1. INTRODUCTION There are two giant umbrella categories within Machine Learning which are used for Ranking web pages. They are supervised learning and supervised learning. Briefly unsupervised learning is like data mining – it looks through some data and see what you can pull out of it. Actually all our websites contents are opened for search engine’s search. From the qualified contents of the page it compares lakhs of pages and extracts the most recommended urls for the search engine’s listing. For a query based search the information is not readily available before we perform the search. Unsupervised learning uses K – means algorithm. Static Ranking and Machine learning algorithms are key features for the Page Ranking. Machine learning algorithms such as Static Ranking and RankNet uses the features based analyses to answer the search. These all unsupervised learning methods are explained in this report and how Page Ranking is done based on Static Ranking. 1. KEYWORDS Machine Learning, Static Ranking, RankNet, PageRank. 2. SUPERVISED LEARNING Supervised learning, starts with “training data”. Supervised learning is what we do as children as we learn about the word- ‘apple’. Though this process you come to understand what an apple is. Not every apple is exactly the same. Had you only ever seen one apple in your life, you might assume that every apple is identical. You recognize the overall features of an apple. You’re now able to create this category, or label, of objects in your mind. 3
  • 4. This is called “generalization” and is a very important concept in supervised learning algorithms. When we build certain types of ML algorithms we therefore need to be aware of this idea of generalization. Our algorithms should be able to generalize but not over-generalize (easier said than done!). Many supervised learning problems are “classification” problems. KNN is one type of many different classification algorithms. This is a great time to introduce another important aspect of ML: Its features. Features are what you break an object down into as you’re processing it for ML. If you’re looking to determine whether an object is an apple or an orange, for instance, you may want to look at the following features: shape, size, color, waxiness, surface texture, taste etc. It also turns out that sometimes an individual feature ends up not being helpful. Apples and oranges are roughly the same size, so that feature can probably be ignored and save you some computational overhead. In this case, size doesn’t really add new information to the mix. Knowing what features to look for is an important skill when designing for ML algorithms. 1.3 Machine Learning for Static Ranking Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random). 4
  • 5. Over the past decade, the Web has grown exponentially in size. Unfortunately, this growth has not been isolated to good-quality pages. The number of incorrect, spamming, and malicious (e.g., phishing) sites has also grown rapidly. The sheer number of both good and bad pages on the Web has led to an increasing reliance on search engines for the discovery of useful information. Users rely on search engines not only to return pages related to their search query, but also to separate the good from the bad, and order results so that the best pages are suggested first. To date, most work on Web page ranking has focused on improving the ordering of the results returned to the user (query- dependent ranking, or dynamic ranking). However, having a good query- independent ranking (static ranking) is also crucially important for a search engine. A good static ranking algorithm provides numerous benefits: • Relevance: The static rank of a page provides a general indicator to the overall quality of the page. This is a useful input to the dynamic ranking algorithm. • Efficiency: Typically, the search engine’s index is ordered by static rank. By traversing the index from high- quality to low-quality pages, the dynamic ranker may abort the search when it determines that no later page will have as high of a dynamic rank as those already found. The more accurate the static rank, the better this early-stopping ability, and hence the quicker the search engine may respond to queries. • Crawl Priority: The Web grows and changes as quickly as search engines can crawl it. Search engines need a way to prioritize their crawl—to determine which pages to re- crawl, how frequently, and how often to seek out new pages. Among other factors, the static rank of a page is used to determine this prioritization. A better static rank thus provides the engine with a higher quality, more up- to-date index. Google is often regarded as the first commercially successful search engine. Their ranking was originally based on the PageRank algorithm. Due to this (and possibly due to Google’s promotion of PageRank to the public), PageRank is widely regarded as the best method for the static ranking of Web pages. 5
  • 6. Though PageRank has historically been thought to perform quite well, there has yet been little academic evidence to support this claim. Even worse, there has recently been work showing that PageRank may not perform any better than other simple measures on certain tasks. Upstill et al. have found that for the task of finding home pages, the number of pages linking to a page and the type of URL were as, or more, effective than PageRank. They found similar results for the task of finding high quality companies. PageRank has also been used in systems for TREC’s “very large collection” and “Web track” competitions, but with much less success than had been expected. Finally, Amento et al found that simple features, such as the number of pages on a site, performed as well as PageRank. Despite these, the general belief remains among many, both academic and in the public, that PageRank is an essential factor for a good static rank. Failing this, it is still assumed that using the link structure is crucial, in the form of the number of inlinks or the amount of anchor text. In this paper, we show there are a number of simple url- or page- based features that significantly outperform PageRank (for the purposes of statically ranking Web pages) despite ignoring the structure of the Web. We combine these and other static features using machine learning to achieve a ranking system that is significantly better than PageRank (in pairwise agreement with human labels). A machine learning approach for static ranking has other advantages besides the quality of the ranking. Because the measure consists of many features, it is harder for malicious users to manipulate it (i.e., to raise their page’s static rank to an undeserved level through questionable techniques, also known as Web spamming). This is particularly true if the feature set is not known. In contrast, a single measure like PageRank can be easier to manipulate because spammers need only concentrate on one goal: how to cause more pages to point to their page. With an algorithm that learns, a feature that becomes unusable due to spammer manipulation will simply be reduced or removed from the final computation of rank. This flexibility allows a ranking system to rapidly react to new spamming techniques. A machine learning approach to static ranking is also 6
  • 7. able to take advantage of any advances in the machine learning field. For example, recent work on adversarial classification suggests that it may be possible to explicitly model the Web page spammer’s (the adversary) actions, adjusting the ranking model in advance of the spammer’s attempts to circumvent it. Another example is the elimination of outliers in constructing the model, which helps reduce the effect that unique sites may have on the overall quality of the static rank. By moving static ranking to a machine learning framework, we not only gain in accuracy, but also gain in the ability to react to spammer’s actions, to rapidly add new features to the ranking algorithm, and to leverage advances in the rapidly growing field of machine learning. Finally, we believe there will be significant advantages to using this technique for other domains, such as searching a local hard drive or a corporation’s intranet. These are domains where the link structure is particularly weak (or non-existent), but there are other domain-specific features that could be just as powerful. For example, the author of an intranet page and his/her position in the organization (e.g., CEO, manager, or developer) could provide significant clues as to the importance of that page. A machine learning approach thus allows rapid development of a good static algorithm in new domains. This paper’s contribution is a systematic study of static features, including PageRank, for the purposes of (statically) ranking Web pages. Previous studies on PageRank typically used subsets of the Web that are significantly smaller (e.g., the TREC VLC2 corpus, used by many, contains only 19 million pages). Also, the performance of PageRank and other static features has typically been evaluated in the context of a complete system for dynamic ranking, or for other tasks such as question answering. In contrast, we explore the use of PageRank and other features for the direct task of statically ranking Web pages. We first briefly describe the PageRank algorithm. In Section 3 we introduce RankNet, the machine learning technique used to combine static features into a final ranking. Section 4 describes the static features. The heart of the paper is in Section 5, which presents our experiments and results. We conclude with a discussion of related and future work. 7
  • 8. 2. BASIC PAGE RANKING The basic idea behind PageRank is simple: a link from a Web page to another can be seen as an endorsement of that page. In general, links are made by people. As such, they are indicative of the quality of the pages to which they point – when creating a page, an author presumably chooses to link to pages deemed to be of good quality. We can take advantage of this linkage information to order Web pages according to their perceived quality. Imagine a Web surfer who jumps from Web page to Web page, choosing with uniform probability which link to follow at each step. In order to reduce the effect of dead-ends or endless cycles the surfer will occasionally jump to a random page with some small probability α, or when on a page with no out-links. If averaged over a sufficient number of steps, the probability the surfer is on page j at some point in time is given by the formula: P( j) = (1 − α ) / N + α Σ in B j P(i) / |Fi| (1) Where Fi is the set of pages that page i links to, and Bj is the set of pages that link to page j. The PageRank score for node j is defined as this probability: PR(j)=P(j). Because equation (1) is recursive, it must be iteratively evaluated until P(j) converges (typically, the initial distribution for P(j) is uniform). The intuition is, because a random surfer would end up at the page more frequently, it is likely a better page. An alternative view for equation (1) is that each page is assigned a quality, P(j). A page “gives” an equal share of its quality to each page it points to. PageRank is computationally expensive. Our collection of 5 billion pages contains approximately 370 billion links. Computing PageRank requires iterating over these billions of links multiple times (until convergence). It requires large amounts of memory (or very smart caching schemes that slow the computation down even further), and if spread across multiple machines, requires significant communication between them. Though much work has been done on optimizing the PageRank computation. 8
  • 9. 3. RANK NET Much work in machine learning has been done on the problems of classification and regression. Let X={xi} be a collection of feature vectors (typically, a feature is any real valued number), and Y={yi} be a collection of associated classes, where yi is the class of the object described by feature vector xi. The classification problem is to learn a function f that maps yi=f(xi), for all i. When yi is real-valued as well, this is called regression. Static ranking can be seen as a regression problem. If we let xi represent features of page i, and yi be a value (say, the rank) for each page, we could learn a regression function that mapped each page’s features to their rank. However, this over-constrains the problem we wish to solve. All we really care about is the order of the pages, not the actual value assigned to them. Recent work on this ranking problem directly attempts to optimize the ordering of the objects, rather than the value assigned to them. For these, let Z={<i,j>} be a collection of pairs of items, where item i should be assigned a higher value than item j. The goal of the ranking problem, then, is to learn a function f such that, ¥ i , j € Z, f ( x i ) > f ( x j ) Note that, as with learning a regression function, the result of this process is a function (f) that maps feature vectors to real values. This function can still be applied anywhere that a regression- learned function could be applied. The only difference is the technique used to learn the function. By directly optimizing the ordering of objects, these methods are able to learn a function that does a better job of ranking than do regression techniques. We used RankNet, one of the aforementioned techniques for learning ranking functions, to learn our static rank function. RankNet is a straightforward modification to the standard neural network back-prop algorithm. As with back-prop, RankNet attempts to minimize the value of a cost function by adjusting each weight in the network according to the gradient of the cost function with respect to that weight. The difference is that, while a typical neural network cost function is based on the difference 9
  • 10. between the network output and the desired output, the RankNet cost function is based on the difference between a pair of network outputs. That is, for each pair of feature vectors <i,j> in the training set, RankNet computes the network outputs oi and oj. Since vector i is supposed to be ranked higher than vector j, the larger is oj-oi, the larger the cost. RankNet also allows the pairs in Z to be weighted with a confidence (posed as the probability that the pair satisfies the ordering induced by the ranking function). In this paper, we used a probability of one for all pairs. In the next section, we will discuss the features used in our feature vectors, xi. 4. FEATURES To apply RankNet (or other machine learning techniques) to the ranking problem, we needed to extract a set of features from each page. We divided our feature set into four, mutually exclusive, categories: page-level (Page), domain-level (Domain), anchor text and inlinks (Anchor), and popularity (Popularity). We also optionally used the PageRank of a page as a feature. Below, we describe each of these feature categories in more detail. 4.1 PageRank We computed PageRank on a Web graph of 5 billion crawled pages (and 20 billion known URLs linked to by these pages). This represents a significant portion of the Web, and is approximately the same number of pages as are used by Google, Yahoo, and MSN for their search engines. Another source, internal to search engines, are records of which results their users clicked on. Such data was used by the search engine “Direct Hit”, and has recently been explored for dynamic ranking purposes. An advantage of the toolbar data over this is that it contains information about URL visits that are not just the result of a search. The raw popularity is processed into a number of features such as the number of times a page was viewed and the number of times any page in the domain was viewed. 10
  • 11. 4.2 Anchor text and inlinks These features are based on the information associated with links to the page in question. It includes features such as the total amount of text in links pointing to the page (“anchor text”), the number of unique words in that text, etc. 4.3 Page This category consists of features which may be determined by looking at the page (and its URL) alone. We used only eight, simple features such as the number of words in the body, the frequency of the most common term, etc. 4.4 Domain This category contains features that are computed as averages across all pages in the domain. For example, the average number of outlinks on any page and the average PageRank. Many of these features have been used by others for ranking Web pages, particularly the anchor and page features. As mentioned, the evaluation is typically for dynamic ranking, and we wish to evaluate the use of them for static ranking. Also, to our knowledge, this is the first study on the use of actual page visitation popularity for static ranking. The closest similar work is on using click-through behavior (that is, which search engine results the users click on) to affect dynamic ranking. Because we use a wide variety of features to come up with a static ranking, we refer to this as fRank (for feature-based ranking). fRank uses RankNet and the set of features described in this section to learn a ranking function for Web pages. Unless otherwise specified, fRank was trained with all of the features. 11
  • 12. 5. EXPERIMENTS Because PageRank is a graph-based algorithm, it is important In this section, we will demonstrate that we can out perform that it be run on as large a subset of the Web as possible. Most PageRank by applying machine learning to a straightforward set previous studies on PageRank used subsets of the Web that are of features. Before the results, we first discuss the data, the significantly smaller (e.g. the TREC VLC2 corpus, used by performance metric, and the training method. many, contains only 19 million pages) We computed PageRank using the standard value of 0.85 for α. 5.1 Data Popularity Another feature we used is the actual popularity of a Web page, measured as the number of times that it has been visited by users over some period of time. We have access to such data from sers who have installed the MSN toolbar and have opted to provide it to MSN. The data is aggregated into a count, for each Web page, of the number of users who viewed that page. Though popularity data is generally unavailable, there are two other sources for it. The first is from proxy logs. For example, a university that requires its students to use a proxy has a record of all the pages they have visited while on campus. Unfortunately, proxy data is quite biased and relatively small. In order to evaluate the quality of a static ranking, we needed a “gold standard” defining the correct ordering for a set of pages. For this, we employed a dataset which contains human judgments for 28000 queries. For each query, a number of results are manually assigned a rating, from 0 to 4, by human judges. The rating is meant to be a measure of how relevant the result is for the query, where 0 means “poor” and 4 means “excellent”. There are approximately 500k judgments in all, or an average of 18 ratings per query. 12
  • 13. The queries are selected by randomly choosing queries from among those issued to the MSN search engine. The probability that a query is selected is proportional to its frequency among all of the queries. As a result, common queries are more likely to be judged than uncommon queries. As an example of how diverse the queries are, the first four queries in the training set are “chef schools”, “chicagoland speedway”, “eagles fan club”, and “Turkish culture”. The documents selected for judging are those that we expected would, on average, be reasonably relevant (for example, the top ten documents returned by MSN’s search engine). This provides significantly more information than randomly selecting documents on the Web, the vast majority of which would be irrelevant to a given query. Because of this process, the judged pages tend to be of higher quality than the average page on the Web, and tend to be pages that will be returned for common search queries. This bias is good when evaluating the quality of static ranking for the purposes of index ordering and returning relevant documents. This is because the most important portion of the index to be well-ordered and relevant is the portion that is frequently returned for search queries. Because of this bias, however, the results in this paper are not applicable to crawl prioritization. In order to obtain experimental results on crawl prioritization, we would need ratings on a random sample of Web pages. To convert the data from query-dependent to query-independent, we simply removed the query, taking the maximum over judgments for a URL that appears in more than one query. The reasoning behind this is that a page that is relevant for some query and irrelevant for another is probably a decent page and should have a high static rank. Because we evaluated the pages on queries that occur frequently, our data indicates the correct index ordering, and assigns high value to pages that are likely to be relevant to a common query. We randomly assigned queries to a training, validation, or test set, such that they contained 84%, 8%, and 8% of the queries, respectively. Each set contains all of the ratings for a given query, and no query appears in more than one set. The training set was used to train fRank. The validation set was used to select the model that had the highest performance. The test set was 13
  • 14. used for the final results. This data gives us a query-independent ordering of pages. The goal for a static ranking algorithm will be to reproduce this ordering as closely as possible. In the next section, we describe the measure we used to evaluate this. 5.2 Measure We chose to use pairwise accuracy to evaluate the quality of a static ranking. The pairwise accuracy is the fraction of time that the ranking algorithm and human judges agree on the ordering of a pair of Web pages. If S(x) is the static ranking assigned to page x, and H(x) is the human judgment of relevance for x, then consider the following sets: H p = {x, y : H ( x ) > H ( y )} and S p = {x, y : S ( x ) > S ( y )} The pairwise accuracy is the portion of Hp that is also contained in Sp: pairwise accuracy = |Hp ∩ Sp | --------------- |Hp| This measure was chosen for two reasons. First, the discrete human judgments provide only a partial ordering over Web pages, making it difficult to apply a measure such as the Spearman rank order correlation coefficient (in the pairwise accuracy measure, a pair of documents with the same human judgment does not affect the score). Second, the pairwise accuracy has an intuitive meaning: it is the fraction of pairs of documents that, when the humans claim one is better than the other, the static rank algorithm orders them correctly. 14
  • 15. 5.3 Method We trained fRank (a RankNet based neural network) using the following parameters. We used a fully connected 2 layer network. The hidden layer had 10 hidden nodes. The input weights to this layer were all initialized to be zero. The output “layer” (just a single node) weights were initialized using a uniform random distribution in the range [-0.1, 0.1]. We used tanh as the transfer function from the inputs to the hidden layer, and a linear function from the hidden layer to the output. The cost function is the pairwise cross entropy cost function. The features in the training set were normalized to have zero mean and unit standard deviation. The same linear transformation was then applied to the features in the validation and test sets. For training, we presented the network with 5 million pairings of pages, where one page had a higher rating than the other. The pairings were chosen uniformly at random (with replacement) from all possible pairings. When forming the pairs, we ignored the magnitude of the difference between the ratings (the rating spread) for the two URLs. Hence, the weight for each pair was constant (one), and the probability of a pair being selected was independent of its rating spread. We trained the network for 30 epochs. On each epoch, the training pairs were randomly shuffled. The initial training rate was 0.001. At each epoch, we checked the error on the training set. If the error had increased, then we decreased the training rate, under the hypothesis that the network had probably overshot. 15
  • 16. The training rate at each epoch was thus set to: Training rate = κ ε +1 Where κ is the initial rate (0.001), and ε is the number of times the training set error has increased. After each epoch, we measured the performance of the neural network on the validation set, using 1 million pairs (chosen randomly with replacement). The network with the highest pairwise accuracy on the validation set was selected, and then tested on the test set. We report the pairwise accuracy on the test set, calculated using all possible pairs. These parameters were determined and fixed before the static rank experiments in this paper. In particular, the choice of initial training rate, number of epochs, and training rate decay function were taken directly from Burges et al. Though we had the option of preprocessing any of the features before they were input to the neural network, we refrained from doing so on most of them. The only exception was the popularity features. As with most Web phenomenon, we found that the distribution of site popularity is Zipfian. To reduce the dynamic range, and hopefully make the feature more useful, we presented the network with both the unpreprocessed, as well as the logarithm, of the popularity features (As with the others, the logarithmic feature values were also normalized to have zero mean and unit standard deviation). Applying fRank to a document is computationally efficient, taking time that is only linear in the number of input features; it is thus within a constant factor of other simple machine learning methods such as naive Bayes. In our experiments, computing the fRank for all five billion Web pages was approximately 100 times faster than computing the PageRank for the same set. 16
  • 17. 5.4 Results As Table 1 shows, fRank significantly outperforms PageRank for the purposes of static ranking. With a pairwise accuracy of 67.4%, fRank more than doubles the accuracy of PageRank (relative to the baseline of 50%, which is the accuracy that would be achieved by a random ordering of Web pages). Note that one of fRank’s input features is the PageRank of the page, so we would expect it to perform no worse than PageRank. The significant increase in accuracy implies that the other features (anchor, popularity, etc.) do in fact contain useful information regarding the overall quality of a page. Table 1: Basic Results Technique Accuracy (%) ----------------------------------------------- None (Baseline) 50.00 PageRank 56.70 fRank 67.43 ------------------------------------------------ There are a number of decisions that go into the computation of PageRank, such as how to deal with pages that have no outlinks, the choice of α, numeric precision, convergence threshold, etc. We were able to obtain a computation of PageRank from a completely independent implementation (provided by Marc Najork) that varied somewhat in these parameters. It achieved a pairwise accuracy of 56.52%, nearly identical to that obtained by our implementation. We thus concluded that the quality of the PageRank is not sensitive to these minor variations in algorithm, nor was PageRank’s low accuracy due to problems with our implementation of it. We also wanted to find how well each feature set performed. To answer this, for each feature set, we trained and tested fRank using only that set of features. The results are shown in Table 2. As can be seen, every single feature set individually outperformed PageRank on this test. Perhaps 17
  • 18. the most interesting result is that the Page-level features had the highest performance out of all the feature sets. This is surprising because these are features that do not depend on the overall graph structure of the Web, nor even on what pages point to a given page. This is contrary to the common belief that the Web graph structure is the key to finding a good static ranking of Web pages. 18
  • 19. Table 2: Results for individual feature sets. Feature Set Accuracy (%) ---------------------------------------------- PageRank 56.70 Popularity 60.82 Anchor 59.09 Page 63.93 Domain 59.03 ---------------------------------------------- All Features 67.43 Because we are using a two-layer neural network, the features in the learned network can interact with each other in interesting, nonlinear ways. This means that a particular feature that appears to have little value in isolation could actually be very important when used in combination with other features. To measure the final contribution of a feature set, in the context of all the other features, we performed an ablation study. That is, for each set of features, we trained a network to contain all of the features except that set. We then compared the performance of the resulting network to the performance of the network with all of the features. Table 3 shows the results of this experiment, where the “decrease in accuracy” is the difference in pairwise accuracy between the network trained with all of the features, and the network missing the given feature set. Table 3: Ablation study. Shown is the decrease in accuracy when we train a network that has all but the given set of features. The last line is shows the effect of removing the anchor, PageRank, and domain features, hence a model containing no network or link-based information whatsoever. 19
  • 20. Feature Set Decrease in Accuracy ------------------------------------------------ PageRank 0.18 Popularity 0.78 Anchor 0.47 Page 5.42 Domain 0.10 Anchor, PageRank & Domain 0.60 ------------------------------------------------- The results of the ablation study are consistent with the individual feature set study. Both show that the most important feature set is the Page-level feature set, and the second most important is the popularity feature set. Finally, we wished to see how the performance of fRank improved as we added features; we wanted to find at what point adding more feature sets became relatively useless. Beginning with no features, we greedily added the feature set that improved performance the most.The results are shown in Table 4. For example, the fourth line of the table shows that fRank using the page, popularity, and anchor features outperformed any network that used the page, popularity, and some other feature set, and that the performance of this network was 67.25%. Table 4: fRank performance as feature sets are added. At each row, the feature set that gave the greatest increase in accuracy was added to the list of features (i.e., we conducted a greedy search over feature sets). 20
  • 21. 21
  • 22. Feature Set Accuracy (%) --------------------------------------- None 50.00 +Page 63.93 +Popularity 66.83 +Anchor 67.25 +PageRank 67.31 +Domain 67.43 Table 5: Top ten URLs for PageRank vs. fRank PageRank fRank ---------------------------------------------------------------------------------- google.com google.com apple.com/quicktime/download yahoo.com amazon.com americanexpress.com yahoo.com hp.com microsoft.com/windows/ie target.com apple.com/quicktime bestbuy.com mapquest.com dell.com ebay.com autotrader.com mozilla.org/products/firefox dogpile.com ftc.gov bankofamerica.com Finally, we present a qualitative comparison of PageRank vs. fRank. In Table 5 are the top ten URLs returned for PageRank and for fRank. PageRank’s results are heavily weighted towards 22
  • 23. technology sites. It contains two QuickTime URLs (Apple’s video playback software), as well as Internet Explorer and FireFox URLs (both of which are Web browsers). fRank, on the other hand, contains more consumer-oriented sites such as American Express, Target, Dell, etc. PageRank’s bias toward technology can be explained through two processes. First, there are many pages with “buttons” at the bottom suggesting that the site is optimized for Internet Explorer, or that the visitor needs QuickTime. These generally link back to, in these examples, the Internet Explorer and QuickTime download sites. Consequently, PageRank ranks those pages highly. Though these pages are important, they are not as important as it may seem by looking at the link structure alone. One fix for this is to add information about the link to the PageRank computation, such as the size of the text, whether it was at the bottom of the page, etc. The other bias comes from the fact that the population of Web site authors is different than the population of Web users. Web authors tend to be technologically-oriented, and thus their linking behavior reflects those interests. fRank, by knowing the actual visitation popularity of a site (the popularity feature set), is able to eliminate some of that bias. It has the ability to depend more on where actual Web users visit rather than where the Web site authors have linked. The results confirm that fRank outperforms PageRank in pairwise accuracy. The two most important feature sets are the page and popularity features. This is surprising, as the page features consisted only of a few (8) simple features. Further experiments found that, of the page features, those based on the text of the page (as opposed to the URL) performed the best. In the next section, we explore the popularity feature in more detail. 23
  • 24. 5.5 Popularity Data As mentioned in section 4, our popularity data came from MSN toolbar users. For privacy reasons, we had access only to an aggregate count of, for each URL, how many times it was visited by any toolbar user. This limited the possible features we could derive from this data. For possible extensions, see section 6.3, future work. For each URL in our train and test sets, we provided a feature to fRank which was how many times it had been visited by a toolbar user. However, this feature was quite noisy and sparse, particularly for URLs with query parameters (e.g., http://search-.msn.com/results.aspx? q=machine+learning&form=QBHP). One solution was to provide an additional feature which was the number of times any URL at the given domain was visited by a toolbar user. Adding this feature dramatically improved the performance of fRank. We took this one step further and used the built-in hierarchical structure of URLs to construct many levels of backoff between the full URL and the domain. We did this by using the set of features shown in Table 6. Table 6: URL functions used to compute the Popularity feature set. Function Example ------------------------------------------------------------------------------------ Exact URL cnn.com/2005/tech/wikipedia.html?v=mobile No Params cnn.com/2005/tech/wikipedia.html Page wikipedia.html URL-1 cnn.com/2005/tech URL-2 cnn.com/2005 ... Domain cnn.com Domain+1 cnn.com/2005 24
  • 25. Each URL was assigned one feature for each function shown in the table. The value of the feature was the count of the number of times a toolbar user visited a URL, where the function applied to that URL matches the function applied to the URL in question. For example, a user's visit to cnn.com/2005/sports.html would increment the Domain and Domain+1 features for the URL cnn.com/2005/tech/wikipedia.html. As seen in Table 7, adding the domain counts significantly improved the quality of the popularity feature, and adding the numerous backoff functions listed in Table 6 improved the accuracy even further.

Table 7: Effect of adding backoff to the popularity feature set
Features                             Accuracy (%)
------------------------------------------------------------------------------
URL count                            58.15
URL and Domain counts                59.31
All backoff functions (Table 6)      60.82
-------------------------------------------------------------------------------

Backing off to subsets of the URL is one technique for dealing with the sparsity of data. It is also informative to see how the performance of fRank depends on the amount of popularity data that we have collected. In Figure 1 we show the performance of fRank trained with only the popularity feature set vs. the amount of data we have for the popularity feature set. Each day, we receive additional popularity data, and as can be seen in the plot, this increases the performance of fRank. 25
  • 26. The relation is logarithmic: doubling the amount of popularity data provides a constant improvement in pairwise accuracy. In summary, we have found that the popularity features provide a useful boost to the overall fRank accuracy. Gathering more popularity data, as well as employing simple backoff strategies, improves this boost even further. 26
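As an aside, the backoff functions of Table 6 are straightforward to compute from the URL itself. The sketch below, with helper names of our own choosing, derives the chain of backoff keys for a URL and uses them to accumulate toolbar visit counts; it mirrors the description above rather than the exact feature pipeline used for fRank (for instance, it omits the Page function).

// Derive backoff keys for a URL: the exact URL, the URL without query
// parameters, then successively shorter path prefixes down to the domain.
function urlBackoffLevels(url) {
  var noParams = url.split('?')[0];
  var parts = noParams.split('/');            // e.g. ["cnn.com", "2005", "tech", "wikipedia.html"]
  var levels = [url, noParams];
  for (var i = parts.length - 1; i >= 1; i--) {
    levels.push(parts.slice(0, i).join('/')); // URL-1, URL-2, ..., Domain as i shrinks
  }
  return levels;
}

// Each toolbar visit increments the count of every backoff key it matches.
function addVisit(counts, visitedUrl) {
  urlBackoffLevels(visitedUrl).forEach(function(key) {
    counts[key] = (counts[key] || 0) + 1;
  });
}

// A visit to cnn.com/2005/sports.html increments the counts for cnn.com/2005
// and cnn.com, which are also backoff keys of cnn.com/2005/tech/wikipedia.html.
var counts = {};
addVisit(counts, 'cnn.com/2005/sports.html');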
  • 27. 5.6 Summary of Results The experiments provide a number of conclusions. First, fRank performs significantly better than PageRank, even without any information about the Web graph. Second, the page level and popularity features were the most significant contributors to pairwise accuracy. Third, by collecting more popularity data, we can continue to improve fRank’s performance. The popularity data provides two benefits to fRank. First, we see that qualitatively, fRank’s ordering of Web pages has a more favorable bias than PageRank’s. fRank’s ordering seems to correspond to what Web users, rather than Web page authors, prefer. Second, the popularity data is more timely than PageRank’s link information. The toolbar provides information about which Web pages people find interesting right now, whereas links are added to pages more slowly, as authors find the time and interest. 6. RELATED AND FUTURE WORK 6.1 Improvements to PageRank Since the original PageRank paper, there has been work on improving it. Much of that work centers on speeding up and parallelizing the computation. One recognized problem with PageRank is that of topic drift: A page about “dogs” will have high PageRank if it is linked to by many pages that themselves have high rank, regardless of their topic. In contrast, a search engine user looking for good pages about dogs would likely prefer to find pages that are pointed to by many pages that are themselves about dogs. Hence, a link that is “on topic” should have higher weight than a link that is not. Richardson and Domingos’s Query Dependent PageRank and Haveliwala’s Topic-Sensitive PageRank are two approaches that tackle this problem. Other variations to PageRank include differently weighting links for inter- vs. intra-domain links, adding a backwards step to the random surfer to simulate the “back” button on most browsers and modifying the jump 27
  • 28. probability (α). See Langville and Meyer for a good survey of these and other modifications to PageRank.
6.2 Other related work
PageRank is not the only link analysis algorithm used for ranking Web pages. The most well-known other is HITS, which is used by the Teoma search engine. HITS produces a list of hubs and authorities, where hubs are pages that point to many authority pages, and authorities are pages that are pointed to by many hubs. Previous work has shown HITS to perform comparably to PageRank.

Figure 1: Relation between the amount of popularity data and the performance of the popularity feature set. Note the x-axis is a logarithmic scale. (scatter plot)

One field of interest is that of static index pruning (see e.g., Carmel et al.). Static index pruning methods reduce the size of the search engine's index by removing documents that are unlikely to be returned by a search query. The pruning is typically done based on the frequency of query terms. Similarly, Pandey and Olston suggest crawling pages frequently if they are likely to incorrectly appear (or not appear) as a result of a search. 28
  • 29. Similar methods could be incorporated into the static rank (e.g., how many frequent queries contain words found on this page). Others have investigated the effect that PageRank has on the Web at large. They argue that pages with high PageRank are more likely to be found by Web users, thus more likely to be linked to, and thus more likely to maintain a higher PageRank than other pages. The same may occur for the popularity data. If we increase the ranking for popular pages, they are more likely to be clicked on, thus further increasing their popularity. Cho et al. argue that a more appropriate measure of Web page quality would depend on not only the current link structure of the Web, but also on the change in that link structure. The same technique may be applicable to popularity data: the change in popularity of a page may be more informative than the absolute popularity. One interesting related work is that of Ivory and Hearst. Their goal was to build a model of Web sites that are considered high quality from the perspective of "content, structure and navigation, visual design, functionality, interactivity, and overall experience". They used over 100 page-level features, as well as features encompassing the performance and structure of the site. This let them qualitatively describe the qualities of a page that make it appear attractive (e.g., rare use of italics, at least 9 point font, ...), and (in later work) to build a system that assists novice Web page authors in creating quality pages by evaluating them according to these features. The primary differences between this work and ours are the goal (discovering what constitutes a good Web page vs. ordering Web pages for the purposes of Web search), the size of the study (they used a dataset of less than 6000 pages vs. our set of 468,000), and our comparison with PageRank. Nevertheless, their work provides insights into additional useful static features that we could incorporate into fRank in the future. Recent work on incorporating novel features into dynamic ranking includes that by Joachims et al., who investigate the use of implicit feedback from users, in the form of which search engine results are clicked on. Craswell et al. present a method for determining the best transformation to apply to query-independent features (such as those used in this paper) for the purposes of improving dynamic ranking. Other work, such as Boyan et al. and 29
  • 30. Bartell et al. apply machine learning for the purposes of improving the overall relevance of a search engine (i.e., the dynamic ranking). They do not apply their techniques to the problem of static ranking. 30
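As a rough illustration of the kind of transformation discussed by Craswell et al. above, the sketch below applies a simple saturating function to a query-independent feature (such as a popularity count) before adding it to a query-dependent relevance score. The functional form, weight, and saturation constant here are illustrative assumptions of ours, not the transformation recommended in their paper.

// Illustrative saturating transform for a query-independent feature.
// Large raw counts are compressed so a single very popular page cannot
// swamp the query-dependent part of the score.
function transformStaticFeature(x, weight, halfSat) {
  return weight * (x / (x + halfSat));   // rises quickly, then flattens toward 'weight'
}

function combinedScore(dynamicScore, popularityCount) {
  return dynamicScore + transformStaticFeature(popularityCount, 2.0, 100);  // illustrative constants
}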
  • 31. 6.3 Future work There are many ways in which we would like to extend this work. First, fRank uses only a small number of features. We believe we could achieve even more significant results with more features. In particular the existence, or lack thereof, of certain words could prove very significant (for instance, “under construction” probably signifies a low quality page). Other features could include the number of images on a page, size of those images, number of layout elements (tables, divs, and spans), use of style sheets, conforming to W3C standards (like XHTML 1.0 Strict), background color of a page, etc. Many pages are generated dynamically, the contents of which may depend on parameters in the URL, the time of day, the user visiting the site, or other variables. For such pages, it may be useful to apply the techniques found in to form a static approximation for the purposes of extracting features. The resulting grammar describing the page could itself be a source of additional features describing the complexity of the page, such as how many non-terminal nodes it has, the depth of the grammar tree, etc. fRank allows one to specify a confidence in each pairing of documents. In the future, we will experiment with probabilities that depend on the difference in human judgments between the two items in the pair. For example, a pair of documents where one was rated 4 and the other 0 should have a higher confidence than a pair of documents rated 3 and 2. The experiments in this paper are biased toward pages that have higher than average quality. Also, fRank with all of the features can only be applied to pages that have already been crawled. Thus, fRank is primarily useful for index ordering and improving relevance, not for directing the crawl. We would like to investigate a machine learning approach for crawl prioritization as well. It may be that a combination of methods is best: for example, using PageRank to select the best 5 billion of the 20 billion pages on the Web, then using fRank to order the index and affect search 31
  • 32. relevancy. Another interesting direction for exploration is to incorporate fRank and page-level features directly into the PageRank computation itself. Work on biasing the PageRank jump vector, and transition matrix, has demonstrated the feasibility and advantages of such an approach. There is reason to believe that a direct application of this idea, using the fRank of a page as its "relevance", could lead to an improved overall static rank. Finally, the popularity data can be used in other interesting ways. The general surfing and searching habits of Web users vary by time of day. Activity in the morning, daytime, and evening are often quite different (e.g., reading the news, solving problems, and accessing entertainment, respectively). We can gain insight into these differences by using the popularity data, divided into segments of the day. When a query is issued, we would then use the popularity data matching the time of query in order to do the ranking of Web pages. We also plan to explore popularity features that use more than just the counts of how often a page was visited. For example, how long users tended to dwell on a page, and whether they left the page by clicking a link or by hitting the back button. Fox et al. did a study that showed that features such as this can be valuable for the purposes of dynamic ranking. Finally, the popularity data could be used as the label rather than as a feature. Using fRank in this way to predict the popularity of a page may be useful for the tasks of relevance, efficiency, and crawl priority. There is also significantly more popularity data than human-labeled data, potentially enabling more complex machine learning methods, and significantly more features. 32
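Several of the candidate features mentioned at the start of this section (counts of images and layout elements, use of style sheets, tell-tale phrases such as "under construction") are easy to read off a parsed page. The sketch below shows one way such page-level features might be collected from a document's DOM in the browser; the function name and the particular feature set are our own illustration, not the extractor used for fRank.

// Extract a handful of simple page-level features from a parsed document.
function extractPageFeatures(doc) {
  var text = (doc.body && doc.body.textContent) || '';
  return {
    numImages:         doc.getElementsByTagName('img').length,
    numTables:         doc.getElementsByTagName('table').length,
    numDivs:           doc.getElementsByTagName('div').length,
    numSpans:          doc.getElementsByTagName('span').length,
    usesStyleSheets:   doc.querySelectorAll('link[rel="stylesheet"]').length > 0,
    underConstruction: /under construction/i.test(text)  // a phrase that may signal a low-quality page
  };
}

// In a browser, extractPageFeatures(document) returns the features for the current page.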
  • 33. 7. RANKING METHODS IN MACHINE LEARNING
7.1 Recommendation Systems
33
  • 34. 7.2 Information Retrieval
34
  • 35. 7.3 Drug Discovery
Problem: Millions of structures in a chemical library. How do we identify the most promising ones?
35
  • 36. 7.4 Bioinformatics
Human genetics is now at a critical juncture. The molecular methods used successfully to identify the genes underlying rare Mendelian syndromes are failing to find the numerous genes causing more common, familial, non-Mendelian diseases. With the human genome sequence nearing completion, new opportunities are being presented for unravelling the complex genetic basis of non-Mendelian disorders based on large-scale genome-wide studies.
36
  • 37. 8. Types of Ranking Problems
➢ Instance Ranking
➢ Label Ranking
➢ Subset Ranking
➢ Rank Aggregation
8.1 Instance Ranking
37
  • 38. 8.2 Label Ranking
8.3 Subset Ranking
(figure labels: Query 1, Query 2)
38
  • 39. 8.4 Rank Aggregation
9. Theory & Algorithms
1. Bipartite Ranking
2. k-partite Ranking
3. Ranking with Real-Valued Labels
4. General Instance Ranking
Applications
➢ Applications to Bioinformatics
➢ Applications to Drug Discovery
➢ Subset Ranking and Applications to Information Retrieval
(figure labels: Query 1, Query 2)
39
  • 40. 9.1 Bipartite Ranking
9.2 k-partite Ranking
40
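As a concrete illustration of the bipartite setting: every instance is labelled simply as positive (relevant) or negative (irrelevant), and a ranking function is commonly judged by how often it scores a positive instance above a negative one, which is the area under the ROC curve (AUC). The function below is our own illustration of that measure, not taken from the cited tutorials.

// Bipartite ranking quality: fraction of (positive, negative) pairs that the
// scores order correctly (equivalent to the AUC).
function bipartiteRankingAccuracy(positiveScores, negativeScores) {
  var correct = 0, total = 0;
  positiveScores.forEach(function(p) {
    negativeScores.forEach(function(n) {
      total++;
      if (p > n) correct++;
      else if (p === n) correct += 0.5;   // ties count as half correct
    });
  });
  return total === 0 ? 0 : correct / total;
}

bipartiteRankingAccuracy([0.9, 0.4], [0.6, 0.2]);   // 3 of 4 pairs correct = 0.75

k-partite ranking generalizes this to ordered labels with more than two levels (for example, relevance grades 0 to 4), where the pairs of interest are those whose labels differ.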
  • 41. 9.3 Ranking with Real-Valued Labels
9.4 General Instance Ranking
41
  • 42. 10. Page Ranking Reports
• Domain Age Report
• Title tag Report
• Keyword Density Report
• Keyword Meta tag Report
• Description Meta tag Report
• Body Text Report
• In-page Link Report
• Link Popularity Report
• Outbound Link Report
• IMG ALT Attribute Report
• Server Speed
• Page Rank Report

11. What "k-nearest-neighbor" Means
I think the best way to teach the kNN algorithm is to simply define what the phrase "k-nearest-neighbor" actually means: if the majority of the 3 (or 5 or 10, or 'k') nearest neighbors to the mystery point belong to one class, then the mystery point probably belongs to that class too. That's what k-nearest-neighbor means.
Here's the (simplified) procedure:
• Put all the data you have (including the mystery point) on a graph.
• Measure the distances between the mystery point and every other point.
• Pick a number. Three is usually good for small data sets.
• Figure out what the three closest points to the mystery point are.
• The majority of the three closest points is the answer.
42
  • 44. 11.1 The Code
Let's start building this thing. There are a few fine points that will come out as we're implementing the algorithm, so please read the following carefully. If you skim, you'll miss out on another important concept! I'll build objects of two different classes for this algorithm: the "Node" and the "NodeList". A Node represents a single data point from the set, whether it's a pre-labeled (training) point or an unknown point. The NodeList manages all the nodes and also does some canvas drawing to graph them all.

var Node = function(object) {
  for (var key in object) {
    this[key] = object[key];
  }
};

You could build a generalized kNN algorithm that works with arbitrary features rather than our pre-defined ones here; I'll leave that as an exercise for you! Similarly, the NodeList constructor is simple:

var NodeList = function(k) {
  this.nodes = [];
  this.k = k;
};

The NodeList constructor takes the "k" from k-nearest-neighbor as its sole argument. Please fork the JSFiddle code and experiment with different values of k. Don't forget to try the number 1 as well!
44
  • 45. Not shown is a simple NodeList.prototype.add(node) function that just takes a node and pushes it onto the this.nodes array. At this point, we could dive right in to calculating distances, but I'd like to take a quick diversion.

<center><h1 style="color:blue">REPORT RESULT FOR KEYWORD SEARCH</h1></center><br>
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>k-nearest-neighbor-Ajax-coding</title>
<script type="text/javascript" src="/js/lib/dummy.js"></script>
<link rel="stylesheet" type="text/css" href="/css/normalize.css">
<link rel="stylesheet" type="text/css" href="/css/result-light.css">
<style type="text/css">
#canvas {
  width:400px;
  height:400px;
  border:1px solid black;
  display:block;
  margin:20px auto;
}
</style>
<script type="text/javascript">//<![CDATA[
window.onload = function() {
  /*
   * Expected keys in object:
   * health, hits, type
   */
  var Node = function(object) {
    for (var key in object)
45
  • 46.
    {
      this[key] = object[key];
    }
  };

  Node.prototype.measureDistances = function(hits_range_obj, health_range_obj) {
    var health_range = health_range_obj.max - health_range_obj.min;
    var hits_range = hits_range_obj.max - hits_range_obj.min;

    for (var i in this.neighbors) {
      /* Just shortcut syntax */
      var neighbor = this.neighbors[i];

      var delta_health = neighbor.health - this.health;
      delta_health = (delta_health) / health_range;

      var delta_hits = neighbor.hits - this.hits;
      delta_hits = (delta_hits) / hits_range;

      /* Range-normalized Euclidean distance in the (health, hits) feature space */
      neighbor.distance = Math.sqrt(delta_health*delta_health + delta_hits*delta_hits);
    }
  };

  Node.prototype.sortByDistance = function() {
    this.neighbors.sort(function (a, b) {
      return a.distance - b.distance;
    });
  };

  Node.prototype.guessType = function(k) {
46
  • 47.
    var types = {};

    /* Tally the types of the k nearest neighbors */
    for (var i in this.neighbors.slice(0, k)) {
      var neighbor = this.neighbors[i];
      if ( ! types[neighbor.type] ) {
        types[neighbor.type] = 0;
      }
      types[neighbor.type] += 1;
    }

    /* The majority type among those k neighbors becomes the guess */
    var guess = {type: false, count: 0};
    for (var type in types) {
      if (types[type] > guess.count) {
        guess.type = type;
        guess.count = types[type];
      }
    }

    this.guess = guess;
    return types;
  };
47
  • 48.
  var NodeList = function(k) {
    this.nodes = [];
    this.k = k;
  };

  NodeList.prototype.add = function(node) {
    this.nodes.push(node);
  };

  NodeList.prototype.determineUnknown = function() {
    this.calculateRanges();

    /*
     * Loop through our nodes and look for unknown types.
     */
    for (var i in this.nodes) {
      if ( ! this.nodes[i].type) {
        /*
         * If the node is an unknown type, clone the nodes list and then measure distances.
         */
        /* Clone nodes */
        this.nodes[i].neighbors = [];
        for (var j in this.nodes) {
48
  • 49.
          if ( ! this.nodes[j].type) continue;
          this.nodes[i].neighbors.push( new Node(this.nodes[j]) );
        }

        /* Measure distances */
        this.nodes[i].measureDistances(this.hitss, this.health);

        /* Sort by distance */
        this.nodes[i].sortByDistance();

        /* Guess type */
        console.log(this.nodes[i].guessType(this.k));
      }
    }
  };

  NodeList.prototype.calculateRanges = function() {
    this.hitss = {min: 1000000, max: 0};
    this.health = {min: 1000000, max: 0};
    for (var i in this.nodes) {
      if (this.nodes[i].health < this.health.min) {
        this.health.min = this.nodes[i].health;
      }
      if (this.nodes[i].health > this.health.max) {
        this.health.max = this.nodes[i].health;
49
  • 50.
      }
      if (this.nodes[i].hits < this.hitss.min) {
        this.hitss.min = this.nodes[i].hits;
      }
      if (this.nodes[i].hits > this.hitss.max) {
        this.hitss.max = this.nodes[i].hits;
      }
    }
  };

  NodeList.prototype.draw = function(canvas_id) {
    var health_range = this.health.max - this.health.min;
    var hitss_range = this.hitss.max - this.hitss.min;
    var canvas = document.getElementById(canvas_id);
    var ctx = canvas.getContext("2d");
    var width = 400;
    var height = 400;
    ctx.clearRect(0, 0, width, height);

    for (var i in this.nodes) {
      ctx.save();

      /* Fill colour follows the legend below: School = red, public = green, health = blue */
      switch (this.nodes[i].type) {
50
  • 51.
        case 'School':
          ctx.fillStyle = 'red';
          break;
        case 'public':
          ctx.fillStyle = 'green';
          break;
        case 'health':
          ctx.fillStyle = 'blue';
          break;
        default:
          ctx.fillStyle = '#666666';
      }

      var padding = 40;
      var x_shift_pct = (width - padding) / width;
      var y_shift_pct = (height - padding) / height;
      var x = (this.nodes[i].health - this.health.min) * (width / health_range) * x_shift_pct + (padding / 2);
      var y = (this.nodes[i].hits - this.hitss.min) * (height / hitss_range) * y_shift_pct + (padding / 2);
      y = Math.abs(y - height);

      ctx.translate(x, y);
      ctx.beginPath();
      ctx.arc(0, 0, 5, 0, Math.PI*2, true);
      ctx.fill();
      ctx.closePath();
51
  • 52.
      /*
       * Is this an unknown node? If so, draw the radius of influence
       * (stroke colour matches the guessed type's fill colour).
       */
      if ( ! this.nodes[i].type ) {
        switch (this.nodes[i].guess.type) {
          case 'School':
            ctx.strokeStyle = 'red';
            break;
          case 'public':
            ctx.strokeStyle = 'green';
            break;
          case 'health':
            ctx.strokeStyle = 'blue';
            break;
          default:
            ctx.strokeStyle = '#666666';
        }

        var radius = this.nodes[i].neighbors[this.k - 1].distance * width;
        radius *= x_shift_pct;

        ctx.beginPath();
        ctx.arc(0, 0, radius, 0, Math.PI*2, true);
        ctx.stroke();
        ctx.closePath();
      }

      ctx.restore();
52
  • 53.
    }
  };

  var nodes;

  /*
   * Training data: each point carries a keyword-match score ('health', the key
   * expected by measureDistances and calculateRanges) and a 'hits' count,
   * plus its labelled type.
   */
  var data = [
    {health: 1, hits: 350, type: 'School'},
    {health: 2, hits: 300, type: 'School'},
    {health: 3, hits: 300, type: 'School'},
    {health: 4, hits: 250, type: 'School'},
    {health: 4, hits: 500, type: 'School'},
    {health: 4, hits: 400, type: 'School'},
    {health: 5, hits: 450, type: 'School'},
    {health: 7, hits: 850, type: 'health'},
    {health: 7, hits: 900, type: 'health'},
    {health: 7, hits: 1200, type: 'health'},
    {health: 8, hits: 1500, type: 'health'},
    {health: 9, hits: 1300, type: 'health'},
    {health: 8, hits: 1240, type: 'health'},
    {health: 10, hits: 1700, type: 'health'},
    {health: 9, hits: 1000, type: 'health'},
    {health: 1, hits: 800, type: 'public'},
    {health: 3, hits: 900, type: 'public'},
    {health: 2, hits: 700, type: 'public'},
    {health: 1, hits: 900, type: 'public'},
    {health: 2, hits: 1150, type: 'public'},
    {health: 1, hits: 1000, type: 'public'},
    {health: 2, hits: 1200, type: 'public'},
    {health: 1, hits: 1300, type: 'public'},
  ];

  var run = function() {
53
  • 54.
    nodes = new NodeList(3);
    for (var i in data) {
      nodes.add( new Node(data[i]) );
    }

    /* Add one unknown (unlabelled) point with random feature values */
    var random_health = Math.round( Math.random() * 10 );
    var random_hits = Math.round( Math.random() * 2000 );
    nodes.add( new Node({health: random_health, hits: random_hits, type: false}) );

    nodes.determineUnknown();
    nodes.draw("canvas");
  };

  setInterval(run, 500);
  run();
}//]]>
</script>
</head>
<body text="blue">
<canvas id="canvas" width="500" height="400"></canvas>
<center><p><img src="C:\Users\pradeep\Desktop\rankcanvas.png" width="500" height="400" border=1></p></center>
<center><h3>
<Search>The Search String is "School of Public Health"</Search><br>
<keyword1>Keyword-School is indicated by Red dots</keyword1><br>
<keyword2>Keyword-public is indicated by Green dots</keyword2><br>
<keyword3>Keyword-Health is indicated by Blue dots</keyword3><br>
<keyword4>Other matches are indicated by Black dots</keyword4><br>
54
  • 55.
<match>The circle groups the number of keyword matches</match><br>
</h3>
</center>
</body>
</html>

11.2 Graphing the keyword search "SCHOOL OF PUBLIC HEALTH" with Canvas
Graphing our results is pretty straightforward, with a few exceptions:
• We'll color code school = red, public = green, health = blue.
55
  • 56. 12. MACHINE LEARNED URLS IN EDUCATION DOMAIN STUDY

URL                              RANK
------------------------------------------
www.emergingedtech.com/          276
www.bestindiansites.com          269
www.inspirenignite.com           269
http://www.oyeits.com/           268
www.greenpageclassifieds.com/    265
www.advertisefree.in             264
www.linkout.in                   264
www.jec.in                       263
www.collegesintamilnadu.com      263
www.sheppardsoftware.com         263
56
  • 57. 12.1 DECISION TREE INDUCTION

Table of Contents                 Page Values
---------------------------------------------
1. Analyzed pages                 4.25
2. Page Optimization              3.4
3. URL Optimization               3
4. Website Analytics              3
5. Ranking Monitor                2.7
6. Keywords                       2.7
7. Link Manager                   2.2
8. Search Engine Compatibility    2.1
9. Website Audit                  1.9
10. Link Building                 1.86
11. Website Submission            1.8
12. Site Optimization             1.7
13. Client reports                1.7
14. Social Optimization           1.4
15. Inbox Monitor                 0.8
16. Competitors spy               0.8
57
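Section 12.2 plots these page values as a decision tree. As a purely hypothetical illustration of how such page-value factors could feed a decision-tree style rule, the sketch below combines a few of the factors above into a score and applies a single threshold split; the particular factors, weights, and threshold are illustrative assumptions, not values learned in the study.

// Hypothetical illustration: combine a few page-value factors from the table
// above into a score, then split on a threshold as a one-node decision tree would.
function pageValueScore(site) {
  return (site.analyzedPages    || 0) * 4.25 +
         (site.pageOptimization || 0) * 3.4  +
         (site.urlOptimization  || 0) * 3.0  +
         (site.keywords         || 0) * 2.7;   // weights taken from the page-value table
}

function classifySite(site, threshold) {
  return pageValueScore(site) >= threshold ? 'high-rank candidate' : 'low-rank candidate';
}

// Example: a site satisfying three of the four factors, split at an assumed threshold of 8.
classifySite({analyzedPages: 1, pageOptimization: 1, urlOptimization: 0, keywords: 1}, 8);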
  • 58. 12.2 DECISION TREE BASED ON PAGE VALUES
58
  • 59. 13. LITERATURE REVIEW
• Al-Masri & Mahmoud proposed a solution by introducing the Web Service Relevancy Function (WsRF), which is used to measure the relevancy ranking of a specific Web service using parameters and the preferences of the requester.
• Zheng et al. proposed a Web service recommender system (WSRec), which incorporates a user-contribution mechanism for gathering Web service information together with a hybrid collaborative filtering algorithm.
14. CONCLUSION
A good static ranking is an important component for today's search engines and information retrieval systems. We have demonstrated that PageRank does not provide a very good static ranking; there are many simple features that individually outperform PageRank. By combining many static features, fRank achieves a ranking that has a significantly higher pairwise accuracy than PageRank alone. A qualitative evaluation of the top documents shows that fRank is less technology-biased than PageRank; by using popularity data, it is biased toward pages that Web users, rather than Web authors, visit. The machine learning component of fRank gives it the additional benefit of being more robust against spammers, and allows it to leverage further developments in the machine learning community in areas such as adversarial classification. We have only begun to explore the options, and believe that significant strides can be made in the area of static ranking by further experimentation with additional features, other machine learning techniques, and additional sources of data. 59
  • 60. 15. ACKNOWLEDGEMENT
Thank you to Marc Najork for providing us with additional PageRank computations and to Timo Burkard for assistance with the popularity data. Many thanks to Chris Burges for providing code and significant support in using and training RankNets. Also, we thank Susan Dumais and Nick Craswell for their edits and suggestions. 60
  • 61. 16. BIBLIOGRAPHY
➢ http://drona.csa.iisc.ernet.in/~shivani/Events/SDM-10-Tutorial/sdm10-tutorial.pdf
➢ www.burakkanber.com/blog
➢ http://www.stat.uchicago.edu/~lekheng/meetings/mathofranking/slides/agarwal.pdf

APPENDIX A
1. DATA SET

Table of Contents                 Page Values
---------------------------------------------
1. Analyzed pages                 4.25
2. Page Optimization              3.4
3. URL Optimization               3
4. Website Analytics              3
5. Ranking Monitor                2.7
6. Keywords                       2.7
7. Link Manager                   2.2
8. Search Engine Compatibility    2.1
9. Website Audit                  1.9
10. Link Building                 1.86
11. Website Submission            1.8
12. Site Optimization             1.7
13. Client reports                1.7
14. Social Optimization           1.4
15. Inbox Monitor                 0.8
16. Competitors spy               0.8
61