This document describes a system called UProRevs that aims to personalize web search results based on a user's profile and interests.
The system works as a filter that takes the results from a normal search engine like Google and re-ranks them based on their relevance to the user's profile. It generates user profiles based on information provided during registration, and updates the profiles over time based on the user's feedback on search results.
The system calculates relevance scores for search results by comparing the keywords in each web page to those in the user's profile. Results are displayed along with their relevance scores. As the user provides feedback, their profile is updated, allowing the system to continuously improve the personalization of search results.
may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the Page Rank of E and denoted by PR(E). Page Rank was developed at Stanford University by Larry Page (hence the name Page-Rank [Vise and Malseed, 2005]) and Sergey Brin as part of a research project about a new kind of search engine.

2.3. Personalised Page Rank Algorithm
The framework of a traditional personalised search engine consists of two phases: a data collection phase and a data ranking phase. In the first phase, user interests are captured from the user’s browsing behaviour by extracting the keywords of the web pages visited by the user.
Due to the large number of keywords per page, each keyword is assigned a weight depending on its frequency. The keywords are then classified into m global categories, and the Personalised Page Rank (PPR) algorithm is applied to assign ranks to the search list. The Personalised Page Rank Algorithm has 4 stages [1]:
i. In the 1st stage, web pages are clustered into the global categories, and larger weights are assigned to those web pages whose category is similar to the user’s interests.
ii. In the 2nd stage, web pages are further ranked based on whether their categories satisfy the user’s interests and the query submitted.
iii. In the 3rd stage, the algorithm checks for changes in the user’s interests.
iv. In the final stage, the search results undergo collaborative filtering based on recommendations by other users having similar interests.

2.4. User Profile Construction without User’s Effort
This work deals with the construction of user profiles without any effort from the user. The generated profile is used to filter the search results. The profiles are constructed in two ways [2]:

2.5. Profile Construction based on Pure Browsing
This method assumes that the preferences of each user consist of two aspects: (1) persistent (or long term) preferences and (2) ephemeral (or short term) preferences. With persistent preferences, the user profile is developed incrementally over time and stored for use in later sessions. The information exploited for constructing the profile usually comes from various sources, so it relies on different aspects of the user. With ephemeral preferences, on the other hand, the information used to construct each user profile is gathered only during the current session and is immediately exploited for executing some adaptive process aimed at personalizing the current interaction.

2.6. Profile Construction based on Modified Collaborative Filtering
In this approach, the following two methods are proposed: (1) user profile construction based on a static number of users in the neighbourhood, and (2) user profile construction based on a dynamic number of users in the neighbourhood.

3. System Architecture
The UProRevs system architecture is as shown in Figure 1.

3.1. Description of System Architecture
Core- It acts as an interface between the user, the search engine and the UProRevs system. It accepts the query from the user and fires it to the remote search engine. On receiving the results, it interacts with the UProRevs system and displays the results to the user along with their relevance.
Profile Generator- The Profile Generator develops the profile, which is basically a large set of keywords, with their frequencies, that are related to the user.

Profile Updater- The Profile Updater updates the user’s profile based on the feedback given by the user.

Relevance Calculator- The Relevance Calculator calculates the relevance of a particular URL with respect to the user’s profile. Its output is the URL along with its relevance. This output is given to the Core, which displays the results along with their relevance. The output is also stored in the Relevance Log for future reference.

Search Engine- It is an engine which provides a set of results when the query is fired by the Core. The results provided are simple and sorted as per the ranking algorithm of the search engine.

User Information Database- The User Information Database stores the information that the user has entered during registration. This information forms the base of the user profile, i.e. the initial user profile is created using the information in this database.

User Profiles- The User Profiles Database stores the profiles of all the users that are generated by the Profile Generator. This database basically consists of a set of keywords that are related to the profile of the user, along with a rank which specifies the relevance of each keyword to the user’s profile. The Profile Updater updates this database whenever the user gives feedback about a particular webpage.

Feedback Log- The Feedback Log consists of a log of the URLs that the user has visited and the feedback given by the user. This feedback is used by the Profile Updater to update the user’s profile every time he/she gives feedback about a particular webpage.

Relevance Log- The Relevance Log stores the URLs and their relevance. Before calculating the relevance for a particular URL, the Relevance Calculator consults the Relevance Log to check whether the relevance for this URL has already been calculated.

4. System’s Flow
1. User registers by giving his personal and professional details.
2. A profile is generated by the profile generator based on the details given by the user.
3. User enters search query.
4. Search Engine provides current results.
5. Relevance of the results with the user’s profile is calculated.
6. The results are displayed along with their relevance.
7. User gives feedback for the webpage he visits.
8. User profile is updated by the profile updater based on feedback given.

5. System’s Mechanism
The goal of our project is to apply our filter to the results given by a normal search engine like Google and get results of the same quality as Google, but along with their relevance to the user’s profile.

5.1. Relevance of Search Results
The results that are displayed for a particular query should be relevant to the user’s profile. Relevancy is a relative term, and hence the user’s registered information and feedback decide the amount of relevancy assigned to a webpage for a particular query.

5.2. Calculation of Relevancy
The user profile will basically be a large set of keywords along with a frequency value. These keywords will be sorted in decreasing order of frequency and given a rank based on that frequency.

Table 1.
Keywords (ku_i)   Frequency (fu_i)   Rank (r_i)
ku_0              fu_0               r_0
ku_1              fu_1               r_1
ku_2              fu_2               r_2
...               ...                ...
ku_N              fu_N               r_N

The User Profile can be represented as in Table 1. When the user enters a query, the Core fires this query to the search engine, which returns the results in the form of a list of URLs sorted according to the Page Rank Algorithm of the remote search engine. Keywords are extracted from the web pages specified in the result. A dynamic Webpage Profile is created, which contains the keywords on that particular webpage along with their frequencies. The keywords in the Webpage Profile are also ranked based on their frequency. The Webpage Profile can be represented as in Table 2.

Thus the mechanics of the UProRevs system are initialised by developing the User Profile and Webpage Profile. After generating the two profiles, the system compares them to calculate the Relevance of the page. After calculating the relevance the results are displayed along with their
relevance to the user’s profile.

Table 2.
Keywords (kw_i)   Frequency (fw_i)   Rank (s_i)
kw_0              fw_0               s_0
kw_1              fw_1               s_1
kw_2              fw_2               s_2
...               ...                ...
kw_n              fw_n               s_n

Once the results are displayed, the user checks the relevance and selects an appropriate URL to view. On viewing and analysing the webpage, the user will be in a position to say how relevant the webpage is to his/her profile and will give feedback on the webpage accordingly. Once the feedback is given, the user’s profile is updated based on the rating given by the user.

6. Evaluation Metrics

6.1. Relevance Calculation
As described in the previous section, we have two profiles which consist of keywords and their corresponding ranks. The relevance in the UProRevs system is determined by calculating the weight of a particular webpage with respect to the User Profile. The weight of the webpage is calculated by first calculating the weight of each keyword in the webpage that exists in the User Profile.
The factors to be considered while calculating the weight of each keyword are as follows:

i. Difference in the ranks of the keyword in the User Profile and Webpage Profile - The smaller the difference in the ranks of the keyword, the more relevant it is to the User’s Profile, and so a higher weight needs to be given to that keyword. The greater the difference in the ranks of the keyword, the less relevant the keyword is to the User’s Profile, and so a lower weight needs to be given to that keyword.

ii. Rank of the keyword in the Webpage Profile - The higher the rank of the keyword in the Webpage Profile, the higher the weight it gets, and vice versa. This factor covers the case in which two keywords have the same rank difference but occur with different frequencies on the webpage.

Considering these two factors, the weight of each keyword K_i can be calculated as follows:

W_i = 1                  for d_i = 0
W_i = 1 / (d_i s_i)      for d_i ≠ 0

where
d_i = |r_i − s_i|
W_i = Weight of the keyword with respect to the profile.
s_i = Rank of the keyword K_i in the Webpage Profile.
r_i = Rank of the keyword K_i in the User Profile.

Thus if there is no difference between the ranks of the keyword in the User Profile and the Webpage Profile, it gets a weight of 1, which is the highest weight for any keyword. When there is a difference in the ranks, the weight decreases as the rank difference increases and as the frequency of the keyword in the webpage decreases.

Further, to calculate the weight of the entire page, we take the sum of the weights of each keyword in the webpage that exists in the User Profile. Thus the weight of the webpage W can be calculated as follows:

W = \sum_{i=1}^{n} W_i

where
W_i = Weight of the keyword K_i.
n = Number of keywords in the webpage.

Since the maximum weight a keyword can get is 1, the maximum weight that a webpage can get is n (the number of keywords on the webpage). Thus the Relevance ρ of the webpage can be calculated from the weight of the webpage as follows:

ρ = (W / n) × 100

where
W = Weight of the webpage.
n = Number of keywords in the Webpage Profile.

Thus Relevance ρ can be defined as the ratio of the weight of the webpage to the maximum weight that the particular page can be assigned, expressed as a percentage.

6.2. Profile Updation
When a user visits a website and gives feedback f, specifying that the particular website is f% relevant to his/her profile and query, the profile is updated as per the value of f given by the user.

The Webpage Profile consists of keywords and their respective frequencies. The User Profile is updated by updating the frequency of those keywords in the User Profile that exist in the Webpage Profile. The updating of keywords depends on the following factors:

i. The feedback rating given by the user for the webpage.
ii. The frequency of the keyword in the webpage.

Based on these factors, the frequency of the keyword can be updated using the following formula:
nf = cf (1 + r/100)^n

where
cf = current frequency of the keyword.
nf = new frequency of the keyword.
r = f/100, feedback rating.
n = frequency of the keyword in the webpage.

7. Test Cases
The UProRevs concept was implemented under the name of the Personalized Search Engine BINGbeta. The test cases consist of testing the system against the query 'python' for two user profiles, a Programmer and a Zoologist. For a Programmer, the system should give higher relevance to URLs related to the programming language 'python', whereas for a Zoologist, the URLs related to the snake 'python' must get a higher relevance.

The result set consists of the URLs obtained after firing the query 'python' for a programmer and a zoologist. Note that the URLs selected for the result set relate either to the programmer or to the zoologist; results that had little content on their page were discarded. If there is a clear distinction between the relevance for a programmer and for a zoologist, it would show that the system is indeed recognizing a distinction between a Programmer and a Zoologist. Table 3 shows the structure of a result set.

The result set consists of 5 columns, which represent the URL, the relevance for the programmer, the relevance for the zoologist, the rank of the relevance for the programmer and the rank of the relevance for the zoologist. The distinction between the two sets of relevance can be shown by Spearman's Rank Correlation.

Table 3.
URL      Programmer (P)   Zoologist (Z)   Rank (Rp)   Rank (Rz)
url 1    P1               Z1              Rp1         Rz1
url 2    P2               Z2              Rp2         Rz2
...      ...              ...             ...         ...
url n    Pn               Zn              Rpn         Rzn

7.1. Spearman's Rank Correlation
In statistics, Spearman’s rank correlation coefficient, named after Charles Spearman and often denoted by the Greek letter ρ (rho), is a non-parametric measure of correlation; that is, it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables.

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to rankings before calculating the coefficient. In practice, however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks, and the differences d between the ranks of each observation on the two variables are calculated. ρ is then given by:

ρ = 1 − (6 \sum d_i^2) / (n (n^2 − 1))

where
d_i = the difference between the ranks of corresponding values of P and Z.
n = the number of pairs of values.

The above formula gives the correlation ρ between the two sets of relevance. The value of ρ ranges from -1 to +1. The closer ρ is to -1 or +1, the more closely the two variables are related. If ρ is close to 0, there is no relation between the variables. Thus, to show that the system is distinguishing between the two profiles, we need to get values of ρ close to 0.

To test the UProRevs system we considered 3 result sets. Result Set 1 reflects the state of the system when the two profiles were just created. Result Set 2 reflects the state of the system after a few iterations, and Result Set 3 reflects the state of the system after further iterations. The ρ values for Result Sets 1, 2 and 3 were calculated as 0.618, 0.296 and 0.15 respectively. It can be observed that the correlation between the two sets of values decreases with time. Figure 3 gives the plot of Time versus the Correlation between the two sets of values.

Figure 3 plots the correlation values at different instances of time. It shows that as time increases, the correlation approaches zero. In other words, the distinction between the two sets of values increases, indicating that the system is recognising the difference between the profile of a Programmer and that of a Zoologist.
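The evaluation described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the BINGbeta implementation: all function names and the toy profiles are invented for the example, ranks are assumed to start at 1 (most frequent keyword), ties in frequency are broken alphabetically, the update rule of Section 6.2 is read literally (r = f/100, then 1 + r/100), and the Spearman formula is applied assuming no tied relevance values.

```python
def rank_keywords(freqs):
    """Rank keywords by decreasing frequency (rank 1 = most frequent).

    Assumption: ranks start at 1 and ties are broken alphabetically.
    """
    ordered = sorted(freqs, key=lambda kw: (-freqs[kw], kw))
    return {kw: pos for pos, kw in enumerate(ordered, start=1)}


def keyword_weight(r_i, s_i):
    """W_i = 1 if d_i == 0, else 1 / (d_i * s_i), where d_i = |r_i - s_i|."""
    d_i = abs(r_i - s_i)
    return 1.0 if d_i == 0 else 1.0 / (d_i * s_i)


def relevance(user_freqs, page_freqs):
    """Relevance = (W / n) * 100, n = number of keywords in the Webpage Profile."""
    user_rank = rank_keywords(user_freqs)
    page_rank = rank_keywords(page_freqs)
    n = len(page_rank)
    if n == 0:
        return 0.0
    # Only keywords that also exist in the User Profile contribute to W.
    w = sum(keyword_weight(user_rank[kw], s_i)
            for kw, s_i in page_rank.items() if kw in user_rank)
    return 100.0 * w / n


def update_frequency(cf, f, n):
    """nf = cf * (1 + r/100) ** n with r = f/100 (literal reading of Section 6.2)."""
    r = f / 100.0
    return cf * (1.0 + r / 100.0) ** n


def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); assumes no tied values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")

    def ranks(values):
        # Rank 1 goes to the largest value.
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        out = [0] * n
        for pos, i in enumerate(order, start=1):
            out[i] = pos
        return out

    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Toy profiles and pages, invented purely for illustration.
    programmer = {"code": 9, "python": 7, "library": 4}
    zoologist = {"snake": 9, "python": 7, "habitat": 4}
    pages = {
        "url 1": {"python": 5, "code": 3, "library": 1},
        "url 2": {"snake": 6, "python": 4, "habitat": 2},
        "url 3": {"python": 4, "snake": 2, "code": 1},
    }
    p = [relevance(programmer, kw) for kw in pages.values()]
    z = [relevance(zoologist, kw) for kw in pages.values()]
    print("P:", p, "Z:", z, "rho:", spearman_rho(p, z))
```

The two relevance lists correspond to the P and Z columns of Table 3, and `spearman_rho` ranks them internally to obtain Rp and Rz. Because a keyword at rank 0 with a non-zero rank difference would make 1 / (d_i s_i) undefined, the sketch starts ranks at 1; the paper does not state which convention it uses.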
Figure 3: Graph of Time v/s ρ

8. Conclusion
The UProRevs system provides the user with relevant search results, thus saving the user valuable time that would otherwise be spent while using a general search engine. A few drawbacks, such as dishonest details provided by the user at the time of registration or unfair ratings provided by users, may prove to be critical. However, it must be emphasised that the UProRevs system describes the architecture of a simple Personalized Search Engine.

Another point to be noted is that we have discussed the UProRevs system as a subordinate system to a remote general search engine. This subordinate system can be transformed into a stand-alone search engine in which the relevance would act as a parameter to re-rank search results, thus redefining web search.

9. Acknowledgment
This work has been done in the Multimedia Laboratory of Don Bosco Institute of Technology and is fully supported by Don Bosco Institute of Technology, Mumbai, India.

10. References
[1] Wen-Chih Peng and Yu-Chin Lin. “Ranking Web Search Results from Personalized Perspective”. Proceedings of the IEEE joint Conference on E-Commerce Technology (CEC’06) and Enterprises Computing, E-Commerce and E-Service (EEE’06), San Francisco, California, June 26-29, 2006.
[2] Kazunari Sugiyama, Kenji Hatano, Masatoshi Yoshikawa. “Adaptive Web Search Based on User Profile Constructed without Any Effort from Users”. Proceedings of the 13th International Conference on World Wide Web, New York, USA, ISBN: 1-58113-844-X, Pages 675-684, (2004).
[3] Ricardo Baeza-Yates, Carlos Hurtado, Marcelo Mendoza and Georges Dupret. “Modelling User Search Behavior”. Proceedings of the Third Latin American Web Congress, (2005).
[4] Yuefeng Li and Ning Zhong. “Mining Ontology for Automatically Acquiring Web User Information Needs”. IEEE Transactions on Knowledge and Data Engineering, Volume 14, Issue 4, ISSN: 1041-4347, (2006).
[5] Boris Chidlovskii, Natalie S. Glance and M. Antonietta Grasso. “Collaborative Re-Ranking of Search Results”. Proceedings of the AAAI-2000 Workshop on AI for Web Search, (2000).
[6] Sergey Brin and Lawrence Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Proceedings of the 7th International Conference on World Wide Web, Australia, Elsevier Science Publishers, ISSN: 0169-7552, Pages 107-117, (1998).
[7] Ray-I Chang, Jan-Ming Ho. “Active Feedback for Effective Web Search”. Technical Report No. TR-IIS-05-013, September 2005.
[8] Hsin-Chang Yang, Chung-Hong Lee. “Automatic Metadata Generation for Web Pages Using a Text Mining Approach”. Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration (WIRE05), April 2005.
[9] Sung-Won Jung, Hyuk-Chul Kwon. “A Scalable Hybrid Approach for Extracting Head Components from Web Tables”. IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 2, Pages 174-187, February 2006.
[10] Lawrence Page, Sergey Brin, Rajeev Motwani and T. Winograd. “Page Rank Citation Ranking: Bringing Order to the Web”. http://dbpubs.stanford.edu:8090/pub/showDoc.pdf
[11] Akshay Surve, Manav Shah and Amiya Tripathy. “Optimizing Web search engine by using user feedback”. Proceedings of the International Conference on Business and Information (BAI2007), July 2007, Tokyo, Japan, Volume 4, ISSN: 1729-9322.