5. Slide from LIS 544 IMT 542 INSC 544 by Jeff Huang lazyjeff@uw.edu and Shawn
Walker stw3@uw.edu
The document with the highest proportion of query terms is considered the most
relevant:
• Documents containing more of the term(s) scored higher
• Longer documents discounted
• Rare terms weighted higher
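The three bullets above correspond roughly to term frequency, length normalization, and inverse document frequency. A minimal sketch of that scoring follows; whitespace tokenization, the smoothed IDF formula, the function names, and the toy documents are all assumptions made for illustration, not something the slide specifies.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(doc_tokens)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in doc_tokens:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            # More occurrences score higher; dividing by the document
            # length discounts longer documents.
            tf = counts[term] / len(tokens)
            # Terms found in few documents (rare terms) get a larger weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical toy corpus.
docs = [
    "hilltop ranks expert pages",
    "pagerank ranks pages by links and pagerank ignores the query and the topic of pages",
    "search engines rank pages",
]
scores = tf_idf_scores("hilltop pages", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In this toy corpus the second document mentions "pages" twice but is much longer, so it scores below the short third document that mentions it once; the first document wins because it also contains the rare term "hilltop".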
7. Hilltop was one of the first algorithms to introduce the concept of machine-mediated
“authority” to combat the human manipulation of results for commercial gain (using link blast
services and viral distribution of misleading links). It is used by all of the search engines in
some way, shape or form.
Hilltop is:
• Performed on a small subset of the corpus that best represents the nature of the whole
• Authorities: have lots of unaffiliated expert documents on the same subject pointing to them
• Pages are ranked according to the number of non-affiliated “experts” that point to them, that is,
experts not in the same site or directory
• Affiliation is transitive [if A=B and B=C then A=C]
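The transitivity rule and the "non-affiliated experts" count in the list above can be sketched with a union-find structure. This is only an illustration under stated assumptions: the affiliation pairs are supplied as input rather than detected, every linking host is treated as an expert, and the class, function, and host names are hypothetical. The full Hilltop scoring is more involved than a simple count.

```python
class AffiliationSets:
    """Union-find over hosts, which makes affiliation transitive."""

    def __init__(self):
        self.parent = {}

    def find(self, host):
        self.parent.setdefault(host, host)
        while self.parent[host] != host:
            # Path halving keeps later lookups short.
            self.parent[host] = self.parent[self.parent[host]]
            host = self.parent[host]
        return host

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def authority_scores(expert_links, affiliated_pairs):
    """Count, per target, the mutually unaffiliated experts linking to it."""
    groups = AffiliationSets()
    for a, b in affiliated_pairs:
        groups.union(a, b)

    supporters = {}
    for expert, target in expert_links:
        expert_group = groups.find(expert)
        if expert_group == groups.find(target):
            continue  # an expert affiliated with the target does not count
        # Experts affiliated with each other collapse into a single vote.
        supporters.setdefault(target, set()).add(expert_group)
    return {target: len(found) for target, found in supporters.items()}


# Hypothetical data: a=b and b=c, so a and c are affiliated as well.
affiliated = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("blog.shop.example", "shop.example"),
]
links = [
    ("a.example", "shop.example"),
    ("c.example", "shop.example"),
    ("d.example", "shop.example"),
    ("blog.shop.example", "shop.example"),
    ("d.example", "wiki.example"),
]
print(authority_scores(links, affiliated))
```

Here `shop.example` receives four links but only two independent votes: `a.example` and `c.example` count once because they are affiliated through `b.example`, and the link from the site's own blog is ignored.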
The beauty of Hilltop is that unlike PageRank, it is query-specific and reinforces the
relationship between the authority and the user’s query. You don’t have to be big or have a
thousand links from auto parts sites to be an “authority.” Google’s 2003 Florida update,
rumored to contain Hilltop reasoning, caused many sites with extraneous links to fall from
their previously lofty placements.
Google artificially inflates the placement of results from Wikipedia because it perceives
Wikipedia as an authoritative resource due to social mediation and commercial agnosticism.
Wikipedia is not infallible. However, someone finding it in the “most relevant” top results will
likely see it that way.
8. Computes PageRank (PR) based on a set of representative topics [augments PR with content
analysis]
Topics are derived from the Open Directory Project
Uses a set of ranking vectors: pre-query selection of topics + at-query comparison of the
similarity of the query to the topics
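The two phases named above (ranking vectors computed before the query, then blended at query time) might be sketched as follows. This assumes each topic biases the teleport step of a standard power-iteration PageRank toward that topic's pages, and that query-to-topic similarity arrives as precomputed weights; the graph, the topic assignments, the weights, and all function names are hypothetical.

```python
def pagerank(graph, teleport, damping=0.85, iterations=50):
    """Power iteration with a custom teleport distribution."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport.get(n, 0.0) for n in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for dest in out_links:
                    new_rank[dest] += share
            else:
                # Dangling page: hand its rank back via the teleport vector.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport.get(n, 0.0)
        rank = new_rank
    return rank


def topic_vectors(graph, topic_pages):
    """Pre-query step: one topic-biased PageRank vector per topic."""
    vectors = {}
    for topic, pages in topic_pages.items():
        teleport = {p: 1.0 / len(pages) for p in pages}
        vectors[topic] = pagerank(graph, teleport)
    return vectors


def query_time_rank(vectors, topic_weights):
    """At-query step: blend topic vectors by query-to-topic similarity."""
    total = sum(topic_weights.values())
    blended = {}
    for topic, weight in topic_weights.items():
        for page, score in vectors[topic].items():
            blended[page] = blended.get(page, 0.0) + (weight / total) * score
    return blended


# Hypothetical link graph; every link target also appears as a key.
graph = {
    "cars1": ["cars2"],
    "cars2": ["cars1", "hub"],
    "hub": ["cars1", "cook1"],
    "cook1": ["cook2"],
    "cook2": ["cook1", "hub"],
}
topic_pages = {"autos": ["cars1", "cars2"], "cooking": ["cook1", "cook2"]}

vectors = topic_vectors(graph, topic_pages)
# Assumed similarity of a car-related query to each topic.
ranking = query_time_rank(vectors, {"autos": 0.9, "cooking": 0.1})
print(sorted(ranking, key=ranking.get, reverse=True))
```

Because the expensive PageRank runs happen once per topic ahead of time, the query-time work is only a weighted sum, which is what makes the approach practical at search time.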
9. Pew Internet Trust Study of Search engine behavior
http://www.pewinternet.org/Reports/2012/Search-Engine-Use-2012/Summary-of-findings.aspx
Moreover, users report generally good outcomes and relatively high confidence in the capabilities of
search engines:
• 91% of search engine users say they always or most of the time find the information they are
seeking when they use search engines
• 73% of search engine users say that most or all the information they find as they use search
engines is accurate and trustworthy
• 66% of search engine users say search engines are a fair and unbiased source of information
• 55% of search engine users say that, in their experience, the quality of search results is getting
better over time, while just 4% say it has gotten worse
• 52% of search engine users say search engine results have gotten more relevant and useful over
time, while just 7% report that results have gotten less relevant
Using the Internet: Skill Related Problems in User Online Behavior; van Deursen & van Dijk; 2009
• 56% constructed poor queries
• 55% selected irrelevant results one or more times
• 38% were overwhelmed by the amount of information in results
• 34% found critical information missing from results