Search Engines
- 1. part 1: search engines
part 2: digital libraries
© Tefko Saracevic 1
- 2. dictionary definitions
search
COMPUTING (transitive verb) to examine a computer file,
disk, database, or network for particular information
engine
something that supplies the driving force or energy to a
movement, system, or trend
search engine
a computer program that searches for particular keywords
and returns a list of documents in which they were found,
especially a commercial service that scans documents on
the Internet
- 3. about definition of search engines
• oh well …
search engines do not search only for
keywords, some search for other stuff as
well
• and they are really not “engines” in the
classical sense
but then mouse is not a “mouse”
- 5. How Search Engines Work
(Sherman 2003)
[Diagram. Components shown: the Web (URL1, URL2, URL3, URL4), Crawler, Indexer, Database, Search Engine, Browser. Sample query from the browser: “Eggs?”. Sample ranked output: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%]
- 6. how do search engines work? elaboration
• crawlers, spiders: go out to find content
in various ways go through the web looking for
new & changed sites
periodic, not for each query
no search engine works in real time
some search engines crawl for themselves,
others do not
they buy content from companies such as Inktomi
for a number of reasons crawlers do not cover
all of the web – just a fraction
what is not covered is “invisible web”
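
A minimal sketch of the crawling idea above, not any engine's actual crawler. It assumes a hypothetical in-memory "web" (the `TOY_WEB` dictionary) instead of real HTTP requests, and a plain breadth-first strategy. A page that no crawled page links to is never reached, which is one simple way a page ends up in the "invisible web".

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "URL1": ["URL2", "URL3"],
    "URL2": ["URL3"],
    "URL3": ["URL1", "URL4"],
    "URL4": [],
    "URL5": ["URL1"],  # no crawled page links here
}

def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl: follow links from the seeds, visiting each page once."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = crawl(["URL1"], TOY_WEB)
invisible = sorted(set(TOY_WEB) - set(pages))
print(pages)      # ['URL1', 'URL2', 'URL3', 'URL4']
print(invisible)  # ['URL5']
```

The `max_pages` limit stands in for the practical reasons a crawl stops early; a real crawler would also re-visit pages periodically to pick up changes.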
- 7. elaboration …
• organizing content: labeling, arranging
indexing for searching – automatic
keywords and other fields
arranging by URL popularity - e.g. PageRank, as in Google
classifying as directory
mostly human handpicked & classified
• as a result of different organization we have
basically two kinds of search engines:
search – input is a query that is then searched & displayed
directory – classified content – a class is displayed
– and fused: directories have now also search
capabilities & vice versa
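
Automatic keyword indexing is commonly implemented as an inverted index: a map from each keyword to the documents that contain it. The sketch below assumes three hypothetical documents (`DOCS`) and a deliberately simple tokenizer; real engines index many more fields and keep their details proprietary.

```python
import re
from collections import defaultdict

# Hypothetical documents, loosely modeled on the "eggs" example.
DOCS = {
    "URL1": "All about eggs",
    "URL2": "Eggo waffles and eggs",
    "URL3": "The ego and the id",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every keyword of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(DOCS)
print(search(index, "eggs"))       # ['URL1', 'URL2']
print(search(index, "eggo eggs"))  # ['URL2']
```

The index is built once, ahead of time, so a query only looks up keywords instead of scanning documents.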
- 8. elaboration (cont.)
• databases, caches: storing content
humongous files usually distributed over many computers
• query processor: searching, retrieval, display
takes your query as input
engines have differing rules for how a query is handled
displays ranked output
some engines also cluster output and provide visualization
• at the other end is your browser
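
A rough sketch of what "takes your query as input, displays ranked output" can mean. The scoring rule here (count how often the query keywords occur in each document) is an assumption chosen for simplicity; as noted, engines differ in their rules and mostly do not publish them. `DOCS` is hypothetical sample data.

```python
import re
from collections import Counter

# Hypothetical stored documents.
DOCS = {
    "URL1": "eggs eggs and more eggs",
    "URL2": "eggo waffles with eggs",
    "URL3": "the ego and the id",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, docs):
    """Score each document by keyword occurrences; return best matches first."""
    query_words = tokenize(query)
    scored = []
    for url, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[w] for w in query_words)
        if score > 0:
            scored.append((url, score))
    # Highest score first; ties broken by URL so the output is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

print(rank("eggs", DOCS))  # [('URL1', 3), ('URL2', 1)]
```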
- 9. elaboration…
similarities, differences
• all search engines have these basic parts in
common
• BUT the actual processes – methods how
they do it – are based on various algorithms
& they differ
most are proprietary with details kept mostly
secret but based on well known principles from
information retrieval or classification
to some extent Google is an exception – they
published their method
- 10. case of Google
• developed by Sergey Brin and Lawrence Page
while students at Stanford
in the beginning it ran on Stanford computers
• basic approach has been described in their
famous paper
“The Anatomy of a Large-Scale Hypertextual
Web Search Engine”
well written, simple language, has their pictures
in acknowledgement they cite the support by NSF’s Digital
Library Initiative i.e. initially, Google came out of
government sponsored research
describe their method PageRank - based on ranking
hyperlinks as in citation indexing
“We chose our system name, Google, because it is a
common spelling of googol, or 10^100”
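
A small sketch of the PageRank idea: a page ranks higher when highly ranked pages link to it, much as a paper gains weight from being cited by well-cited papers. This is an illustration, not Google's production method. The assumptions: a hypothetical four-page link graph (`LINKS`), a damping factor of 0.85, ranks normalized to sum to 1, and pages without outgoing links spreading their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links shares rank with all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up on top and D, which nothing links to, at the bottom.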
- 11. coverage differences
• no engine covers more than a fraction of WWW
estimates: none more than 16%
hard (even impossible) to discern & compare coverage, but they
differ substantially in what they cover
• in addition:
many national search engines
own coverage, orientation, governance
many specialized or domain search engines
own coverage geared to subject of interest
many comprehensive sources independent of search engines
some have compilations of evaluated web sources
- 12. searching differences
• substantial differences among search engines
in searching, retrieval, display
need to know how they work & differ in respect to
defaults in searching a query
searching of phrases, case sensitivity, categories
searching of different fields, formats, types of resources
advanced search capabilities and features
possibilities for refinement, using relevance feedback
display options
personalization options
- 13. business model differences
several business models
• public good - have independent budget
e.g. PubMed, Librarians’ Index to Internet
• earn revenue from provision of information
all commercial search engines
• using search engines to promote their other
activities
e.g. telephone directories
- 14. sponsorship differences
• need to understand how each engine treats
sponsorship – it influences what engines
search & how they display results
some list results from sponsored sites separately
so you can tell reasonably clearly what is there
because it is sponsored & what is not
some have display-per-pay - showing first the sites
that paid most & do not even tell you that
some have pay per update of sites
• imperative to find sources that explain these
models for different engines to know what is
covered & what you are getting
- 15. limitations
• every search engine has limitations as to
coverage
meta engines just follow coverage limitations & have more
of their own
search capabilities
finding quality information
• some have compromised search with economics
becoming little more than advertisers
• but search engines are also often victims
of spamdexing
affecting what is included and how ranked
- 16. spamming a search engine
• use of techniques that push rankings higher
than they belong is also called spamdexing
methods typically include textual as well as link-
based techniques
like e-mail spam, search engine spam is a form
of adversarial information retrieval
the conflicting goals: accurate results for search
providers & high positioning for content providers
- 17. meta search engines
• meta engines search multiple engines
getting combined results from a variety of
engines
• do not have their own databases
but have their own business models
affecting results
• a number of techniques used
interesting ones: clustering, statistical
analyses
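
One way a meta engine might combine results is sketched below. The slides do not say which method any particular meta engine uses; reciprocal rank fusion is swapped in here purely for illustration, and the engine names and result lists are hypothetical. Each URL earns 1 / (k + position) from every engine that returns it, so URLs returned high up by several engines rise to the top.

```python
from collections import defaultdict

def fuse(result_lists, k=60):
    """Merge ranked lists from several engines using reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists.values():
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + position)
    # Highest combined score first; ties broken by URL so the output is stable.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical ranked results returned by three underlying engines.
ENGINE_RESULTS = {
    "engine_a": ["URL1", "URL2", "URL3"],
    "engine_b": ["URL2", "URL4"],
    "engine_c": ["URL2", "URL1"],
}

merged = fuse(ENGINE_RESULTS)
print(merged)  # ['URL2', 'URL1', 'URL4', 'URL3']
```

Because the meta engine has no database of its own, the merged list can never cover more than the underlying engines already returned.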
- 18. how to find a search engine?
• variety of resources that list or categorize
engines
SearchEngines.com
search for engines by topic, geography, reference
Search Engine Guide
engines categorized by topic; other engine information
Search Engine Colossus
international directory of search engines by country, topic from 198
countries and 61 territories; engines in choice of languages
Phil Bradley’s country based search engines
over 2000 search engines from countries all over the globe
- 19. sample of meta engines
- with organized results
Dogpile
results from a number of leading search engines; gives
source, so overlap can be compared; (has also a (bad)
joke of the day)
Surfwax
gives statistics and text sources & linking to sources; for
some terms gives related terms to focus
Teoma
results with suggestions for narrowing; links resources
derived; originated at Rutgers
Turbo10
provides results in clusters; engines searched can be
edited
- 20. meta search engines (cont.)
• Large directory
Complete Planet
directory of over 70,000 databases & specialty engines
• Results with graphical displays
Vivisimo
clusters results; innovative
Webbrain
results in tree structure – fun to use
Kartoo
results in display by topics of query
- 21. domain engines & catalogs
• cover specific subjects & topics
• important tool for subject searches
particularly for subject specialist
valued by professional searchers
• selection mostly hand-picked rather than by
crawlers, following inclusion criteria
often not readily discernable
but content more trustworthy
- 22. domain engines … sample
Open Directory Project
large edited catalog of the web – global, run by volunteers
BUBL LINK
selected Internet resources covering all academic subject
areas; organized by Dewey Decimal System – from UK
Profusion
search in categories for resources & search engines
Resource Discovery Network – UK
“UK's free national gateway to Internet resources for the
learning, teaching and research community”
- 23. domain engines … sample
Think Quest – Oracle Education Foundation
• education resources, programs; web sites created by students
All Music Guide
• resource about musicians, albums, and songs
Internet Movie Database
• treasure trove of American and British movies
Genealogy links and surname search engines
well.. that is getting really specialized (and popular)
Daypop
searches the “living web”: “The living web is composed of sites that update
on a daily basis: newspapers, online magazines, and weblogs”
- 24. science, scholarship engines …
sample free access
Psychcrawler - Amer Psychological Association
web index for psychology
Entrez PubMed – Nat Library of Medicine
biomedical literature from MEDLINE & health journals
CiteSeer - NEC Research Center
scientific literature, citations index; strong in computer
science
Google Scholar
searches for scholarly articles & resources
Infomine
scholarly internet research collections
Scirus
scientific information in journals & on the web
- 25. science, scholarship engines …
sample commercial access
• in addition to freely accessible engines
many provide search free but access to full
text paid
by subscription or per item
RUL provides access to these & many more:
ScienceDirect
Elsevier: “world's largest electronic collection of science, technology and
medicine full text and bibliographic information”
ACM Portal
Assoc. for Computing Machinery: access to ACM Digital Library & Guide to
Computing
- 26. where to find out?
• information about search engines in sources
that have updates, news, tips for searching
and more – a MUST for searchers:
Search Engine Watch
ratings, news, statistics, charts, explanations, tutorials
Search Engine Showdown
“The users’ guide to web searching” - run by a librarian, news
links, ratings
Virtual Chase
a site about “Teaching Legal Professionals How To Do
Research,” this section has very good tips and links for
consideration of quality on the web
- 27. where? ….
SiteLines
a blog, written by Rita Vine, a professional librarian, &
web search trainer; many evaluations in archive
ResourceShelf
“Resources and News for Information Professionals,”
edited by Gary Price, a librarian & author of Invisible
Web – has extensive archive
WebsearchAbout
not evaluative, but provides news, capabilities, sources,
articles about web searching
- 30. definition
• digital libraries are viewed from several perspectives
technical: “Digital library is a managed collection of
information, with associated services, where
information is stored in digital format and accessible
over a network.” (Arms, 2000)
institutional: “Digital libraries are organizations that
provide the resources, including the specialized staff,
to select, structure, offer intellectual access to,
interpret, distribute, preserve the integrity of, and
ensure the persistence over time of collections of
digital works so that they are readily and
economically available for use by a defined
community or set of communities.” (Waters, 1998)
- 31. a bit of context
• short but volatile history
research & development took off in the early to mid 1990s
in the next decade phenomenal growth worldwide
large investment in research & building
• number of communities involved
computer science, primarily in research
many subjects: digital libraries in their domain
library & information science: operations, studies of
users, use, usability
• number of types emerged
- 32. libraries & digital resources
• libraries (particularly research, academic & special)
directed massive funding toward such resources
electronic journals
databases
catalogs
digitization of parts of collection
• thus becoming in effect digital libraries – or
more accurately hybrid libraries
with graphic and digital versions or types of resources
- 33. emphasis here
• on large academic or research digital
libraries that also are related to searching
provide search capabilities or access to search
engines
provide electronic journals that provide full text
of articles after a search
• such libraries have also become search
portals of sorts, essential for their users
in education, research & related activities
- 34. sample
New York Public Library Digital
“NYPL Digital is your gateway to The New York Public Library’s rare and
unique collections in digitized form.” Includes access to searchable
databases
U California Berkeley Digital Library SunSITE
“builds digital collections and services while providing information
and support to digital library developers worldwide.”
The British Library
“The world’s knowledge.” Includes “Services for library and
information Professionals.”
Los Angeles Public Library Kids’ Path
resources for children; search through directory
- 35. sample …
New Zealand Digital Library
searching of a number of digital collections, including humanity
development library
Research Libraries Group
“RLG is a not-for-profit organization of over 150 research libraries,
archives, museums, and other cultural memory institutions.”
Includes links to a number of searchable collections
Public Library of Science
“PLoS is a nonprofit organization of scientists and physicians
committed to making the world's scientific and medical literature
a public resource.” Publishes open access journals
- 36. Rutgers libraries
– digital components
• strategic planning in developing digital access
• rich & complex content of digital resources
several hundred indexes & databases for searching
some 20,000 electronic journals
thousand & more digital reference sources
subject research guides
Searchpath & other tutorials
electronic reserve
• affected teaching, learning, research by the
whole community
- 37. some critical issues for searching
• no way yet to do federated searching in
digital libraries
to search several indexes at the same time
each source has to be searched separately
most have very different search features, capabilities
• finding items in indexes does not mean that
you are always able to get the full text
• thus, searching time-consuming, chaotic
- 38. where to find out?
• information about digital libraries
LibWeb U California, Berkeley
“lists currently over 7200 pages from libraries in over 125 countries”
Digital Library Federation
“a consortium of libraries and related agencies that are pioneering
the use of electronic-information technologies to extend their
collections and services”
D-Lib Magazine
“a solely electronic publication with a primary focus on digital library
research and development, including but not limited to new
technologies, applications, and contextual social and economic
issues”
- 39. where? …
Ariadne (UK)
“to report on information service developments and
information networking issues worldwide, keeping the busy
practitioner abreast of current digital library initiatives”
Information Technology and Libraries
ALA publication; “related to all aspects of libraries and
information technology, including digital libraries”
Journal of Digital Information
“Publishing papers on the management, presentation and uses
of information in digital environments”
Biblio Tech Review
“Information Technology for Libraries” – monthly news and
review magazine
- 40. in conclusion
• search engines are great but you have to
KNOW what is under the hood
as to coverage, business model, search features,
outputs …
they are NOT for every kind of information need
• digital libraries are great for searching but
you have to KNOW requirements for
searching different resources that are
included
there is no federated searching as yet, or for some
time to come