7. The Parts of a Search Engine
Spider (or “crawler”)
Indexer
Search software (an algorithm)
8. The “spider” or “crawler”
The spider visits a web page, reads it, and then
follows links to other pages within the site. This is
what it means when someone refers to a site being
"spidered" or "crawled". This is also known as
“harvesting”. The spider returns to the site on a
regular basis, such as every month or two, to look for
changes.
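A minimal sketch of the spider's "visit, read, follow links within the site" loop. To stay runnable without a network, it crawls a hypothetical in-memory site (`PAGES`); the names `LinkParser`, `crawl`, and the example URLs are illustrative assumptions, not part of the original slides. A real spider would also fetch over HTTP, respect robots.txt, and schedule periodic revisits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="http://other.org/">Other</a>',
    "http://example.com/about": '<a href="/">Home</a> <a href="/contact">Contact</a>',
    "http://example.com/contact": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl that only follows links within the starting site."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)  # returns None when the page is missing
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


crawled = crawl("http://example.com/", PAGES.get)
```

The link to other.org is skipped because it points outside the site, and the `seen` set keeps the spider from revisiting the home page.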
10. The indexer
Everything the spider finds goes into the second part
of a search engine, the index. The index, sometimes
called the catalog, is like a giant book containing a
copy of every web page that the spider finds. If a web
page changes, then this book is updated with the new
information.
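One way to picture the "book is updated" step is a store that keeps a copy of each page and replaces it only when the content has changed. This is a sketch under assumptions: the class name `PageStore` and the use of a SHA-256 digest to detect changes are illustrative choices, not something the slides specify.

```python
import hashlib


class PageStore:
    """Keeps one copy of each page, keyed by URL."""

    def __init__(self):
        self._pages = {}  # url -> (content digest, content)

    def update(self, url, content):
        """Store the page; return True if it was new or had changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        old = self._pages.get(url)
        if old is not None and old[0] == digest:
            return False  # unchanged since the last visit
        self._pages[url] = (digest, content)
        return True

    def get(self, url):
        entry = self._pages.get(url)
        return entry[1] if entry else None


store = PageStore()
first = store.update("http://example.com/", "hello world")   # new page
second = store.update("http://example.com/", "hello world")  # same content
third = store.update("http://example.com/", "hello again")   # changed content
```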
11. Simple Index Diagram
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
13. But It's Not
Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index
Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance
Display results
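The three stages above (index ahead of time, match query terms, sort by relevance) can be sketched with a small inverted index. The sample documents, the function names, and the scoring rule are assumptions made for illustration: relevance here is simply the total count of query terms in a document, which is far cruder than what production engines use.

```python
import re
from collections import Counter, defaultdict

# Hypothetical documents standing in for files or records found on disk.
DOCS = {
    "doc1": "search engines index pages ahead of time",
    "doc2": "the index maps each word to pages",
    "doc3": "spiders crawl pages; the index stores each word, and search uses the index",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index ahead of time: map each word to {doc_id: occurrence count}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Match query terms against the index and sort documents by score."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_index(DOCS)
results = search(index, "index search")
```

Because the index is built before any query arrives, a search only touches the entries for the query terms rather than rereading every document.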
14. Search is Mostly Invisible
Like an iceberg, 2/3 below water
[Diagram labels: content, search functionality, user interface]
15. How Search Engines Work
1) They collect information from selected web sites.
2) They employ special software robots, called spiders, to
crawl web pages.
3) Spiders build lists of the words found on web sites.
   • While a spider is building its lists, it is Web crawling.
4) Spiders store the lists in the engine’s database.
5) The engine’s indexing software builds an index of words.
6) Information is matched against query input and
retrieved (processing algorithm).
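Step 3, building a list of the words found on a page, might look like the sketch below. It uses Python's standard `html.parser` to pull visible text out of HTML; the class name `TextExtractor`, the function `word_list`, and the choice to ignore `<script>` and `<style>` content are assumptions for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def word_list(html):
    """Return the sorted, de-duplicated words found in a page's text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))


words = word_list(
    "<html><style>p {color: red}</style>"
    "<p>Water lilies in a <b>pond</b></p></html>"
)
```

The resulting list is what the spider would hand to the engine's database in step 4, ready for the indexing software in step 5.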
38. Traditional text-based image search engines
• Manual annotation of images
• Use text-based retrieval methods
Example annotations:
• Water lilies
• Flowers in a pond
• <Its biological name>
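Text-based image retrieval over manual annotations can be sketched as ordinary keyword matching against the captions. The file names, the `ANNOTATIONS` table, and the overlap-count ranking below are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical manual annotations: image file -> list of captions.
ANNOTATIONS = {
    "img_001.jpg": ["water lilies", "flowers in a pond"],
    "img_002.jpg": ["mountain lake at sunrise"],
    "img_003.jpg": ["pond with ducks"],
}


def search_images(annotations, query):
    """Rank images by how many query words appear in their annotations."""
    terms = set(query.lower().split())
    hits = []
    for image, captions in annotations.items():
        words = set(" ".join(captions).lower().split())
        overlap = len(terms & words)
        if overlap:
            hits.append((overlap, image))
    # Most overlapping words first; ties broken by file name.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [image for _, image in hits]


matches = search_images(ANNOTATIONS, "pond flowers")
```

The search never looks at pixels, so an image can only be found through the words an annotator chose for it.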
39. QBIC – Search by color
** Images courtesy : Yong Rao