SIMS 202
Information Organization
      and Retrieval


Prof. Marti Hearst and Prof. Ray Larson
           UC Berkeley SIMS
       Tues/Thurs 9:30-11:00am
               Fall 2000



      Uploaded by: CarAutoDriver
Last Time
Web Search
– Directories vs. Search engines
– How web search differs from other search
   » Type of data searched over
   » Type of searches done
   » Type of searchers doing search
– Web queries are short
   » This probably means people are often using search
     engines to find starting points
   » Once at a useful site, they must follow links or use
     site search
– Web search ranking combines many features
What about Ranking?
Lots of variation here
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
–   Term frequencies
–   Term proximities
–   Term position (title, top of page, etc)
–   Term characteristics (boldface, capitalized, etc)
–   Link analysis information
–   Category information
–   Popularity information
Most use a variant of vector space ranking to
combine these
Here’s how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
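A minimal sketch of the "weights times counts" idea above. The feature names and weight values are made up for illustration; real engines keep theirs proprietary, and this is not any particular engine's formula.

```python
# Hypothetical feature names and weights, chosen only for illustration.
WEIGHTS = {
    "term_frequency": 1.0,
    "title_match": 3.0,
    "boldface": 0.5,
    "inlinks": 2.0,
}


def score(feature_counts, weights=WEIGHTS):
    """Dot product of the weight vector and one page's feature counts."""
    return sum(weights[name] * feature_counts.get(name, 0) for name in weights)


def rank(pages, weights=WEIGHTS):
    """Return page ids ordered from highest to lowest score."""
    return sorted(pages, key=lambda page_id: score(pages[page_id], weights),
                  reverse=True)


# Toy per-page feature counts for a single query (invented data).
pages = {
    "a.html": {"term_frequency": 4, "title_match": 0, "boldface": 2, "inlinks": 1},
    "b.html": {"term_frequency": 2, "title_match": 1, "boldface": 0, "inlinks": 3},
}

for page_id in rank(pages):
    print(page_id, score(pages[page_id]))
```

Changing a single weight (for example, raising "title_match") reorders the results without touching the counts, which is one reason rankings can fluctuate so easily.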
From description of the NorthernLight search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
High-Precision Ranking

Proximity search can help get high-precision
results when the query has more than one term
– Hearst ’96 paper:
  » Combine Boolean and passage-level proximity
  » Shows significant improvements when
    retrieving the top 5, 10, 20, 30 documents
  » Results reproduced by Mitra et al. 98
  » Google uses something similar
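One simple way to picture "Boolean plus passage-level proximity": require every query term (a Boolean AND) to occur inside the same short window of words. This is only a sketch under that assumption, not the exact formulation from the Hearst '96 paper; the function names, the window size, and the toy documents are all hypothetical.

```python
def within_window(tokens, terms, window):
    """True if every term occurs inside some run of `window` consecutive tokens."""
    wanted = set(terms)
    # Slide a fixed-size window over the document; short documents get one window.
    for start in range(max(1, len(tokens) - window + 1)):
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


def filter_by_proximity(docs, query, window=10):
    """Keep only the documents where all query terms co-occur in one window."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        if within_window(text.lower().split(), terms, window):
            hits.append(doc_id)
    return hits


# Invented documents: both contain "cars" and "sale", but only d1 has them close together.
docs = {
    "d1": "used cars for sale in berkeley",
    "d2": ("cars are discussed at length here but the word sale "
           "appears only much later after many other words"),
}

print(filter_by_proximity(docs, "cars sale", window=5))
```

A plain Boolean AND would return both documents; the window constraint drops the one where the terms are far apart.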
Boolean Formulations, Hearst 96



Results
Spam

Email Spam:
– Undesired content
Web Spam:
– Content is disguised as something it is
  not, in order to
  » Be retrieved more often than it otherwise
    would
  » Be retrieved in contexts that it otherwise
    would not be retrieved in
Web Spam
What are the types of Web spam?
– Add extra terms to get a higher ranking
   » Repeat “cars” thousands of times
– Add irrelevant terms to get more hits
   » Put a dictionary in the comments field
   » Put extra terms in the same color as the background
     of the web page
– Add irrelevant terms to get different types of
  hits
   » Put “sex” in the title field in sites that are selling
     cars
– Add irrelevant links to boost your link analysis
  ranking
There is a constant “arms race” between
web search companies and spammers
Commercial Issues
General internet search is often
commercially driven
– Commercial sector sometimes hides things –
  harder to track than research
– On the other hand, most CTOs for search
  engine companies used to be researchers, and
  so help us out
– Commercial search engine information changes
  monthly
– Sometimes motivations are commercial rather
  than technical
   » Goto.com uses payments to determine ranking order
   » iwon.com gives out prizes
Web Search Architecture

Preprocessing
– Collection gathering phase
  » Web crawling
– Collection indexing phase
Online
– Query servers
– This part not talked about in the
  readings
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → DocIds → create an inverted index → inverted index → search engine servers. The user query goes to the search engine servers, which show results to the user.]
More detailed architecture, from Brin & Page 98.
Only covers the preprocessing in detail, not the query serving.
Inverted Indexes for Web Search Engines

Inverted indexes are still used, even
though the web is so huge
Some systems partition the indexes across
different machines; each machine handles
different parts of the data
Other systems duplicate the data across
many machines; queries are distributed
among the machines
Most do a combination of these
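The two strategies above can be sketched together in a few lines. This is a toy, in-memory illustration only: the `Cluster` class, the round-robin replica choice, and the modulo document assignment are assumptions made for the example, not a description of any real engine.

```python
import itertools
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


class Cluster:
    """Toy cluster: documents are partitioned, and each partition is replicated."""

    def __init__(self, docs, num_partitions, num_replicas):
        # Partition: each document is assigned to exactly one shard.
        shards = [dict() for _ in range(num_partitions)]
        for i, (doc_id, text) in enumerate(sorted(docs.items())):
            shards[i % num_partitions][doc_id] = text
        # Duplicate: every shard's index is copied onto num_replicas "machines".
        self.machines = [
            [build_inverted_index(shard) for _ in range(num_replicas)]
            for shard in shards
        ]
        self._next_replica = itertools.cycle(range(num_replicas))

    def search(self, term):
        # Each query is served by one replica set, but must visit every partition.
        replica = next(self._next_replica)
        hits = set()
        for shard_replicas in self.machines:
            hits |= shard_replicas[replica].get(term.lower(), set())
        return hits


docs = {
    "p1": "web search engines",
    "p2": "web crawlers",
    "p3": "inverted index",
    "p4": "search index",
}
cluster = Cluster(docs, num_partitions=2, num_replicas=3)
print(sorted(cluster.search("search")))
```

Partitioning lets the collection grow beyond one machine; replication lets more queries run at once, since successive queries land on different copies.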
In this example, the data for the pages is partitioned across machines.
Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second

Each column can handle 7M pages

To handle more queries,
add another row.
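The row and column figures on this slide turn into simple sizing arithmetic. The sketch below uses the slide's two constants; the 300M-page collection and the 500 queries-per-second load are hypothetical inputs chosen only to show the calculation.

```python
import math

QUERIES_PER_SECOND_PER_ROW = 120   # figure from the slide
PAGES_PER_COLUMN = 7_000_000       # figure from the slide


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed for the given collection and load."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)      # more pages -> more columns
    rows = math.ceil(peak_qps / QUERIES_PER_SECOND_PER_ROW)  # more queries -> more rows
    return rows, columns, rows * columns


# Hypothetical load: 300 million pages, 500 queries per second at peak.
rows, columns, machines = grid_size(300_000_000, 500)
print(rows, columns, machines)
```

The machine count is the product of the two dimensions, so growing the collection and growing the query load each multiply the cost of the other.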




                 From description of the FAST search engine, by Knut Risvik
            http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Cascading Allocation of CPUs
A variation on this that produces a
cost-savings:
– Put high-quality/common pages on many
  machines
– Put lower quality/less common pages on
  fewer machines
– Query goes to high quality machines
  first
– If no hits found there, go to other
  machines
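The cascade can be pictured as an ordered list of index tiers that are tried in turn. This is a sketch under that assumption; the tier contents and the function name are invented for the example.

```python
def cascade_search(term, tiers):
    """Try each tier in order and return the first non-empty result list.

    `tiers` is a list of dicts mapping a term to a list of page ids,
    ordered from the high-quality tier to the lower-quality ones.
    """
    for tier in tiers:
        hits = tier.get(term, [])
        if hits:
            return hits  # found something: the later tiers are never asked
    return []


# Invented tiers: the first would live on many machines, the second on few.
high_quality = {"berkeley": ["www.berkeley.edu"]}
low_quality = {"berkeley": ["some-mirror.example/berkeley"],
               "sims202": ["old-course-page.example/sims202"]}

print(cascade_search("berkeley", [high_quality, low_quality]))
print(cascade_search("sims202", [high_quality, low_quality]))
```

The saving comes from the common case: most queries are answered by the first tier, so the rarely used tier can run on less hardware.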
Web Crawlers

How do the web search engines get all
of the items they index?
Main idea:
–   Start with known sites
–   Record information for these sites
–   Follow the links from each site
–   Record information found at new sites
–   Repeat
Web Crawlers
How do the web search engines get all of
the items they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
   » Take the first page off of the queue
   » If this page has not yet been processed:
        Record the information found on this page
          – Positions of words, links going out, etc
        Add each link on the current page to the queue
        Record that this page has been processed
In what order should the links be followed?
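The queue-based procedure above, written out as a sketch. To keep it runnable without a network, `get_links` is a hypothetical stand-in for fetching a page and extracting its links, and the link graph is a toy. Taking pages from the front of the queue gives breadth-first order; taking them from the back gives depth-first order.

```python
from collections import deque


def crawl(seeds, get_links, breadth_first=True):
    """Visit pages reachable from `seeds`; return them in the order processed."""
    frontier = deque(seeds)          # the queue of pages waiting to be visited
    processed = set()
    order = []
    while frontier:
        # Front of the queue = breadth-first; back of the queue = depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in processed:
            continue
        order.append(page)           # stand-in for recording words, links, etc.
        frontier.extend(get_links(page))
        processed.add(page)
    return order


# Toy link graph standing in for the web (invented data).
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}


def get_links(page):
    return LINKS.get(page, [])


print(crawl(["A"], get_links, breadth_first=True))
print(crawl(["A"], get_links, breadth_first=False))
```

The `processed` set is what keeps the crawler from looping forever when pages link back to each other.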
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html




                       Structure to be traversed
Page Visit Order (cont.)
– Breadth-first search [animation in the original slides]
– Depth-first search [animation in the original slides]
Depth-First Crawling
(more complex – graphs & sites)
[Figure: link graph over Site 1 (Pages 1-6), Site 2 (Pages 1-3), Site 3 (Page 1), Site 5 (Pages 1-2), Site 6 (Page 1)]
Visit order:
Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3
Breadth-First Crawling
(more complex – graphs & sites)
[Figure: link graph over Site 1 (Pages 1-6), Site 2 (Pages 1-3), Site 3 (Page 1), Site 5 (Pages 1-2), Site 6 (Page 1)]
Visit order:
Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
Web Crawling Issues
Keep out signs
– A file called robots.txt (the Robots Exclusion Standard) tells the
  crawler which directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc
– Convert page contents with a hash function
– Compare new pages to the hash table
Lots of problems
–   Server unavailable
–   Incorrect html
–   Missing links
–   Infinite loops
Web crawling is difficult to do robustly!
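A sketch of the duplicate check described above: hash each page's contents and look the hash up in a table of pages already seen. The whitespace and case normalization and the choice of SHA-1 are assumptions made for the example, and an exact hash like this only catches identical copies, not near-duplicates.

```python
import hashlib


def content_fingerprint(html):
    """Hash the page contents after collapsing whitespace and lowercasing."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Hash table from content fingerprint to the first URL seen with it."""

    def __init__(self):
        self.seen = {}

    def check(self, url, html):
        """Return the URL of an earlier identical page, or None if this one is new."""
        key = content_fingerprint(html)
        if key in self.seen:
            return self.seen[key]
        self.seen[key] = url
        return None


# Invented example: the same page served under two host names (virtual hosts).
dupes = DuplicateFilter()
print(dupes.check("http://www.example.edu/a.html", "<p>Hello   world</p>"))
print(dupes.check("http://example.edu/a.html", "<p>hello world</p>"))
```

Storing only the fingerprint, rather than the page text, keeps the table small enough to consult for every newly fetched page.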
Cha-Cha

Cha-cha searches an intranet
– Sites associated with an organization
Instead of hand-edited categories
– Computes shortest path from the root
  for each hit
– Organizes search results according to
  which subdomain the pages are found in
Cha-Cha Web Crawling Algorithm
Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
– *.berkeley.edu
Obey the robots exclusion standard (robots.txt)
Follow hyperlinks only
– do not read local filesystems
   » links are placed on a queue
   » traversal is breadth-first
See first lecture or the technical papers for
more information
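The two admission rules above (stay inside *.berkeley.edu, obey robots.txt) can be sketched with Python's standard library. This is not Cha-Cha's actual code: the helper names, the agent string, and the sample robots.txt rules are hypothetical, and a real crawler would load one robots.txt per host rather than share a single rule set.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_domain(url, domain="berkeley.edu"):
    """True for the domain itself and any subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def make_robots(lines):
    """Build a robots.txt rule set from lines that were already fetched."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def may_crawl(url, robots, agent="example-crawler"):
    """A link is queued only if it is in the domain and robots.txt allows it."""
    return in_domain(url) and robots.can_fetch(agent, url)


# Invented robots.txt contents for illustration.
robots = make_robots(["User-agent: *", "Disallow: /private/"])

for url in ["http://www.berkeley.edu/index.html",
            "http://www.berkeley.edu/private/notes.html",
            "http://www.example.com/index.html"]:
    print(url, may_crawl(url, robots))
```

A check like this would run on each extracted link before it is placed on the breadth-first queue.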
Summary
Web search differs from traditional IR
systems
– Different kind of collection
– Different kinds of users/queries
– Different economic motivations
Ranking combines many features in a
difficult-to-specify manner
– Link analysis and proximity of terms seem
  especially important
– This is in contrast to the term-frequency
  orientation of standard search
   » Why?
Summary (cont.)

Web search engine architecture
– Similar in many ways to standard IR
– Indexes usually duplicated across
  machines to handle many queries quickly
Web crawling
– Used to create the collection
– Can be guided by quality metrics
– Is very difficult to do robustly
Web Search Statistics
Charts in the original slides (information from searchenginewatch.com unless noted):
– Searches per Day (info missing for fast.com, Excite, Northernlight, etc.)
– Web Search Engine Visits
– Percentage of web users who visit the site shown
– Search Engine Size (July 2000)
– Does size matter? You can’t access many hits anyhow.
– Increasing numbers of indexed pages, self-reported
– Increasing numbers of indexed pages (more recent), self-reported
– Web Coverage
– Chart from description of the FAST search engine, by Knut Risvik
  http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
– Directory sizes
