Challenges in Running a Commercial Web Search Engine

Amit Singhal
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Introduction
• Crawling
  – Follow links to find information
• Indexing
  – Record what words appear where
• Ranking
  – What information is a good match to a user query?
  – What information is inherently good?
• Displaying
  – Find a good format for the information
• Serving
  – Handle queries, find pages, display results
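
As a rough illustration of the indexing step ("record what words appear where"), the sketch below builds a tiny positional inverted index and answers a query by intersecting posting lists. The function names `build_index` and `search` and the sample pages are assumptions made for this example; a production engine would add tokenization, compression, and ranking on top.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the pages and word positions where it appears."""
    index = defaultdict(dict)  # word -> {page_id: [positions]}
    for page_id, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(pos)
    return index


def search(index, query):
    """Return ids of pages containing every query word (no ranking yet)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result


# Hypothetical two-page "web" used only for illustration.
pages = {
    "a": "web search engines crawl the web",
    "b": "search engines rank pages",
}
index = build_index(pages)
print(search(index, "search engines"))
print(index["web"])
```

Storing positions alongside page ids is one possible design choice; it is what would later allow phrase matching and keyword-in-context snippets.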
History
• The web happened (1992)
• Mosaic/Netscape happened (1993-95)
• Crawler happened (1994): M. Mauldin
• SEs happened 1994-1996
   – InfoSeek, Lycos, Altavista, Excite, Inktomi, …
• Yahoo decided to go with a directory
• Google happened 1996-98
   – Tried selling technology to other engines
   – SEs thought search was a commodity, portals were in
• Microsoft said: whatever …
Present
• Most search engines have vanished
• Google is a big player
• Yahoo decided to de-emphasize directories
   – Buys three search engines
• Microsoft realized Internet is here to stay
   – Dominates the browser market
   – Realizes search is critical
History
• Early systems were Information Retrieval based
  – Infoseek, Altavista, …
• Information Retrieval
  – Field started in the 1950s
  – Primarily focused on text search
  – Had already written off directories (1960s)
  – Mostly uses statistical methods to analyze text
History
• IR necessary but not sufficient for web
  search
• Doesn’t capture authority
  – To IR, an article hosted on the BBC looks no better than a
    slightly modified copy on john-doe-news.com
• Doesn’t address web navigation
  – Query ibm seeks www.ibm.com
  – To IR www.ibm.com may look less topical than a
    quarterly report
History

• But there are links
  – Long history in citation analysis
  – Navigational tools on the web
  – Also a sign of popularity
  – Can be thought of as recommendations (source
    recommends destination)
  – Also describe the destination: anchor text
History

• Link analysis
  – Hubs and authorities (Jon Kleinberg)
     • Topical links exploited
     • Query time approach
  – PageRank (Brin and Page)
     • Computed on the entire graph
     • Query independent
     • Faster if serving lots of queries
  – Others…
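
To make the "computed on the entire graph, query independent" point concrete, here is a minimal power-iteration sketch of a PageRank-style score. It is not Google's production method: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages; links maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A link acts as a recommendation: split rank among targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed treatment of a page with no out-links:
                # spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Because the scores depend only on the link graph, they can be computed offline once and reused for every query, which is the serving advantage the slide points to.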
History
• Google showed link analysis can make
  a huge difference and is practical too
  – Everyone else followed

• Then there is the secret sauce
  – Link analysis
  – Information retrieval
  – Anchor text
  – Other stuff
History
• Interfaces
  – Many alternatives existed/exist
     • Simple ranked list
     • Keywords in context snippets (Google first SE to do this)
     • Topics/query suggestion tools (e.g. Vivisimo, Teoma)
     • Graphical, 2-D, 3-D
  – Simple and clean preferred by users
     • Like relevance ranking
     • Like keywords in context snippets
End Product
• As of today
  – Users give a 2-4 word query
  – SE gives a relevance ranked list of web pages
  – Most users click only on the first few results
  – Few users go below the fold
     • Whatever is visible without scrolling down

  – Far fewer ask for the next 10 results
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Oh No … This is REAL
• 80% of users use search engines to find sites
Enter the Greedy Spammer
• Users follow search results
• Money follows users, spam follows …
• There is value in getting ranked high
  – Affiliate programs
     • Siphon traffic from SEs to Amazon/eBay/…
         – Make a few bucks
     • Siphon traffic from SEs to a Viagra seller
         – Make $6 per sale
     • Siphon traffic from SEs to a porn site
         – Make $20-$40 per new member
Big Money
• Let’s do the math
• How much can the spam industry make
  by spamming search engines?
  – Assume 500M searches/day on the web
     • All search engines combined
  – Assume 5% commercially viable
     • Much more if you include porn queries
  – Assume $0.50 made per click (from 5c to $40)
  – $12.5M/day or about $4.5 Billion/year
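
The arithmetic behind the estimate above can be written out step by step. The three inputs are the slide's own assumptions; the variable names, and the implicit simplification that each commercially viable search yields one paid click, are added here for illustration.

```python
# Inputs taken from the slide's stated assumptions.
searches_per_day = 500_000_000   # all search engines combined
commercial_fraction = 0.05       # share of searches that are commercially viable
revenue_per_click = 0.50         # dollars; the slide cites a range of 5c to $40

# Simplifying assumption for this sketch: one paid click per commercial search.
commercial_searches = searches_per_day * commercial_fraction
daily_revenue = commercial_searches * revenue_per_click
yearly_revenue = daily_revenue * 365

print(f"${daily_revenue / 1e6:.1f}M per day")
print(f"${yearly_revenue / 1e9:.2f}B per year")
```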
How?
• Defeat IR
  – Keyword stuffing
  – A crawler declares that it is an SE spider
  – Spammers dish us an “optimized” page
But that should be easy…
• Just detect keyword density
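
A naive keyword-density check might look like the sketch below. The whitespace tokenization, the `looks_stuffed` helper, and the 0.3 threshold are hypothetical choices for this example, not values from the talk, and a check this simple is easy for a spammer to evade.

```python
from collections import Counter


def keyword_density(text):
    """Return each word's share of the total word count."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def looks_stuffed(text, threshold=0.3):
    """Flag a page when a single word exceeds the (assumed) density threshold."""
    density = keyword_density(text)
    return bool(density) and max(density.values()) > threshold


print(looks_stuffed("cheap pills cheap pills cheap pills buy cheap"))
print(looks_stuffed("the quick brown fox jumps over the lazy dog"))
```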
But that is easy too…
• Just detect that page is not about query
Legitimate NLP Parse
• Noun phrase to noun phrase
But links should help…
• No one should link to these bad sites
  – Expired domains
     • The owner of a legitimate domain doesn’t renew it
     • Spammers grab it, it already has tons of incoming links
     • E.g., anchor text for
         – The War on Freedom
         – The War on Freedom:
           How and Why America
           was attacked
         – The War on Freedom
Get Links
Guestbooks
Get Links
Mailing lists
Get Links
Link Exchange
State of Affairs
• There is big money in spamming SEs
• Easy to get links from good sites
• Easy to generate search-algorithm-friendly
  pages
• Any technique can be and will be
  attacked by spammers
• Have to make sense out of this chaos
We counter it well
• Most SEs are still very useful
  – Used over 500 million times every day
     • All search engines put together

• Our internal measurements show that
  we are winning
• Still need to be watchful
And then…
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Information Retrieval
• Test collection paradigm of evaluation
  – Static collection of documents (few million)
  – A set of queries (around 50-100)
  – Relevance judgments
  – Exhaustive judgments not possible (100 × 1,000,000)
  – Use pooling
       • Pool top 1000 results from various techniques
       • Assume all possible relevant documents judged
       • Biased against revolutionary new methods
          – Judge new documents if needed
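
The pooling idea, and the bias it introduces, can be sketched as follows. The run names, document ids, pool depth of 2 (instead of 1000), and relevance judgments are all made up for illustration; `build_pool` and `precision_at_k` are hypothetical helper names.

```python
def build_pool(runs, depth=1000):
    """Union of the top `depth` results from each system; only these get judged."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k judged relevant; unjudged documents count as misses."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Hypothetical runs from two existing systems.
runs = {
    "sys_a": ["d1", "d2", "d3", "d4"],
    "sys_b": ["d2", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)
judged_relevant = {"d1", "d5"}  # hypothetical judgments over the pool

# A new method retrieves d7, which was never pooled and so never judged.
new_run = ["d7", "d1", "d8", "d5"]
print(sorted(pool))
print(precision_at_k(new_run, judged_relevant, 2))
```

Here `d7` is scored as non-relevant only because no earlier system retrieved it, which is the sense in which pooling is biased against new methods unless the new documents are judged as well.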
On the Web
• Collection is dynamic
  – 10-20% urls change every month
  – Spam methods are dynamic
  – Need to keep the collection recent
• Queries are also time sensitive
  – Topics are hot then not
  – Need to keep a representative sample
On the Web
• Search space is HUGE
  – Over 200 million queries a day
  – Over 100 million are unique
  – Need 2700 queries for a 5% (700 for 10%) improvement to
    be meaningful at 95% confidence
• Search space is varied
  – Serve 90 different languages
  – Can’t have a catastrophic failure in any
  – Monitoring every part of the system is non-trivial
• IR style evaluation
  – Incredibly expensive
  – Always out of date
On the Web
• But what about user behavior?
  – You can use clicks as supervision.
• Clicks
  – Incredibly noisy
  – A click on a result does not mean a vote for it
     • The destination may just be a traffic peddler
     • User taken to some other site
     • If anything, this (clicked) result was BAD
Blue and Gold Fleet
We do Very Well
• Continually evaluate our system
  – In multiple languages
  – Tests valid over large traffic
  – Caught many possible disasters
• Constantly launch changes/products
  – Stemming, Google News, Froogle, Usenet, …
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
  – Finding Needles in a 20 TB Haystack, 200M times per day
Past

1995 research project at Stanford University
Lego Disk Case

One of our earliest storage systems
Peak of google.stanford.edu
Growth
• Nov. 98: 10,000 queries on 25 computers
• Apr. 99: 500,000 queries on 300 computers
• Sept. 99: 3M queries on 2,100 computers
Servers 1999
Datacenters now



       And 3 days later…
Where the users are…
What can we learn…

• Structure of Web
• Interests of Users
• Trends and Fads
• Languages
• Concepts
• Relationships
Spelling Correction: Britney Spears
Google

• Ethics
  – No pay for inclusion (in index)
  – No pay for placement (in ranking)
  – Clearly demarcated results and ads
  – 20% engineer time doing random stuff
    • Out came News, Froogle, Orkut
  – Users come first
Recent launches…
Recent launches…
Some perks…
Our Chef Charlie…
Thank You…


Amit Singhal

Search Engine Google

  • 1. Challenges in Running a Commercial Web Search Engine Amit Singhal
  • 2. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 3. Introduction • Crawling – Follow links to find information • Indexing – Record what words appear where • Ranking – What information is a good match to a user query? – What information is inherently good? • Displaying – Find a good format for the information • Serving – Handle queries, find pages, display results
  • 4. History • The web happened (1992) • Mosaic/Netscape happened (1993-95) • Crawler happened (1994): M. Mauldin • SEs happened 1994-1996 – InfoSeek, Lycos, Altavista, Excite, Inktomi, … • Yahoo decided to go with a directory • Google happened 1996-98 – Tried selling technology to other engines – SEs though search was a commodity, portals were in • Microsoft said: whatever …
  • 5. Present • Most search engines have vanished • Google is a big player • Yahoo decided to de-emphasize directories – Buys three search engines • Microsoft realized Internet is here to stay – Dominates the browser market – Realizes search is critical
  • 6. History • Early systems were Information Retrieval based – Infoseek, Altavista, … • Information Retrieval – Field started in the 1950s – Primarily focused on text search – Already had written off directories (1960s) – Mostly uses statistical methods to analyze text
  • 7. History • IR necessary but not sufficient for web search • Doesn’t capture authority – Same article hosted on BBC as good as a slightly modified copy on john-doe-news.com • Doesn’t address web navigation – Query ibm seeks www.ibm.com – To IR www.ibm.com may look less topical than a quarterly report
  • 8. History • But there are links – Long history in citation analysis – Navigational tools on the web – Also a sign of popularity – Can be thought of as recommendations (source recommends destination) – Also describe the destination: anchor text
  • 9. History • Link analysis – Hubs and authorities (Jon Kleinberg) • Topical links exploited • Query time approach – PageRank (Brin and Page) • Computed on the entire graph • Query independent • Faster if serving lots of queries – Others…
  • 10. History • Google showed link analysis can make a huge difference and is practical too – Everyone else followed • Then there is the secret sauce – Link analysis – Information retrieval – Anchor text – Other stuff
  • 11. History • Interfaces – Many alternatives existed/exist • Simple ranked list • Keywords in context snippets (Google first SE to do this) • Topics/query suggestion tools (e.g. Vivisimo, Teoma) • Graphical, 2-D, 3-D – Simple and clean preferred by users • Like relevance ranking • Like keywords in context snippets
  • 12. End Product • As of today – Users give a 2-4 word query – SE gives a relevance ranked list of web pages – Most users click only on the first few results – Few users go below the fold • Whatever is visible without scrolling down – Far fewer ask for the next 10 results
  • 13. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 14. Oh No … This is REAL • 80% of users use search engines to find sites
  • 15. Enter the Greedy Spammer • Users follow search results • Money follows users, spam follows … • There is value in getting ranked high – Affiliate programs • Siphon traffic from SEs to Amazon/eBay/… – Make a few bucks • Siphon traffic from SEs to a Viagra seller – Make $6 per sale • Siphon traffic from SEs to a porn site – Make $20-$40 per new member
  • 16. Big Money • Let’s do the math • How much can the spam industry make by spamming search engines? – Assume 500M searches/day on the web • All search engines combined – Assume 5% commercially viable • Much more if you include porn queries – Assume $0.50 made per click (from 5c to $40) – $12.5M/day or about $4.5 Billion/year
  • 17. How? • Defeat IR – Keyword stuffing – Crawler declares that it is an SE spider – Spammers dish us an “optimized” page
  • 18. But that should be easy… • Just detect keyword density
  • 19. But that is easy too… • Just detect that page is not about query
  • 20. Legitimate NLP Parse • Noun phrase to noun phrase
  • 21. But links should help… • No one should link to these bad sites – Expired domains • The owner of a legitimate domain doesn’t renew it • Spammers grab it, it already has tons of incoming links • E.g., anchor text for – The War on Freedom – The War on Freedom: How and Why America was attacked – The War on Freedom
  • 25. State of Affairs • There is big money in spamming SEs • Easy to get links from good sites • Easy to generate search algorithm friendly pages • Any technique can be and will be attacked by spammers • Have to make sense out of this chaos
  • 26. We counter it well • Most SEs are still very useful – Used over 500 million times every day • All search engines put together • Our internal measurements show that we are winning • Still need to be watchful
  • 28. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 29. Information Retrieval • Test collection paradigm of evaluation – Static collection of documents (few million) – A set of queries (around 50-100) – Relevance judgments – Extensive judgments not possible (100x1,000,000) – Use pooling • Pool top 1000 results from various techniques • Assume all possible relevant documents judged • Biased against revolutionary new methods – Judge new documents if needed
  • 30. On the Web • Collection is dynamic – 10-20% urls change every month – Spam methods are dynamic – Need to keep the collection recent • Queries are also time sensitive – Topics are hot then not – Need to keep a representative sample
  • 31. On the Web • Search space is HUGE – Over 200 million queries a day – Over 100 million are unique – Need 2700 queries for a 5% (700 for 10%) improvement to be meaningful at 95% confidence • Search space is varied – Serve 90 different languages – Can’t have a catastrophic failure in any – Monitoring every part of the system is non-trivial • IR style evaluation – Incredibly expensive – Always out of date
  • 32. On the Web • But what about user behavior? – You can use clicks as supervision. • Clicks – Incredibly noisy – A click on a result does not mean a vote for it • The destination may just be a traffic peddler • User taken to some other site • If anything, this (clicked) result was BAD
  • 33. Blue and Gold Fleet
  • 34. We do Very Well • Continually evaluate our system – In multiple languages – Tests valid over large traffic – Caught many possible disasters • Constantly launch changes/products – Stemming, Google News, Froogle, Usenet, …
  • 35. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google – Finding Needles in a 20 TB Haystack, 200M times per day
  • 36. Past • 1995 research project at Stanford University
  • 37. Lego Disk Case • One of our earliest storage systems
  • 39. Growth • Nov. 98: 10,000 queries on 25 computers • Apr. 99: 500,000 queries on 300 computers • Sept. 99: 3M queries on 2,100 computers
  • 41. Datacenters now And 3 days later…
  • 42. Where the users are…
  • 43. What can we learn… • Structure of Web • Interests of Users • Trends and Fads • Languages • Concepts • Relationships
  • 45. Google • Ethics – No pay for inclusion (in index) – No pay for placement (in ranking) – Clearly demarcated results and ads – 20% engineer time doing random stuff • Out came News, Froogle, Orkut – Users come first
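
The back-of-envelope estimate on slide 16 ("Big Money") can be reproduced in a few lines of Python. This is only a sketch of the slide's arithmetic. The function and parameter names are made up for illustration, and it assumes (as the slide implicitly does) exactly one paid click per commercially viable search.

```python
def estimate_spam_revenue(searches_per_day, commercial_fraction, revenue_per_click):
    """Return (daily, yearly) revenue in dollars.

    Assumption (not stated on the slide): every commercially viable
    search yields exactly one paid click.
    """
    commercial_searches = searches_per_day * commercial_fraction
    daily = commercial_searches * revenue_per_click
    yearly = daily * 365
    return daily, yearly


# Figures taken from slide 16.
daily_revenue, yearly_revenue = estimate_spam_revenue(
    searches_per_day=500_000_000,   # all search engines combined
    commercial_fraction=0.05,       # 5% commercially viable
    revenue_per_click=0.50,         # dollars; slide quotes a range of 5c to $40
)

print(f"per day:  ${daily_revenue / 1e6:.1f}M")
print(f"per year: ${yearly_revenue / 1e9:.2f}B")
```

With the slide's inputs this gives $12.5M per day and roughly $4.56B per year, in line with the "about $4.5 Billion/year" quoted on the slide.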
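
Slide 9 describes PageRank as computed on the entire graph and query independent. A minimal sketch of that idea is the textbook power iteration below. It is not Google's production method. The damping factor of 0.85, the even redistribution of rank from pages with no outgoing links, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    damping=0.85 is an assumed, commonly used value.
    """
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for src in pages:
            outs = links.get(src, [])
            if outs:
                # A link acts as a recommendation: the source splits its
                # rank evenly among the destinations it links to.
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no out-links spreads its rank
                # evenly over all pages.
                share = damping * rank[src] / n
                for page in pages:
                    new_rank[page] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The scores depend only on the link graph, so they can be computed once offline and reused for every query. That is the contrast slide 9 draws with the query-time hubs-and-authorities approach.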