+




    Engineering Challenges
    in Vertical Search Engines
    Aleksandar Bradic, Senior Director,
    Engineering and R&D, Vast.com
+
    Introduction

        Vertical Search
             Search focused on vertical data
             Vertical Data – data inherently described by its structure:
                Items/Properties for sale (Automotive, Real Estate..)

                  Geographical Data (Neighborhoods, Locations..)
                  Services (Hotels, Transportation..)
                  Businesses (Restaurants, Nightlife..)
                  Events (Concerts, Plays..)
                  Auction items (Collectibles, Art..)
                  Metadata (News, Social Data, Reviews..)
                  …
+
    Introduction

        Vertical Search != Full Text Search
             Full Text Search queries:
                “Cheap tickets for Broadway shows this week”
                “Trendy Restaurants in San Francisco near SoMa”
                “3-day trips from NYC to anywhere under $1000”
             Vertical Search queries:
                “price-sorted results below two standard deviations from tickets
                 category with Broadway as location and date range of 2010-04-11 to
                 2010-04-18”
                “distance-sorted results relative to center of SF/SoMa matching the
                 appropriate threshold of composite score of user review scores and
                 historical change in query/review volume”
                “total cost-sorted results for all 3-day intervals within next 6 months
                 combining hotel and airfare price below max value of $1000 for all
                 valid locations”
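
The first vertical query above can be sketched as a filter plus a sort over structured records. This is only an illustration under stated assumptions: the `Ticket` fields and the `cheap_tickets` helper are hypothetical, and "below two standard deviations" is read here as "price lower than the subset mean minus two standard deviations", which is one possible interpretation of the slide.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Ticket:
    # Hypothetical record layout; the slides do not define a schema.
    title: str
    location: str
    event_date: date
    price: float


def cheap_tickets(tickets, location, start, end, k=2.0):
    """Return price-sorted tickets for a location and date range whose price
    is more than k standard deviations below the mean price of that subset."""
    subset = [t for t in tickets
              if t.location == location and start <= t.event_date <= end]
    if len(subset) < 2:
        return []  # not enough data to talk about a deviation
    prices = [t.price for t in subset]
    threshold = mean(prices) - k * pstdev(prices)
    return sorted((t for t in subset if t.price < threshold),
                  key=lambda t: t.price)
```

The point of the sketch is that every clause of the query (category, location, date range, statistical threshold, sort key) maps to an operation on typed fields rather than to keyword matching.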
+
    Introduction

        Vertical Search = search on structured data

        Vertical Search at Web-Scale:
             Web-Scale datasets
             Web-Scale query volumes
             Interactive operation
             Low latency requirements
             Utility maximization across all involved parties

        => loads of fun ! : )
+
    @Vast.com

        Vast.com : Vertical Search & Analytics Platform

        Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
         Airlines, etc.
+
    @Vast.com

        Daily processing of up to 1 TB of unstructured and
         semi-structured Web data

        Managing an operational dataset of ~150M records across multiple
         verticals

        Handling peak search loads of > 1000 queries/sec



        We’re hiring ! : )
+
    Challenges in Vertical Search
    Engines
        Web Data Retrieval

        Unstructured Data

        Data Processing Infrastructures

        Vertical Search

        Data Analytics

        Computational Advertising
+
    Web Data Retrieval

        Crawler Architecture
             Queue Management
             Crawl Ordering Policies
             Duplicate URL Detection
             Content Hash Management
             Politeness Management
             Coverage Measurement
             Freshness Optimization
             Incremental Crawling
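
Several of the crawler concerns above (queue management, duplicate URL detection, content hash management, politeness) can be illustrated with a small in-memory frontier. This is a minimal sketch, not a description of any production crawler: the `Frontier` class, its `delay` parameter and the fixed per-host spacing policy are all assumptions made for the example.

```python
import hashlib
import heapq
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case scheme and host, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class Frontier:
    """Toy crawl queue with URL dedup, content dedup and per-host spacing."""

    def __init__(self, delay=1.0):
        self.delay = delay        # assumed minimum gap between hits per host
        self.seen_urls = set()
        self.seen_hashes = set()
        self.next_allowed = {}    # host -> earliest time of the next fetch
        self.heap = []            # entries are (ready_time, sequence, url)
        self.seq = 0

    def add(self, url, now=0.0):
        """Enqueue a URL unless it was already seen. Returns True if queued."""
        url = normalize(url)
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        host = urlsplit(url).netloc
        ready = max(now, self.next_allowed.get(host, 0.0))
        self.next_allowed[host] = ready + self.delay
        heapq.heappush(self.heap, (ready, self.seq, url))
        self.seq += 1
        return True

    def pop(self, now):
        """Return the next URL that may be fetched at time `now`, or None."""
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[2]
        return None

    def is_new_content(self, body):
        """Exact-duplicate check on page bytes via a SHA-256 digest."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

A real crawler would persist these structures, honor robots.txt, and use near-duplicate detection rather than exact hashes; the sketch only shows how the listed sub-problems interact in one queue.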
+
    Web Data Retrieval

        “Deep Web” crawling
             Locating Deep Web Content Sources
             Selecting Relevant Sources
             Estimating Database Size
             Understanding Content / Form Detection
             Automatic Dispatch of HTML Forms
             Predicting content in free text forms
             Crawling non-HTML Content
             Estimating Query Result Sparsity
             URL Generation problem
             Query Covering Problem
+
    Web Data Retrieval

        Focused (Topical) Crawling
             Content Classification
             Link Content Prediction
             Topic Relevance Estimation

        Modeling Temporal Characteristics
             Site-Level Evolution
             Page-Level Evolution

        Adversarial Crawling
             Web Spam Detection
             Cloaked Content Detection
+
    Unstructured Data

        Unstructured Data – information that does not have a
         pre-defined data model

        Handling Unstructured Data:
             Data Cleaning
             Tagging with Metadata
             Vertical Classification
             Schema Matching
             Information Extraction


    Example post:
        Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

    Target fields:
        make    model    year    trim           price    ???
        Ford    Focus    2008    Convertible    $7000    Absolute Beauty !!!!
+
    Unstructured Data

        Information extraction from unstructured, ungrammatical
         data
             Reference Sets - relational data sets that consist of a collection of
              known entities with associated common attributes
             Reference Set Selection
             Reference Set Generation
             Record Linkage : Finding the “best matching” member of the
              reference set for a given post
             Challenge : Automatic Generation of Reference Sets
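
Reference-set-based extraction can be sketched in a few lines. The example below is an assumption-laden illustration: `REFERENCE_SET` is a tiny hand-written, hypothetical reference set, the token-overlap score is a deliberately naive stand-in for a real record linkage model, and the regular expressions for year and price are simplifications.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokens(text):
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set):
    """Return (entity, score) for the reference entity whose attribute
    tokens overlap most with the post. Score is the matched fraction."""
    post_tokens = tokens(post)
    best, best_score = None, 0.0
    for entity in reference_set:
        entity_tokens = tokens(" ".join(entity.values()))
        score = len(post_tokens & entity_tokens) / len(entity_tokens)
        if score > best_score:
            best, best_score = entity, score
    return best, best_score


def extract(post, reference_set):
    """Build a structured record from a free-text post."""
    entity, _ = link_record(post, reference_set)
    record = dict(entity) if entity else {}
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group(0))
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

Applied to the example post from the previous slide, `extract` fills make, model, trim, year and price; the leftover text ("Absolute Beauty") is exactly the part that has no slot in the schema.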
+
    Data Processing Infrastructures

        Infrastructures for continuous processing of unbounded streams
         of unstructured data
        Information Extraction as part of processing (non-trivial
         computation for each processed entry)

        Inherently distributed infrastructures - in order to support
         performance and scalability

        Time-to-site constraints. Ability to process out-of-band data.

        Support for complex operations on aggregated data
         (de-duplication, static ranking, data enrichment, data cleaning/
         filtering …)

        Support for data archival and off-line analysis
+
    Data Processing Infrastructures

        Distributed Computing Platforms:

             Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

             Stream-oriented (Flume, S4, Stream SQL…)

             Distributed Data Stores (Dynamo/Cassandra/Riak…)

        The curse of the CAP Theorem:
             It is impossible for a distributed system to simultaneously provide
              all three of the following guarantees:
                Consistency
                Availability
                Partition tolerance
+
    Vertical Search

        Large-Scale structured data search

        Providing both analytic and canonical Information Retrieval
         functionality

        Entries are represented in the Vector Space Model

        Each result is represented as a data point – a tuple consisting of
         the appropriate number of fields :

         (make, model, year, trim …)
+
    Vertical Search

        Search in Vector Space Model
             Resulting subset generation
             Sorting as linearization using selected metric
             Dynamic subset criteria calculation
             Search Result Clustering
             “Similar” result search
             …



… with response times of up to ~100 ms
… at 10M+ records in the index
… handling 100+ queries/sec/host
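
“Similar” result search in a vector space can be sketched as a nearest-neighbor lookup over scaled numeric fields. The sketch below makes several assumptions that are not in the slides: only numeric fields are used, each field is min-max scaled over the candidate set, and similarity is plain Euclidean distance computed by brute force (a real index at 10M+ records would need something smarter).

```python
import math


def to_vector(record, fields, ranges):
    """Scale each numeric field to roughly [0, 1] using its (min, max)."""
    vec = []
    for f in fields:
        lo, hi = ranges[f]
        vec.append((record[f] - lo) / (hi - lo) if hi > lo else 0.0)
    return vec


def similar(query, records, fields, k=3):
    """Return the k records closest to `query` in the scaled field space."""
    ranges = {f: (min(r[f] for r in records), max(r[f] for r in records))
              for f in fields}
    q = to_vector(query, fields, ranges)
    scored = [(math.dist(q, to_vector(r, fields, ranges)), r) for r in records]
    scored.sort(key=lambda pair: pair[0])
    return [r for _, r in scored[:k]]
```

Sorting by distance to a query point is also one concrete instance of "sorting as linearization using a selected metric": the multi-dimensional result set is flattened into a single ordering.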
+
    Vertical Search

        Faceted Search
             fac-et (fas’it) :
                1. One of the flat polished surfaces cut on a gemstone or occurring
                 naturally on a crystal.
                2. One of numerous aspects, as of a subject.


             Vocabulary problem for faceted data
             Facet Design / selection
                “the keywords that are assigned by indexers are often at
                  odds with those tried by searchers.”
                Selection of information-distinguishing facet values
             User-specific faceted search
             Dynamic correlated facet generation
             Distributing facet computation
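
At its core, faceted search needs value counts per facet over the current result set. The following is a minimal single-machine sketch; the `facet_counts` helper and the dict-based record layout are assumptions for illustration, and it ignores the harder problems listed above (facet selection, correlated facets, distribution).

```python
from collections import Counter, defaultdict


def facet_counts(records, facet_fields, filters=None):
    """Count facet values over the records that match all `filters`.

    records      -- iterable of dicts
    facet_fields -- field names to compute counts for
    filters      -- optional {field: required_value} already chosen by the user
    """
    filters = filters or {}
    counts = defaultdict(Counter)
    for rec in records:
        if all(rec.get(f) == v for f, v in filters.items()):
            for field in facet_fields:
                if field in rec:
                    counts[field][rec[field]] += 1
    return {field: dict(c) for field, c in counts.items()}
```

In a distributed setting the per-shard `Counter` objects could be summed to get global counts, which is one simple way to approach "distributing facet computation".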
+
    Data Analytics

        Clickstream Data Analysis

        Learning from implicit user feedback

        Anonymous user clustering

        Learning to rank

        Inventory/Market Trends

        Rare Event detection

        Price Prediction

        Spam Content detection
+
    Data Analytics

        Challenges:
             “Good Deal” detection
             Recommendation Systems for Vertical Data with no explicit user
              feedback
             Accuracy of Automatic Valuation Models
             Data-driven feature design
             Click Prediction
             User Behavior Modeling
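
One naive baseline for “Good Deal” detection is to compare each listing's price with the median price of comparable listings. The sketch below is only that baseline, under assumptions not stated in the slides: "comparable" means equal `make`, `model` and `year`, a deal is a price at least 15% under the group median, and groups with fewer than three listings are skipped as too sparse.

```python
from collections import defaultdict
from statistics import median


def good_deals(listings, key_fields=("make", "model", "year"),
               discount=0.15, min_comparables=3):
    """Return listings priced at least `discount` below the median price
    of their comparable group (same values for `key_fields`)."""
    groups = defaultdict(list)
    for listing in listings:
        groups[tuple(listing[f] for f in key_fields)].append(listing)

    deals = []
    for group in groups.values():
        if len(group) < min_comparables:
            continue  # too few comparables to estimate a typical price
        typical = median(l["price"] for l in group)
        deals.extend(l for l in group
                     if l["price"] <= (1 - discount) * typical)
    return deals
```

A production valuation model would account for mileage, condition, location and market trends; the baseline mainly shows why the accuracy of the underlying price estimate drives the quality of the "deal" signal.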
+
    Computational Advertising

        The central problem of computational advertising is to find
         the "best match" between a given user in a given context and a
         suitable advertisement.




    [Figure: page layout with two "ads" regions placed around the "search results" area]
+
    Computational Advertising

        Vertical Search presents an additional challenge in the sense
         that any of the actual search results can be “sponsored”




    [Figure: search results list in which individual results are marked "ad ?"]
+
    Computational Advertising

        Central challenge:
             Find the “best match” between a given user in a given context
              and a suitable advertisement
             “best match” – maximizing the value for :
                  Users
                  Advertisers
                  Publishers
             Each of the parties has a different set of utilities:
                Users want relevance

                  Advertisers want ROI and volume
                  Publishers want revenue per impression/search
+
    Computational Advertising

        CTR (Click-Through Rate) Estimation:
             Reactive (statistically significant historical CTR)
             Predictive (CTR estimated from features of ads)
             Hybrid (historical + predictive)


             Personalization of CTR Computation ?
             Dynamic CTR Estimation (online algorithms)




                                  P(click) = ?
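
The hybrid approach can be illustrated with a simple smoothed estimate: the feature-based prediction acts as a prior, and observed clicks gradually override it as impressions accumulate. This is one common smoothing scheme chosen for illustration, not necessarily the method the slides have in mind; `prior_strength` (the number of pseudo-impressions given to the prediction) is an assumed tuning parameter.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical and predicted CTR.

    The prediction is treated as `prior_strength` pseudo-impressions, so:
      - with no history the estimate equals the predicted CTR;
      - with a lot of history it approaches clicks / impressions.
    """
    if not 0.0 <= predicted_ctr <= 1.0:
        raise ValueError("predicted_ctr must be a probability")
    if clicks < 0 or impressions < clicks:
        raise ValueError("need 0 <= clicks <= impressions")
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)
```

For example, an ad with 50 clicks in 900 impressions and a predicted CTR of 0.02 gets (50 + 2) / (900 + 100) = 0.052, slightly below its raw historical rate of about 0.056.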
+
    Computational Advertising

        Analytical Apparatus:
             Regression Analysis (Linear, Logistic, probit model, High
              Dimensional methods)
             Game Theory (Nash Equilibria, dominant strategy)
             Auction Theory (Vickrey, GSP, VCG…)
             Graph Theory (random walks on graphs, graph matching, etc.)
             Information Retrieval Techniques (similarity metrics, etc.)
             …
+
    Conclusion

        Vertical Search & Analytics at Web Scale == fun !!!

        Source of a large number of relevant research & engineering
         problems !

        Opportunity to tackle a wide spectrum of techniques across all
         areas of Computer Science and Engineering !




                                       Jump on the bandwagon ! : )
