Lucene @ Yelp

Sudarshan Gaikaiwari
Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp
Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits
The services we provide
Lucy: business search
Lucy also powers phone search
Cathy: she 'talks' a lot
Listsearch: it searches lists....
Reviewsearch: it searches reviews....
DYM: did you really mean that?
Suggest: auto completion
Federation Motivation
Problem

Search is too slow
Hard Disk Seek Latency
Disk seek: 10,000,000 ns

Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean
RAM read latency

Main memory reference: 100 ns
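
To put the two figures side by side, a quick back-of-the-envelope calculation helps. The sketch below uses only the two latencies quoted on the slides; the figure of 50 random accesses per query is a made-up number chosen purely for illustration.

```python
DISK_SEEK_NS = 10_000_000  # from the slide: one disk seek
RAM_REF_NS = 100           # from the slide: one main memory reference


def query_latency_ms(random_reads: int, per_read_ns: int) -> float:
    """Latency in milliseconds if a query needs `random_reads` random accesses."""
    return random_reads * per_read_ns / 1_000_000


ratio = DISK_SEEK_NS // RAM_REF_NS
reads = 50  # hypothetical number of random accesses per query

print(f"a disk seek costs as much as {ratio:,} RAM references")
print(f"{reads} accesses on disk: {query_latency_ms(reads, DISK_SEEK_NS):.1f} ms")
print(f"{reads} accesses in RAM:  {query_latency_ms(reads, RAM_REF_NS):.3f} ms")
```

Under these assumptions the gap is five orders of magnitude, which is the motivation for keeping the whole index in memory.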
Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/
Problem

Index is too large to fit in memory on a single machine
Geographical sharding
Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.
Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable
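
Why comparability matters can be seen with a small calculation. The sketch below assumes a classic Lucene-style IDF of the form 1 + ln(N / (df + 1)); the per-shard document counts are invented, and pooling the statistics across shards is only one possible way to get a shared score scale, not necessarily what Lucy does.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term on three shards:
# (documents in the shard, documents in the shard containing the term)
shard_stats = [(1000, 9), (1000, 99), (1000, 49)]

# Each shard scoring with its own statistics gives three different weights.
local = [idf(n, df) for n, df in shard_stats]

# Pooling the counts gives one weight that every shard could share.
total_docs = sum(n for n, _ in shard_stats)
total_df = sum(df for _, df in shard_stats)
shared = idf(total_docs, total_df)

print([round(x, 3) for x in local], round(shared, 3))
```

When documents are spread over shards without regard to content, as with sharding on business id, the per-shard statistics tend to sit much closer together than in this deliberately skewed example.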
Mapping businesses to shards

 1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems
1. Adding a new shard involves re-indexing all the businesses
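
The re-indexing cost of the modulo scheme can be measured directly. This is a minimal sketch: the business ids and shard names are invented, and an MD5 digest stands in for `hash` so that the mapping is stable across processes.

```python
import hashlib


def shard_for(business_id: str, shardlist: list) -> str:
    """shard = shardlist[hash(business_id) % len(shardlist)], with a stable hash."""
    digest = int(hashlib.md5(business_id.encode()).hexdigest(), 16)
    return shardlist[digest % len(shardlist)]


def fraction_moved(ids, old_shards, new_shards) -> float:
    """Share of ids whose shard changes when the shard list changes."""
    moved = sum(shard_for(i, old_shards) != shard_for(i, new_shards) for i in ids)
    return moved / len(ids)


ids = [f"biz-{n}" for n in range(10_000)]  # hypothetical business ids
four = ["shard0", "shard1", "shard2", "shard3"]
five = four + ["shard4"]

print(f"{fraction_moved(ids, four, five):.0%} of businesses change shard")
```

Going from four to five shards, an id stays put only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one id in five. The rest have to be re-indexed on a different shard.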
Virtual Nodes
Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
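
A sketch of the virtual-node idea follows. The bucket count, the round-robin starting layout and the `add_shard` policy are all assumptions made for illustration; the slides only state that vbuckets can be moved between shards.

```python
import hashlib

NUM_VBUCKETS = 1024  # hypothetical; fixed for the lifetime of the index


def vbucket_for(business_id: str) -> int:
    """A business always hashes to the same vbucket, whatever the shard count."""
    digest = int(hashlib.md5(business_id.encode()).hexdigest(), 16)
    return digest % NUM_VBUCKETS


def initial_assignment(shards):
    """Spread vbuckets round-robin over the starting shards."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}


def add_shard(assignment, new_shard):
    """Give the new shard an even share of vbuckets; all others stay in place."""
    shards = sorted(set(assignment.values()))
    share = NUM_VBUCKETS // (len(shards) + 1)
    donors = {s: [vb for vb, owner in assignment.items() if owner == s]
              for s in shards}
    updated = dict(assignment)
    for i in range(share):
        donor = shards[i % len(shards)]
        updated[donors[donor].pop()] = new_shard
    return updated


def shard_for(business_id, assignment):
    return assignment[vbucket_for(business_id)]


before = initial_assignment(["shard0", "shard1", "shard2", "shard3"])
after = add_shard(before, "shard4")
moved = [vb for vb in before if before[vb] != after[vb]]
print(f"{len(moved)} of {NUM_VBUCKETS} vbuckets moved")
```

Only the businesses in the reassigned vbuckets need to be re-indexed, roughly a fifth of the data here, compared with about four fifths under plain modulo hashing. Splitting a hot shard is the same operation with a single donor.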
Lucy Master Slave Architecture

Separate indexing (masters)
A master for each shard of a service

Searching (slaves)
A slave for every replica of a service
Lucy Indexing
Lucy Searching
Federator: Combining results across shards
1. Once we distribute an index across shards we need a component which will search all these shards and combine their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
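
The fan-out and merge pattern can be sketched in a few lines. To stay within the standard library, the sketch uses `asyncio` in place of the Tornado IO loop, and in-memory lists in place of real shard servers. The method name `search`, the parameter names and the use of JSON-RPC 2.0 framing are assumptions, not the actual Lucy protocol.

```python
import asyncio
import heapq
import json


def jsonrpc_request(method, params, request_id):
    """Serialize a JSON-RPC 2.0 request (the protocol version is an assumption)."""
    return json.dumps({"jsonrpc": "2.0", "method": method,
                       "params": params, "id": request_id})


async def query_shard(shard, payload):
    """Stand-in for the network call: `shard` is a list of (score, business_id)."""
    request = json.loads(payload)
    await asyncio.sleep(0)  # yield to the event loop, as a socket read would
    return heapq.nlargest(request["params"]["num_hits"], shard)


async def federate(shards, query, num_hits):
    """Send the same request to every shard concurrently, then merge by score."""
    payload = jsonrpc_request("search", {"query": query, "num_hits": num_hits}, 1)
    per_shard = await asyncio.gather(*(query_shard(s, payload) for s in shards))
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(num_hits, merged)


shards = [
    [(0.91, "biz-a"), (0.40, "biz-d")],
    [(0.75, "biz-b"), (0.10, "biz-e")],
    [(0.88, "biz-c")],
]
top = asyncio.run(federate(shards, "pizza", 3))
print(top)
```

Merging by raw score only works if scores from different shards are on the same scale, which is why comparable TF-IDF scores were listed as a requirement for federation.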
Lucy Server
Tokens to Business Attributes
Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
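
The two steps might look like the following sketch. The record layout and attribute names are invented, and counting attributes over every matching business (rather than only the top hits) is an assumption.

```python
import heapq
from collections import Counter


def execute(matches, num_hits, attributes=("category", "place")):
    """Return the top `num_hits` matches by score plus per-attribute counts."""
    top = heapq.nlargest(num_hits, matches, key=lambda m: m["score"])
    stats = {a: Counter(m[a] for m in matches) for a in attributes}
    return top, stats


# Hypothetical matches for one query
matches = [
    {"id": "biz-a", "score": 0.91, "category": "pizza", "place": "SoMa"},
    {"id": "biz-b", "score": 0.75, "category": "pizza", "place": "Mission"},
    {"id": "biz-c", "score": 0.88, "category": "italian", "place": "SoMa"},
    {"id": "biz-d", "score": 0.40, "category": "pizza", "place": "SoMa"},
]

top, stats = execute(matches, num_hits=2)
print([m["id"] for m in top])
print(stats["category"].most_common())
```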
Lucene

1. Efficiently executes queries over the index
2. Scores how relevant the business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is a work in progress
Successive geobounds relaxation
Federation
Efficiently Retrieving top k hits

1. As the user moves through multiple pages, the number of hits to be retrieved increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
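
The naive approach can be written down as a baseline. In this sketch each shard is an in-memory list of scores, and the five shards of 1,000 hits each are invented data.

```python
import heapq


def naive_page(shards, start, count):
    """Fetch start + count hits from every shard, merge, and slice out the page.

    Returns the requested page and the total number of hits fetched.
    """
    num_hits = start + count
    per_shard = [heapq.nlargest(num_hits, shard) for shard in shards]
    fetched = sum(len(hits) for hits in per_shard)
    merged = heapq.nlargest(num_hits, (h for hits in per_shard for h in hits))
    return merged[start:], fetched


# Hypothetical index: scores 0..4999 dealt round-robin to five shards
shards = [list(range(s, 5000, 5)) for s in range(5)]

page, fetched = naive_page(shards, start=490, count=10)
print(page)
print(f"fetched {fetched} hits to show {len(page)}")
```

With five shards, reaching hit 500 means pulling 2,500 hits over the network, and the cost grows with every page the user turns.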
Distribution of hits in shards
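Probability a hit is in a shard

p = 1/n, where n is the number of shards

Binomial Distribution
Probability that r of the top k hits are in a particular shard

P(X = r) = \binom{k}{r} p^r (1 - p)^{k - r}

Mean

\mu = kp

Variance

\sigma^2 = kp(1 - p)

Std Deviation

\sigma = \sqrt{kp(1 - p)}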
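
The binomial quantities can be evaluated numerically. The sketch below uses k = 100 and p = 0.2, the values given for the simulation. The multiples of the standard deviation (1, 3 and 6) are an inference: they are chosen because they reproduce the per-shard counts 24, 32 and 44 that appear in the simulation table.

```python
import math


def binomial_pmf(r: int, k: int, p: float) -> float:
    """P(exactly r of the top k hits live on one particular shard)."""
    return math.comb(k, r) * p**r * (1 - p) ** (k - r)


def mean_and_std(k: int, p: float):
    return k * p, math.sqrt(k * p * (1 - p))


def tail_probability(limit: int, k: int, p: float) -> float:
    """P(X > limit): the shard holds more of the top k than we asked it for."""
    return sum(binomial_pmf(r, k, p) for r in range(limit + 1, k + 1))


k, p = 100, 0.2
mean, std = mean_and_std(k, p)
limits = [round(mean + n_sigma * std) for n_sigma in (1, 3, 6)]

for limit in limits:
    print(limit, f"{tail_probability(limit, k, p):.2e}")
```

The tail probability is the chance that a given shard is asked for too few hits. It falls off quickly as the limit moves away from the mean, which is what makes it safe to request far fewer than k hits from each shard.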
Simulation


k = 100, p = 0.2

    Formula           Hits selected from each shard   Results Missed (%)
    \mu + \sigma      24                              0.017
    \mu + 3\sigma     32                              0.0001407
    \mu + 6\sigma     44                              0.00000
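
A Monte Carlo version of this experiment is easy to reproduce. The sketch below is not the simulation used for the table: it assumes each of the top k hits lands on one of five shards uniformly at random (p = 0.2), and it counts a hit as missed when its shard holds more of the top k than the per-shard limit. The trial count and seed are arbitrary.

```python
import random
from collections import Counter


def missed_fraction(k, num_shards, per_shard_limit, trials=2000, seed=0):
    """Average fraction of the true top k lost by asking each shard for
    only `per_shard_limit` hits."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        # Where does each of the top k hits live?
        counts = Counter(rng.randrange(num_shards) for _ in range(k))
        # A shard returns at most `per_shard_limit`; the excess is lost.
        missed += sum(max(c - per_shard_limit, 0) for c in counts.values())
    return missed / (k * trials)


for limit in (24, 32, 44):
    print(limit, missed_fraction(100, 5, limit))
```

The function returns a fraction rather than a percentage. Under these assumptions the result for a limit of 24 falls in the same range as the 0.017 in the table, which suggests the table's last column may also be a fraction.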
Simulation Graph
Results

1. ~50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)
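
The savings follow from simple arithmetic, sketched below. The only assumption is that the baseline is the naive approach of requesting all k hits from every shard. The number of shards cancels out of the ratio.

```python
def savings(k: int, per_shard_hits: int) -> float:
    """Fraction of hits no longer fetched compared with asking each shard for k.

    With n shards: (n * k - n * per_shard_hits) / (n * k) = 1 - per_shard_hits / k
    """
    return 1 - per_shard_hits / k


print(f"k = 100,  44 per shard:  {savings(100, 44):.1%}")
print(f"k = 1000, 228 per shard: {savings(1000, 228):.1%}")
```

This gives 56% for 100 hits, in line with the rounded "~50%" on the slide, and 77.2% for 1000 hits.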
Future work

1. In memory index
2. Move towards real time search
Come Join Us!
Thank You

smg@yelp.com
