Fast & relevant search
solutions and trade-offs
Sylvain Utard & Marwan Burelle
Engineering @ Algolia
Agenda
1. Introduction
2. Relevant or and fast?
3. Implementation perspective
Introduction
Sylvain Utard ─ VP of Engineering
─ Joined Algolia as its 1st employee
─ Leading a team of 100+ engineers
─ Also
─ Full-stack engineer (C++, Java, Ruby, JS/TS)
─ Part-time Textmining teacher
Marwan Burelle ─ Senior Engineer @ Search Core
─ Focus on performance
─ Also
─ Lecturer & Researcher
─ Parallel prog., graph algorithms & malware classification
─ Search-as-a-service - REST Search API
─ 75B searches processed every month
─ 2000+ bare-metal servers (30k vCPUs, 300TB RAM, 2.5PB SSD)
─ Home-made search technology (mainly C++ & Nginx)
─ Company
─ Founded in 2013
─ 6 offices (Paris, SF, London, NYC, Atlanta & Tokyo)
─ 350 people (inc. 100 engineers, 25 products)
─ Series C in 2019, $110M (total $184M)
Algolia TL;DR
9k+ customers, 50k+ free users
Verticals: E-commerce, Media & Gaming, Technology / SaaS, Financial Services, Public Sector & Education, Services
─ Focusing on a very specific subset of what search engines are used for
─ User-facing search bars
─ Searching structured data
─ E-Com products, categories
─ Social Network profiles
─ SaaS app objects
─ Relevance is key
─ Today: only Google, Amazon, Netflix, Facebook, Apple are really nailing it
─ Bar keeps getting higher & higher
─ Users now expect this fast & relevant experience everywhere!
Consumer-grade search experience
Relevant or and fast?
─ Textmining/NLP research vs. the reality
─ Mostly about searching unstructured data (the web or academic datasets)
─ Where the search engine owner doesn’t necessarily own/control the data to search in
─ Text relevance through statistics might not always be the best option
─ Social network: searching for a common name (e.g. “Jonas Schmidt”) would be scored differently?
─ E-Com: searching for “iPhone” would score differently if “iPhone” is mentioned multiple times?
Focus on relevance
Focus on relevance
─ Algolia leverages a tie-breaking approach (see the sketch below)
─ 5 text-relevance criteria (inc. “attribute”)
─ 1 geo-relevance criterion
─ 1 filter-relevance criterion
─ Amend this list with
─ Your own business popularity
─ Your own score
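To make the tie-breaking idea concrete, here is a minimal C++ sketch: each hit gets a tuple of criteria and hits are compared lexicographically, so a later criterion only matters when all earlier ones are equal. The exact names and order below are assumptions for illustration, not Algolia's published formula.

```cpp
#include <array>
#include <cstdint>

// Illustrative per-hit ranking criteria: 5 text-relevance criteria, 1 geo,
// 1 filter, plus the customer-provided popularity and custom score.
// Values are normalized so that LOWER is always better; criteria where
// higher is better (words, exactness, popularity, custom score) are
// assumed to be stored inverted at scoring time.
struct RankingCriteria {
    uint32_t typos;        // number of typos needed to match
    uint32_t geoDistance;  // bucketed distance to the user (geo criterion)
    uint32_t words;        // inverted count of matched query words
    uint32_t filters;      // filter-relevance criterion (inverted)
    uint32_t proximity;    // distance between matched words in the record
    uint32_t attribute;    // rank of the best matching attribute
    uint32_t exactness;    // inverted count of exact (non-prefix) matches
    uint64_t popularity;   // inverted business popularity
    uint64_t customScore;  // inverted custom score
};

// Tie-breaking: compare criterion by criterion; a later criterion is only
// consulted when all previous ones are strictly equal.
bool betterThan(const RankingCriteria& a, const RankingCriteria& b) {
    auto key = [](const RankingCriteria& c) {
        return std::array<uint64_t, 9>{c.typos,     c.geoDistance, c.words,
                                       c.filters,   c.proximity,   c.attribute,
                                       c.exactness, c.popularity,  c.customScore};
    };
    return key(a) < key(b);  // lexicographic comparison implements the tie-break
}
```

Sorting hits with betterThan gives the text, geo and filter criteria absolute priority over popularity and custom score, which only break the remaining ties.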
─ Initially designed for mobile phones & search-as-you-type experiences
─ Prefix-search built-in (autocomplete)
─ Huge focus on typo-tolerance (fat fingers)
─ Had to run on an iPhone 3G (128 MB RAM & 412 MHz) or Android equivalents
─ Prefix-search is easy
─ Typo-tolerant search is easy
─ Fast search is easy
Focus on speed & exhaustivity: search-as-you-type
Combining all three while staying relevant is hard!
Focus on speed: hardware & network
─ Fine-tuned hardware & OS
─ A bunch of kernel optimizations (I/O scheduling, network & RAM/disk buffers)
─ Packaged as a static library & linked within nginx
─ Separate build process running on the same machines
─ Clusters of 3 machines (min) for HA
─ Master-master replication
─ High-end servers for performance
─ Worldwide (built-in) replication
─ 70 datacenters across 16 regions
─ Horizontal scalability @ search, wherever your end-users are
RAM: 64-128GB
Proc: 8-12 cores, 3.2-3.8 GHz
SSD: 2x 300-800 GB RAID-0
─ Exhaustivity is usually not required
─ … or actually not used
─ Users are really looking at the top hits (only)!
Focus on exhaustivity
Implementation perspective
Performance Oriented Design
● Technical choices: C++, nginx, infrastructure …
● Continuous performance evaluation
● Never stop looking for improvements
Performance drives the dev
● Always try to minimize computations at query time
○ TRIE designed for efficient retrieval, prefix matching and typo-tolerance
○ Inverted lists approach: links words to document ids, uses compressed integers (see the sketch below)
○ Extra indexing: facets, empty query cache (top 1000 records) ...
● Index format mapped directly in memory
● As you need typo tolerance
Performance drives design
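A hedged sketch of the "inverted lists with compressed integers" idea: document ids are kept sorted, delta-encoded, and written as variable-length bytes so that dense posting lists stay small. The varint format and helper names below are a generic illustration, not necessarily Algolia's on-disk format.

```cpp
#include <cstdint>
#include <vector>

// Append one unsigned integer as a variable-length byte sequence
// (7 payload bits per byte, high bit set on all but the last byte).
static void varintEncode(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Compress a sorted posting list (document ids for one word) with
// delta + varint encoding: small gaps between ids need very few bytes.
std::vector<uint8_t> compressPostingList(const std::vector<uint32_t>& sortedDocIds) {
    std::vector<uint8_t> out;
    uint32_t previous = 0;
    for (uint32_t docId : sortedDocIds) {
        varintEncode(docId - previous, out);
        previous = docId;
    }
    return out;
}
```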
Anatomy of a query
Prepare Index
Query Parsing
Collect hits
Post-processing
Formatting
● Unavoidable but relatively fast
● We keep indices hot in memory to minimize slow start
● TRIE and inverted index lists
● Subject to timeouts
● Dynamic ranking, aggregation (distinct), faceting ...
● Most expensive part
● Cannot be bounded
● Depends on query pagination
Only get what you need
● Most queries only require the first page, usually no more than 10 hits
● We limit hits retrieval in time and quantity
● Alternative forms (synonyms and typos) are searched only if we need more hits
● We use a linear approximation to estimate the number of hits when we reach those limits (see the sketch below)
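A minimal sketch of those two limits, assuming a flat list of candidate document ids and a hypothetical `matches` predicate (the real engine walks tries and posting lists): collection stops on a hit-count or time budget, and the total hit count is then extrapolated linearly from the fraction of candidates actually scanned.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

struct CollectResult {
    std::vector<uint32_t> hits;   // hits actually retrieved
    uint64_t estimatedTotal = 0;  // estimated number of matching hits
};

// Collect at most `wanted` hits within `budget`, then estimate the total
// number of hits by linear extrapolation over the scanned fraction.
template <typename Matches>
CollectResult collectBounded(const std::vector<uint32_t>& candidates,
                             Matches matches, size_t wanted,
                             std::chrono::milliseconds budget) {
    using Clock = std::chrono::steady_clock;
    const auto deadline = Clock::now() + budget;
    CollectResult r;
    size_t scanned = 0;
    for (uint32_t docId : candidates) {
        ++scanned;
        if (matches(docId)) r.hits.push_back(docId);
        const bool enough = r.hits.size() >= wanted;            // quantity limit
        const bool late = (scanned % 256 == 0) && Clock::now() > deadline;  // time limit
        if (enough || late) break;
    }
    // Linear approximation: hits found / fraction of candidates scanned.
    r.estimatedTotal = scanned == 0
        ? 0
        : r.hits.size() * candidates.size() / scanned;
    return r;
}
```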
Only get what you need
Results set → Enough hits? If not, go deeper:
─ Use alternatives
─ Accept more typos
─ …
─ Remove keywords
● Records are ranked at indexing time
● Internal document ids ordered by ranking
● Most relevant records are at the top of the list
● Bounding search returns relevant hits
Static ranking
Example: retrieving the 5 best results for two keywords (sketched in code below)
─ Keyword 1 posting list: 5, 53, 99, 204, 237, 402, 507, 661, 662, 666, 803
─ Keyword 2 posting list: 13, 53, 101, 204, 237, 408, 507, 666, 803, 990, 1031
─ Intersection (∩), stopping at 5 hits: 53, 204, 237, 507, 666
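Since internal ids are assigned in static-ranking order, intersecting the sorted posting lists and stopping at k hits directly yields the k best results. A minimal two-pointer sketch:

```cpp
#include <cstdint>
#include <vector>

// Intersect two posting lists sorted by internal id (i.e. by static rank)
// and stop as soon as k common documents are found: because better records
// have smaller ids, these k documents are already the k best hits.
std::vector<uint32_t> topKIntersection(const std::vector<uint32_t>& a,
                                       const std::vector<uint32_t>& b,
                                       size_t k) {
    std::vector<uint32_t> result;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size() && result.size() < k) {
        if (a[i] == b[j]) {
            result.push_back(a[i]);
            ++i; ++j;
        } else if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    return result;
}
```

With the two lists from the example above and k = 5, this returns 53, 204, 237, 507, 666 without ever reading the tails of either list.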
Numerical Facets Statistics
● Compute basic stats on numerical facets (min, max, average)
● Based on all records satisfying the current query
● Do it for all present facets (or a subset of them)
Sounds easy, doesn’t it?
The problem
Constraints:
● We are schemaless
● Input format doesn’t constrain numeric types (integers, floats …)
● We may have way more matching hits than retrieved ones
Consequences:
● Can’t easily identify numeric facets
● Requires a lot of conversions
● We need to scan all matching records
The problem
Current solution:
● Only compute stats for the first thousand hits (see the sketch below)
● Cost only depends on the number of facets
● Fast for a reasonable number of facets
● Works great in most cases!
Sampling
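A rough sketch of that sampling strategy, assuming a hypothetical `valueOf` accessor from internal document id to the facet's numeric value: stats are computed over at most the first thousand (already ranked) hits.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FacetStats {
    double min = 0, max = 0, avg = 0;
    uint64_t sampled = 0;  // how many hits the stats were computed on
};

// Compute min/max/average of one numeric facet over at most `sampleSize`
// of the top-ranked hits (1000 in the current solution described above).
template <typename ValueOf>
FacetStats sampledFacetStats(const std::vector<uint32_t>& rankedHits,
                             ValueOf valueOf, size_t sampleSize = 1000) {
    FacetStats s;
    s.sampled = std::min(sampleSize, rankedHits.size());
    if (s.sampled == 0) return s;
    double sum = 0;
    s.min = s.max = valueOf(rankedHits[0]);
    for (uint64_t i = 0; i < s.sampled; ++i) {
        const double v = valueOf(rankedHits[i]);
        s.min = std::min(s.min, v);
        s.max = std::max(s.max, v);
        sum += v;
    }
    s.avg = sum / static_cast<double>(s.sampled);
    return s;
}
```

As the next slide notes, such a sample can miss the true min/max, which is what motivated the pre-computed approach.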
● Static ranking may not correlate with all facets
○ Extreme values (min/max) may not belong to the samples
○ The samples may not be representative for average computation
● Still expensive with a large number of facets
● Most expensive operations:
○ Checking if a facet is numeric
○ Converting values to appropriate representation
Sampling
New version:
● Pre-computed statistics for empty queries
● Extra indexing information (see the sketch below)
○ Which facets are numeric
○ Decoded values in a suitable binary representation
=> 30% faster at query time, at the cost of less than 3% extra indexing time
=> No more inconsistent results
Pre-computing
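A hedged sketch of what such extra indexing information could look like (names and layout are hypothetical, not the real format): numeric facets are identified once, values are stored already decoded, and empty-query stats are pre-computed, so query time skips type detection and string conversion entirely.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Pre-computed at indexing time (illustrative layout).
struct NumericFacetStats {
    double min = 0, max = 0, sum = 0;
    uint64_t count = 0;
};

struct NumericFacetIndex {
    // Facets detected as numeric while indexing, so query time never has to
    // guess types again.
    std::vector<std::string> numericFacets;
    // Facet name -> values already decoded to a binary representation,
    // addressed by internal document id.
    std::unordered_map<std::string, std::vector<double>> decodedValues;
    // Stats pre-computed for the empty query (all records of the index).
    std::unordered_map<std::string, NumericFacetStats> emptyQueryStats;
};
```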
Personalized Search
Events ingestion & analytics
─ Events (clicks, conversions, views) are sent to an ingestion API
─ Each event carries a query ID and a user/group token
● Personalization profiles/scores indexed asynchronously
● Personalization profiles/scores stored using a TRIE for quick retrieval
● Ranking impact, if the query provides a user/group token (see the sketch below):
○ For each hit, match defined personalization facets to compute “personalization” scores
○ Use those scores as a new ranking criterion in the Tie-break formula
Query time personalization
Pipeline: events processing → TRIE storage → query execution
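A minimal sketch of the query-time side, with hypothetical names: the profile retrieved for the user/group token maps facet values to affinity weights, and the summed weight of a hit's matching facets becomes one more criterion in the tie-break formula.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical personalization profile, built asynchronously from clicks,
// conversions and views, and looked up by user/group token at query time.
using PersonalizationProfile = std::unordered_map<std::string, uint32_t>;

// Sum the affinities of the personalization facet values present on a hit.
// The result is injected as an extra tie-break criterion (inverted if the
// convention is "lower is better", as sketched earlier).
uint32_t personalizationScore(const std::vector<std::string>& hitFacetValues,
                              const PersonalizationProfile& profile) {
    uint32_t score = 0;
    for (const std::string& value : hitFacetValues) {
        auto it = profile.find(value);
        if (it != profile.end()) score += it->second;
    }
    return score;
}
```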
Takeaways
─ Speed & Relevance (incl. typo-tolerance) are both “easy”; mixing them is harder!
─ It’s a matter of trade-offs
─ Fast search-engine ≠ fast search experience: network latency!
─ Know your users: exhaustivity might be overrated!
─ Nothing is faster than your RAM (or CPU cache): make it fit!
─ You won’t have time: pre-compute it!
Takeaways
Q&A