Implementing a fast search experience is possible; implementing a relevant search experience is attainable; but having both at the same time is hard. In this presentation, we explain how Algolia combines the two, the performance impact of certain relevance requirements, and the trade-offs involved.
4. Marwan Burelle ─ Senior Engineer @ Search Core
─ Focus on performance
─ Also
  ─ Lecturer & researcher
  ─ Parallel prog., graph algorithms & malware classification
Sylvain Utard ─ VP of Engineering
─ Joined Algolia as its 1st employee
─ Leading a team of 100+ engineers
─ Also
  ─ Full-stack engineer (C++, Java, Ruby, JS/TS)
  ─ Part-time text-mining teacher
5. Algolia TL;DR
─ Search-as-a-service - REST Search API
─ 75B searches processed every month
─ 2000+ bare-metal servers (30k vCPUs, 300TB RAM, 2.5PB SSD)
─ Home-made search technology (mainly C++ & Nginx)
─ Company
  ─ Founded in 2013
  ─ 6 offices (Paris, SF, London, NYC, Atlanta & Tokyo)
  ─ 350 people (inc. 100 engineers, 25 products)
  ─ Series C in 2019: $110M (total $184M)
6. 9k+ customers, 50k+ free users
E-commerce · Media & Gaming · Technology/SaaS · Financial Services · Public Sector & Education · Services
7. Consumer-grade search experience
─ Focusing on a very specific subset of what search engines are used for
─ User-facing search bars
─ Searching structured data
  ─ E-com products, categories
  ─ Social-network profiles
  ─ SaaS app objects
─ Relevance is key
  ─ Today, only Google, Amazon, Netflix, Facebook & Apple are really nailing it
  ─ The bar keeps getting higher and higher
  ─ Users now expect this fast & relevant experience everywhere!
9. Focus on relevance
─ Text-mining/NLP research vs. reality
  ─ Mostly about searching unstructured data (the web or academic datasets)
  ─ Where the search-engine owner doesn't necessarily own/control the data being searched
─ Text relevance through statistics might not always be the best option
  ─ Social network: should searching for a common name (e.g. "Jonas Schmidt") be scored differently?
  ─ E-com: should "iPhone" score higher when it is mentioned multiple times?
10. Focus on relevance
─ Algolia leverages a tie-breaking approach
  ─ 5 text-relevance criteria (inc. "attribute")
  ─ 1 geo-relevance criterion
  ─ 1 filter-relevance criterion
─ Amend this list with
  ─ Your own business popularity
  ─ Your own score
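The tie-breaking approach can be sketched as a criterion-by-criterion comparison: a later criterion only matters when all earlier ones are tied. A minimal illustration (not Algolia's actual code; the fixed criterion count and the "higher is better" normalization are assumptions):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical per-hit scores, one slot per ranking criterion, in tie-break
// order (e.g. typos, geo, filters, custom score), normalized so that
// HIGHER is always better.
using Scores = std::array<int, 4>;

// Tie-breaking: compare criterion by criterion; a later criterion is only
// consulted when all earlier criteria are equal.
bool rankedBefore(const Scores& a, const Scores& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) return a[i] > b[i];
    }
    return false;  // fully tied
}
```

A business criterion (popularity, custom score) simply becomes one more slot in the array, without touching the comparison logic.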
11. Focus on speed & exhaustivity: search-as-you-type
─ Initially designed for mobile phones & search-as-you-type experiences
  ─ Prefix search built in (autocomplete)
  ─ Huge focus on typo-tolerance (fat fingers)
  ─ Had to run on an iPhone 3G (128 MB RAM & 412 MHz) or Android equivalents
─ Prefix search is easy
─ Typo-tolerant search is easy
─ Fast search is easy
─ Combining all three while staying relevant is hard!
12. Focus on speed: hardware & network
─ Fine-tuned hardware & OS
  ─ A bunch of kernel optimizations (I/O scheduling, network & RAM/disk buffers)
  ─ Packaged as a static library & linked within nginx
  ─ Separate build process running on the same machines
─ Clusters of 3 machines (min) for HA
  ─ Master-master replication
  ─ High-end servers for performance
─ Worldwide (built-in) replication
  ─ 70 datacenters across 16 regions
  ─ Horizontal scalability at search time, wherever your end-users are
Typical server: 64-128 GB RAM · 8-12 cores @ 3.2-3.8 GHz · 2x 300-800 GB SSD in RAID-0
13. Focus on exhaustivity
─ Exhaustivity is usually not required
─ … or actually not used
─ Users really only look at the top hits!
16. Performance drives the development
● Technical choices: C++, nginx, infrastructure…
● Continuous performance evaluation
● Never stop looking for improvements
17. Performance drives the design
● Always try to minimize computation at query time
  ○ TRIE designed for efficient retrieval, prefix matching and typo-tolerance
  ○ Inverted-list approach: links words to document ids, using compressed integers
  ○ Extra indexing: facets, empty-query cache (top 1000 records)…
● Index format mapped directly in memory
● As you need typo tolerance
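One common way to realize "compressed integers" in an inverted list is delta + variable-byte encoding of the sorted document ids; gaps between consecutive ids are small, so most entries fit in one byte. This is an illustrative sketch, not Algolia's actual on-disk format:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a sorted posting list as deltas, each delta stored in
// variable-byte form (7 payload bits per byte, high bit = "more follows").
std::vector<std::uint8_t> encode(const std::vector<std::uint32_t>& docIds) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docIds) {
        std::uint32_t delta = id - prev;  // small gaps -> few bytes
        prev = id;
        while (delta >= 128) {
            out.push_back(std::uint8_t(delta & 127) | 128);
            delta >>= 7;
        }
        out.push_back(std::uint8_t(delta));
    }
    return out;
}

// Decode back to the original absolute doc ids.
std::vector<std::uint32_t> decode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> ids;
    std::uint32_t prev = 0, delta = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        delta |= std::uint32_t(b & 127) << shift;
        if (b & 128) {
            shift += 7;                    // continuation byte
        } else {
            prev += delta;                 // end of this delta
            ids.push_back(prev);
            delta = 0;
            shift = 0;
        }
    }
    return ids;
}
```

The same byte buffer can then be memory-mapped directly, which fits the "index format mapped directly in memory" point above.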
18. Anatomy of a query
Pipeline: Prepare index → Query parsing → Collect hits → Post-processing → Formatting
● Preparing the index is unavoidable but relatively fast
  ○ We keep indices hot in memory to minimize slow starts
● Collecting hits walks the TRIE and the inverted index lists
  ○ Subject to timeouts
● Post-processing: dynamic ranking, aggregation (distinct), faceting…
  ○ The most expensive part; cannot be bounded
● Formatting depends on query pagination
19. Only get what you need
● Most queries only require the first page, usually no more than 10 hits
● We limit hit retrieval in both time and quantity
● Alternative forms (synonyms and typos) are searched only if we need more hits
● We use a linear approximation to estimate the total number of hits when we reach those limits
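The linear approximation can be as simple as extrapolating from the fraction of candidates scanned before the limit was reached. A hedged sketch (the real estimator may be more involved):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a linear hit-count estimate: if we stopped after scanning
// `scanned` of `total` candidate entries and found `hitsFound` matches,
// assume matches are uniformly spread and extrapolate.
std::uint64_t estimateTotalHits(std::uint64_t hitsFound,
                                std::uint64_t scanned,
                                std::uint64_t total) {
    if (scanned == 0) return 0;            // nothing scanned, nothing to extrapolate
    return hitsFound * total / scanned;
}
```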
20. Only get what you need
Loop: build a result set, then ask "enough hits?" If not, go deeper:
● Use alternatives
● Accept more typos
● …
● Remove keywords
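This widening loop can be sketched as an ordered list of stages, from strictest to loosest, stopping as soon as enough hits have been collected. The stage names and the `searchFn` callback are hypothetical, kept as a parameter so the sketch stays self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical widening stages, mirroring the slide: exact matching first,
// then alternatives (synonyms/typos), more typos, then dropping keywords.
enum class Stage { Exact, Alternatives, MoreTypos, RemoveKeywords };

// Run stages in order; stop as soon as we have collected `wanted` hits.
template <typename SearchFn>
std::vector<std::string> collect(std::size_t wanted, SearchFn searchFn) {
    std::vector<std::string> hits;
    for (Stage s : {Stage::Exact, Stage::Alternatives, Stage::MoreTypos,
                    Stage::RemoveKeywords}) {
        for (const auto& h : searchFn(s)) hits.push_back(h);
        if (hits.size() >= wanted) break;  // enough hits? stop widening
    }
    return hits;
}
```

The key property: the cheap, strict stages always run first, and expensive alternative/typo expansion is only paid for when the strict stages come up short.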
21. Static ranking
● Records are ranked at indexing time
● Internal document ids are ordered by ranking
● The most relevant records sit at the top of each list
● Bounding the search still returns the most relevant hits

Retrieving the 5 best results for two keywords:
List 1: 5, 53, 99, 204, 237, 402, 507, 661, 662, 666, 803
List 2: 13, 53, 101, 204, 237, 408, 507, 666, 803, 990, 1031
∩     : 53, 204, 237, 507, 666
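Because internal doc ids follow the static ranking, posting lists are sorted by relevance, and a plain merge intersection emits the best hits first; we can stop as soon as we have k results. A sketch, using the two lists from the slide:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect two sorted posting lists, stopping after `k` results.
// Since ids are assigned in static-ranking order, the first k common
// ids are also the k most relevant hits — no full scan needed.
std::vector<std::uint32_t> intersectTopK(const std::vector<std::uint32_t>& a,
                                         const std::vector<std::uint32_t>& b,
                                         std::size_t k) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size() && out.size() < k) {
        if (a[i] == b[j])     { out.push_back(a[i]); ++i; ++j; }
        else if (a[i] < b[j]) ++i;
        else                  ++j;
    }
    return out;
}
```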
23. The problem
● Compute basic stats on numerical facets (min, max, average)
● Based on all records satisfying the current query
● Do it for all present facets (or a subset of them)
Sounds easy, doesn't it?
24. The problem
Constraints → Consequences:
● We are schemaless → Can't easily identify numeric facets
● The input format doesn't constrain numeric types (integers, floats…) → Requires a lot of conversions
● We may have way more matching hits than retrieved ones → We need to scan all matching records
25. Sampling
Current solution:
● Only compute stats over the first thousand hits
● Cost only depends on the number of facets
● Fast for a reasonable number of facets
● Works great in most cases!
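The sampling approach boils down to computing min/max/average over at most the first N hits (N = 1000 on the slide) instead of every matching record. A minimal sketch (`sampleStats` is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FacetStats { double min, max, avg; };

// Compute facet stats over at most the first `sampleSize` hit values.
// Cost is bounded by the sample size, not by the number of matches.
FacetStats sampleStats(const std::vector<double>& values, std::size_t sampleSize) {
    std::size_t n = std::min(values.size(), sampleSize);
    if (n == 0) return {0.0, 0.0, 0.0};    // nothing to sample
    FacetStats s{values[0], values[0], 0.0};
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s.min = std::min(s.min, values[i]);
        s.max = std::max(s.max, values[i]);
        sum += values[i];
    }
    s.avg = sum / double(n);
    return s;
}
```

Note that the value `2.0` below never influences the result: it lies beyond the sample, which is exactly the limitation the next slide discusses.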
26. Sampling
● Static ranking may not correlate with all facets
  ○ Extreme values (min/max) may not belong to the sample
  ○ The sample may not be representative for computing the average
● Still expensive with a large number of facets
● Most expensive operations:
  ○ Checking whether a facet is numeric
  ○ Converting values to an appropriate representation
27. Pre-computing
New version:
● Pre-computed statistics for empty queries
● Extra indexing information
  ○ Which facets are numeric
  ○ Values decoded into a suitable binary representation
=> 30% faster at query time for less than 3% of extra indexing time
=> No more inconsistent results
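The pre-computing idea can be sketched as indexing-time detection of numeric facets plus storing values already decoded to `double`, so query time never re-parses strings or re-tests types. Names and structures below are illustrative, not Algolia's actual internals:

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Per-facet indexing-time state: is it numeric, and the values already
// decoded into a binary representation usable directly at query time.
struct IndexedFacet {
    bool isNumeric = true;
    std::vector<double> decoded;
};

// Called once per (facet, raw value) pair while indexing a record.
// A single non-numeric value disqualifies the facet from numeric stats.
void addFacetValue(std::map<std::string, IndexedFacet>& index,
                   const std::string& facet, const std::string& raw) {
    IndexedFacet& f = index[facet];
    char* end = nullptr;
    double v = std::strtod(raw.c_str(), &end);
    if (end == raw.c_str() || *end != '\0')
        f.isNumeric = false;               // not fully parseable as a number
    else
        f.decoded.push_back(v);
}
```

With this in place, the query-time stats loop only touches pre-decoded doubles of facets already known to be numeric, which matches the "30% faster at query time" trade against a small indexing overhead.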
29. Events ingestion & analytics
─ Events (clicks, conversions, views) are sent to an ingestion API
─ Each event carries a query ID and a user/group token
30. Query-time personalization
Pipeline: Events processing → TRIE storage → Query execution
● Personalization profiles/scores are indexed asynchronously
● Personalization profiles/scores are stored in a TRIE for quick retrieval
● Ranking impact, if the query provides a user/group token:
  ○ For each hit, match the defined personalization facets to compute "personalization" scores
  ○ Use those scores as a new ranking criterion in the tie-break formula
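One way to sketch the per-hit personalization score: sum the user profile's affinities over the facet values the hit carries, then feed that sum into the tie-break as one extra criterion. The profile representation below (a facet-value → affinity map) is an assumption for illustration:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical user profile: facet values the user has interacted with,
// mapped to affinity scores derived from their click/conversion events.
using Profile = std::map<std::string, int>;

// A hit's personalization score: the sum of profile affinities over the
// facet values that hit carries. Higher = better fit for this user, and
// the result slots in as one more tie-breaking criterion.
int personalizationScore(const Profile& profile,
                         const std::vector<std::string>& hitFacetValues) {
    int score = 0;
    for (const auto& v : hitFacetValues) {
        auto it = profile.find(v);
        if (it != profile.end()) score += it->second;
    }
    return score;
}
```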
32. Takeaways
─ Speed & relevance (incl. typo-tolerance) are both "easy"; mixing them is harder!
─ It's a matter of trade-offs
─ A fast search engine ≠ a fast search experience: network latency!
─ Know your users: exhaustivity might be overrated!
─ Nothing is faster than your RAM (or CPU cache): make it fit!
─ You won't have time at query time: pre-compute it!