Implementing a fast search experience is possible; implementing a relevant search experience is attainable; but having both at the same time is hard. In this presentation, we explain how Algolia combines the two, the performance impact of certain relevance requirements, and the trade-offs involved.
4. Marwan Burelle ─ Senior Engineer @ Search Core
─ Focus on performance
─ Also
  ─ Lecturer & researcher
  ─ Parallel prog., graph algorithms & malware classification
Sylvain Utard ─ VP of Engineering
─ Joined Algolia as its 1st employee
─ Leading a team of 100+ engineers
─ Also
  ─ Full-stack engineer (C++, Java, Ruby, JS/TS)
  ─ Part-time text-mining teacher
5. Algolia TL;DR
─ Search-as-a-service - REST Search API
─ 75B searches processed every month
─ 2000+ bare-metal servers (30k vCPUs, 300TB RAM, 2.5PB SSD)
─ Home-made search technology (mainly C++ & Nginx)
─ Company
  ─ Founded in 2013
  ─ 6 offices (Paris, SF, London, NYC, Atlanta & Tokyo)
  ─ 350 people (inc. 100 engineers, 25 products)
  ─ Series C in 2019: $110M (total $184M)
6. 9k+ customers, 50k+ free users
E-commerce · Media & Gaming · Technology/SaaS · Financial Services · Public Sector & Education · Services
7. Consumer-grade search experience
─ Focusing on a very specific subset of what search engines are used for
─ User-facing search bars
─ Searching structured data
  ─ E-com products, categories
  ─ Social-network profiles
  ─ SaaS app objects
─ Relevance is key
  ─ Today, only Google, Amazon, Netflix, Facebook & Apple are really nailing it
  ─ The bar keeps getting higher and higher
  ─ Users now expect this fast & relevant experience everywhere!
9. Focus on relevance
─ Text-mining/NLP research vs. reality
  ─ Mostly about searching unstructured data (the web or academic datasets)
  ─ Where the search-engine owner doesn't necessarily own/control the data being searched
─ Text relevance through statistics might not always be the best option
  ─ Social network: should searching for a common name (e.g. "Jonas Schmidt") be scored differently?
  ─ E-com: should "iPhone" score higher when it is mentioned multiple times?
10. Focus on relevance
─ Algolia leverages a tie-breaking approach
  ─ 5 text-relevance criteria (inc. "attribute")
  ─ 1 geo-relevance criterion
  ─ 1 filter-relevance criterion
─ Amend this list with
  ─ Your own business popularity
  ─ Your own score
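The tie-breaking approach can be sketched as a criterion-by-criterion comparison: a later criterion only matters when all earlier ones are tied. A minimal illustration (not Algolia's actual code; the fixed criterion count and the "higher is better" normalization are assumptions):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical per-hit scores, one slot per ranking criterion, in tie-break
// order (e.g. typos, geo, filters, custom score), normalized so that
// HIGHER is always better.
using Scores = std::array<int, 4>;

// Tie-breaking: compare criterion by criterion; a later criterion is only
// consulted when all earlier criteria are equal.
bool rankedBefore(const Scores& a, const Scores& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) return a[i] > b[i];
    }
    return false;  // fully tied
}
```

A business criterion (popularity, custom score) simply becomes one more slot in the array, without touching the comparison logic.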
11. Focus on speed & exhaustivity: search-as-you-type
─ Initially designed for mobile phones & search-as-you-type experiences
  ─ Prefix search built in (autocomplete)
  ─ Huge focus on typo-tolerance (fat fingers)
  ─ Had to run on an iPhone 3G (128 MB RAM & 412 MHz) or Android equivalents
─ Prefix search is easy
─ Typo-tolerant search is easy
─ Fast search is easy
─ Combining all three while staying relevant is hard!
12. Focus on speed: hardware & network
─ Fine-tuned hardware & OS
  ─ A bunch of kernel optimizations (I/O scheduling, network & RAM/disk buffers)
  ─ Packaged as a static library & linked within nginx
  ─ Separate build process running on the same machines
─ Clusters of 3 machines (min) for HA
  ─ Master-master replication
  ─ High-end servers for performance
─ Worldwide (built-in) replication
  ─ 70 datacenters across 16 regions
  ─ Horizontal scalability at search time, wherever your end-users are
Typical server: 64-128 GB RAM · 8-12 cores @ 3.2-3.8 GHz · 2x 300-800 GB SSD in RAID-0
13. Focus on exhaustivity
─ Exhaustivity is usually not required
─ … or actually not used
─ Users really only look at the top hits!
16. Performance drives the development
● Technical choices: C++, nginx, infrastructure…
● Continuous performance evaluation
● Never stop looking for improvements
17. Performance drives the design
● Always try to minimize computation at query time
  ○ TRIE designed for efficient retrieval, prefix matching and typo-tolerance
  ○ Inverted-list approach: links words to document ids, using compressed integers
  ○ Extra indexing: facets, empty-query cache (top 1000 records)…
● Index format mapped directly in memory
● As you need typo tolerance
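One common way to realize "compressed integers" in an inverted list is delta + variable-byte encoding of the sorted document ids; gaps between consecutive ids are small, so most entries fit in one byte. This is an illustrative sketch, not Algolia's actual on-disk format:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a sorted posting list as deltas, each delta stored in
// variable-byte form (7 payload bits per byte, high bit = "more follows").
std::vector<std::uint8_t> encode(const std::vector<std::uint32_t>& docIds) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docIds) {
        std::uint32_t delta = id - prev;  // small gaps -> few bytes
        prev = id;
        while (delta >= 128) {
            out.push_back(std::uint8_t(delta & 127) | 128);
            delta >>= 7;
        }
        out.push_back(std::uint8_t(delta));
    }
    return out;
}

// Decode back to the original absolute doc ids.
std::vector<std::uint32_t> decode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> ids;
    std::uint32_t prev = 0, delta = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        delta |= std::uint32_t(b & 127) << shift;
        if (b & 128) {
            shift += 7;                    // continuation byte
        } else {
            prev += delta;                 // end of this delta
            ids.push_back(prev);
            delta = 0;
            shift = 0;
        }
    }
    return ids;
}
```

The same byte buffer can then be memory-mapped directly, which fits the "index format mapped directly in memory" point above.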
18. Anatomy of a query
Pipeline: Prepare index → Query parsing → Collect hits → Post-processing → Formatting
● Preparing the index is unavoidable but relatively fast
  ○ We keep indices hot in memory to minimize slow starts
● Collecting hits walks the TRIE and the inverted index lists
  ○ Subject to timeouts
● Post-processing: dynamic ranking, aggregation (distinct), faceting…
  ○ The most expensive part; cannot be bounded
● Formatting depends on query pagination
19. Only get what you need
● Most queries only require the first page, usually no more than 10 hits
● We limit hit retrieval in both time and quantity
● Alternative forms (synonyms and typos) are searched only if we need more hits
● We use a linear approximation to estimate the total number of hits when we reach those limits
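The linear approximation can be as simple as extrapolating from the fraction of candidates scanned before the limit was reached. A hedged sketch (the real estimator may be more involved):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a linear hit-count estimate: if we stopped after scanning
// `scanned` of `total` candidate entries and found `hitsFound` matches,
// assume matches are uniformly spread and extrapolate.
std::uint64_t estimateTotalHits(std::uint64_t hitsFound,
                                std::uint64_t scanned,
                                std::uint64_t total) {
    if (scanned == 0) return 0;            // nothing scanned, nothing to extrapolate
    return hitsFound * total / scanned;
}
```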
20. Only get what you need
Loop: build a result set, then ask "enough hits?" If not, go deeper:
● Use alternatives
● Accept more typos
● …
● Remove keywords
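This widening loop can be sketched as an ordered list of stages, from strictest to loosest, stopping as soon as enough hits have been collected. The stage names and the `searchFn` callback are hypothetical, kept as a parameter so the sketch stays self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical widening stages, mirroring the slide: exact matching first,
// then alternatives (synonyms/typos), more typos, then dropping keywords.
enum class Stage { Exact, Alternatives, MoreTypos, RemoveKeywords };

// Run stages in order; stop as soon as we have collected `wanted` hits.
template <typename SearchFn>
std::vector<std::string> collect(std::size_t wanted, SearchFn searchFn) {
    std::vector<std::string> hits;
    for (Stage s : {Stage::Exact, Stage::Alternatives, Stage::MoreTypos,
                    Stage::RemoveKeywords}) {
        for (const auto& h : searchFn(s)) hits.push_back(h);
        if (hits.size() >= wanted) break;  // enough hits? stop widening
    }
    return hits;
}
```

The key property: the cheap, strict stages always run first, and expensive alternative/typo expansion is only paid for when the strict stages come up short.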
21. Static ranking
● Records are ranked at indexing time
● Internal document ids are ordered by ranking
● The most relevant records sit at the top of each list
● Bounding the search still returns the most relevant hits

Retrieving the 5 best results for two keywords:
List 1: 5, 53, 99, 204, 237, 402, 507, 661, 662, 666, 803
List 2: 13, 53, 101, 204, 237, 408, 507, 666, 803, 990, 1031
∩     : 53, 204, 237, 507, 666
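Because internal doc ids follow the static ranking, posting lists are sorted by relevance, and a plain merge intersection emits the best hits first; we can stop as soon as we have k results. A sketch, using the two lists from the slide:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect two sorted posting lists, stopping after `k` results.
// Since ids are assigned in static-ranking order, the first k common
// ids are also the k most relevant hits — no full scan needed.
std::vector<std::uint32_t> intersectTopK(const std::vector<std::uint32_t>& a,
                                         const std::vector<std::uint32_t>& b,
                                         std::size_t k) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size() && out.size() < k) {
        if (a[i] == b[j])     { out.push_back(a[i]); ++i; ++j; }
        else if (a[i] < b[j]) ++i;
        else                  ++j;
    }
    return out;
}
```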
23. The problem
● Compute basic stats on numerical facets (min, max, average)
● Based on all records satisfying the current query
● Do it for all present facets (or a subset of them)
Sounds easy, doesn't it?
24. The problem
Constraints → Consequences:
● We are schemaless → Can't easily identify numeric facets
● The input format doesn't constrain numeric types (integers, floats…) → Requires a lot of conversions
● We may have way more matching hits than retrieved ones → We need to scan all matching records
25. Sampling
Current solution:
● Only compute stats over the first thousand hits
● Cost only depends on the number of facets
● Fast for a reasonable number of facets
● Works great in most cases!
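The sampling approach boils down to computing min/max/average over at most the first N hits (N = 1000 on the slide) instead of every matching record. A minimal sketch (`sampleStats` is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FacetStats { double min, max, avg; };

// Compute facet stats over at most the first `sampleSize` hit values.
// Cost is bounded by the sample size, not by the number of matches.
FacetStats sampleStats(const std::vector<double>& values, std::size_t sampleSize) {
    std::size_t n = std::min(values.size(), sampleSize);
    if (n == 0) return {0.0, 0.0, 0.0};    // nothing to sample
    FacetStats s{values[0], values[0], 0.0};
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s.min = std::min(s.min, values[i]);
        s.max = std::max(s.max, values[i]);
        sum += values[i];
    }
    s.avg = sum / double(n);
    return s;
}
```

Note that the value `2.0` below never influences the result: it lies beyond the sample, which is exactly the limitation the next slide discusses.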
26. Sampling
● Static ranking may not correlate with all facets
  ○ Extreme values (min/max) may not belong to the sample
  ○ The sample may not be representative for computing the average
● Still expensive with a large number of facets
● Most expensive operations:
  ○ Checking whether a facet is numeric
  ○ Converting values to an appropriate representation
27. Pre-computing
New version:
● Pre-computed statistics for empty queries
● Extra indexing information
  ○ Which facets are numeric
  ○ Values decoded into a suitable binary representation
=> 30% faster at query time for less than 3% of extra indexing time
=> No more inconsistent results
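The pre-computing idea can be sketched as indexing-time detection of numeric facets plus storing values already decoded to `double`, so query time never re-parses strings or re-tests types. Names and structures below are illustrative, not Algolia's actual internals:

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Per-facet indexing-time state: is it numeric, and the values already
// decoded into a binary representation usable directly at query time.
struct IndexedFacet {
    bool isNumeric = true;
    std::vector<double> decoded;
};

// Called once per (facet, raw value) pair while indexing a record.
// A single non-numeric value disqualifies the facet from numeric stats.
void addFacetValue(std::map<std::string, IndexedFacet>& index,
                   const std::string& facet, const std::string& raw) {
    IndexedFacet& f = index[facet];
    char* end = nullptr;
    double v = std::strtod(raw.c_str(), &end);
    if (end == raw.c_str() || *end != '\0')
        f.isNumeric = false;               // not fully parseable as a number
    else
        f.decoded.push_back(v);
}
```

With this in place, the query-time stats loop only touches pre-decoded doubles of facets already known to be numeric, which matches the "30% faster at query time" trade against a small indexing overhead.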
29. Events ingestion & analytics
─ Events (clicks, conversions, views) are sent to an ingestion API
─ Each event carries a query ID and a user/group token
30. Query-time personalization
Pipeline: Events processing → TRIE storage → Query execution
● Personalization profiles/scores are indexed asynchronously
● Personalization profiles/scores are stored in a TRIE for quick retrieval
● Ranking impact, if the query provides a user/group token:
  ○ For each hit, match the defined personalization facets to compute "personalization" scores
  ○ Use those scores as a new ranking criterion in the tie-break formula
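One way to sketch the per-hit personalization score: sum the user profile's affinities over the facet values the hit carries, then feed that sum into the tie-break as one extra criterion. The profile representation below (a facet-value → affinity map) is an assumption for illustration:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical user profile: facet values the user has interacted with,
// mapped to affinity scores derived from their click/conversion events.
using Profile = std::map<std::string, int>;

// A hit's personalization score: the sum of profile affinities over the
// facet values that hit carries. Higher = better fit for this user, and
// the result slots in as one more tie-breaking criterion.
int personalizationScore(const Profile& profile,
                         const std::vector<std::string>& hitFacetValues) {
    int score = 0;
    for (const auto& v : hitFacetValues) {
        auto it = profile.find(v);
        if (it != profile.end()) score += it->second;
    }
    return score;
}
```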
32. Takeaways
─ Speed & relevance (incl. typo-tolerance) are both "easy"; mixing them is harder!
─ It's a matter of trade-offs
─ A fast search engine ≠ a fast search experience: network latency!
─ Know your users: exhaustivity might be overrated!
─ Nothing is faster than your RAM (or CPU cache): make it fit!
─ You won't have time at query time: pre-compute it!