5. Technical Stack
____________________________
Dropwizard as the service framework (incl. Jetty, Jersey, Jackson).
ZooKeeper (via SmartStack) for service discovery.
Lucene for index storage and simple retrieval.
In-house components for the forward index, real-time indexing, ranking, and
advanced filtering.
6. Search Overview
[Diagram: the Web App talks to search nodes Search1, Search2, … SearchN (~30 replicas of the same index). Each node is a JVM holding the Lucene index and data, served by 150 search threads.]
9. What’s in the Lucene index?
____________________________
Positions of listings indexed using Lucene’s spatial module
(RecursivePrefixTreeStrategy)
Categorical and numerical properties like room type and maximum occupancy
Full text (descriptions, reviews, etc.)
~40 fields per listing from a variety of data sources, all updated in real time
11. Spinaltap
Tails binary update logs from MySQL servers (5.6+).
Converts changes in any of the tables into actionable objects called
“Mutations” (inserts, updates, deletes).
Broadcasts them to Medusa using Kafka.
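To make the idea of a "Mutation" concrete, here is a minimal sketch of how a row-level change could be classified as an insert, update, or delete from its before and after row images. The class, field, and method names (MutationSketch, Mutation, fromRowImages) are assumptions made for illustration and are not the real Spinaltap API; reading the binary log and publishing to Kafka are left out.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical model of a Spinaltap-style "Mutation"; names are illustrative,
// not the real Spinaltap API.
final class MutationSketch {
    enum Type { INSERT, UPDATE, DELETE }

    // A row-level change: the table, the row's primary key, and a row image.
    static final class Mutation {
        final Type type;
        final String table;
        final long rowId;
        final Map<String, Object> row;

        Mutation(Type type, String table, long rowId, Map<String, Object> row) {
            this.type = type;
            this.table = table;
            this.rowId = rowId;
            this.row = Collections.unmodifiableMap(row);
        }
    }

    // Classify a change from its before/after row images (null means "absent").
    static Mutation fromRowImages(String table, long rowId,
                                  Map<String, Object> before,
                                  Map<String, Object> after) {
        if (before == null && after == null) {
            throw new IllegalArgumentException("no row image");
        }
        if (before == null) {
            return new Mutation(Type.INSERT, table, rowId, after);
        }
        if (after == null) {
            // For a delete, keep the last known image of the row.
            return new Mutation(Type.DELETE, table, rowId, before);
        }
        return new Mutation(Type.UPDATE, table, rowId, after);
    }
}
```

A consumer such as Medusa could then switch on the mutation type and table to decide which listing's index data to rebuild.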
13. Medusa
Source of truth for search index data.
Listens to updates from Spinaltap and builds new IndexData by
querying ~15 MySQL tables from three different databases.
Persists everything in a DataStore and broadcasts the latest version to all
search nodes.
Uses ZooKeeper for leader election.
15. What’s in the forward index?
____________________________
Holds all the metadata about a listing required by
scoring and filtering.
Complicated business rules for calculating Price, Availability,
InstantBook, etc. need a lot of metadata.
~50 fields built from multiple data sources and updated
in real time.
public final class ForwardIndexData {
    private final CalendarData calendarData;
    private final PricingData pricingData;
    private final HostInfo hostInfo;
    . . . .
    . . . .
}

public final class CalendarData {
    private final DateRanges reservationDates;
    private final SeasonalValues startDayOfWeeks;
    . . . .
}

// A value that applies for a given date range (season).
// Declared public here because a top-level class cannot be private.
public final class SeasonalValues<T> {
    private final DateRange startDate;
    private final T value;
    . . . .
}
Forward Index
16. Availability
____________________________
Depends on the guest's profile.
The check-in date must fall on one of the valid start days of the week.
The stay must satisfy the seasonal minimum nights.
There must be enough preparation time for the host.
Busy dates are imported from external calendars to avoid booking conflicts.
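The rules above can be sketched as a single check. This is a minimal illustration under stated assumptions, not the production logic: the class and field names are hypothetical, "preparation time" is interpreted here as a minimum lead time between today and check-in, the seasonal minimum is assumed to be already resolved to one number for the stay, and the guest-profile rule is omitted.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Set;

// Hypothetical availability check; names and rule details are assumptions.
final class AvailabilitySketch {
    private final Set<DayOfWeek> validStartDays; // allowed check-in days
    private final int minNights;                 // minimum nights for this season
    private final int prepDays;                  // assumed: lead time before check-in
    private final Set<LocalDate> busyDates;      // reserved or imported busy nights

    AvailabilitySketch(Set<DayOfWeek> validStartDays, int minNights,
                       int prepDays, Set<LocalDate> busyDates) {
        this.validStartDays = validStartDays;
        this.minNights = minNights;
        this.prepDays = prepDays;
        this.busyDates = busyDates;
    }

    boolean isAvailable(LocalDate today, LocalDate checkin, LocalDate checkout) {
        long nights = ChronoUnit.DAYS.between(checkin, checkout);
        if (nights < 1 || nights < minNights) {
            return false; // too short for the seasonal minimum
        }
        if (!validStartDays.contains(checkin.getDayOfWeek())) {
            return false; // check-in falls on a disallowed weekday
        }
        if (ChronoUnit.DAYS.between(today, checkin) < prepDays) {
            return false; // host does not have enough time to prepare
        }
        // Every night of the stay (check-out day excluded) must be free.
        for (LocalDate d = checkin; d.isBefore(checkout); d = d.plusDays(1)) {
            if (busyDates.contains(d)) {
                return false;
            }
        }
        return true;
    }
}
```

Ordering the cheap checks first keeps the per-night loop for listings that already pass the simpler rules.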
17. Pricing
____________________________
Depends on the number of guests and the number of nights.
How near or far away the check-in date is.
How long the trip is, and whether the host has weekly and monthly pricing.
Whether there is a special price override for these nights.
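As a rough illustration of how these factors could combine, the sketch below totals a stay from a base nightly price, per-night overrides, an extra-guest fee, and weekly or monthly discounts. All names, the 7-night and 28-night thresholds, and the percentage-discount model are assumptions made for the example; the lead-time factor (how near the check-in date is) is left out.

```java
import java.time.LocalDate;
import java.util.Map;

// Hypothetical pricing calculation; thresholds and fee model are assumptions.
final class PricingSketch {
    private final long nightlyCents;             // base price per night
    private final int weeklyDiscountPct;         // assumed to apply at 7+ nights
    private final int monthlyDiscountPct;        // assumed to apply at 28+ nights
    private final int guestsIncluded;            // guests covered by base price
    private final long extraGuestCents;          // per extra guest, per night
    private final Map<LocalDate, Long> overrides; // special price for a given night

    PricingSketch(long nightlyCents, int weeklyDiscountPct, int monthlyDiscountPct,
                  int guestsIncluded, long extraGuestCents,
                  Map<LocalDate, Long> overrides) {
        this.nightlyCents = nightlyCents;
        this.weeklyDiscountPct = weeklyDiscountPct;
        this.monthlyDiscountPct = monthlyDiscountPct;
        this.guestsIncluded = guestsIncluded;
        this.extraGuestCents = extraGuestCents;
        this.overrides = overrides;
    }

    long totalCents(LocalDate checkin, int nights, int guests) {
        long total = 0;
        for (int i = 0; i < nights; i++) {
            // A per-night override replaces the base price for that night.
            Long override = overrides.get(checkin.plusDays(i));
            total += (override != null) ? override : nightlyCents;
        }
        int extraGuests = Math.max(0, guests - guestsIncluded);
        total += (long) extraGuests * extraGuestCents * nights;

        int discountPct = 0;
        if (nights >= 28) {
            discountPct = monthlyDiscountPct;
        } else if (nights >= 7) {
            discountPct = weeklyDiscountPct;
        }
        return total * (100 - discountPct) / 100;
    }
}
```

Prices are kept in integer cents to avoid floating-point rounding when many nights are summed.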
19. Why did we need our custom Forward Index?
Requirements:
Needs to store objects with 50-100 fields as values keyed by listing id.
Should avoid the cost of serialization/deserialization on every fetch.
Data must be available in memory for fast lookup, but also
persisted on disk.
Highly concurrent: the writer must not block the readers (one writer
but >100 reader threads).
20. Forward Index Interface
import java.util.Map;

// Forward Index
public interface ForwardIndex<V> {
    // In-memory view used by reader threads.
    Map<Long, V> asMap();

    void put(long id, V value);

    void putAll(Map<Long, V> values);

    void remove(long id);

    // Write to disk and make pending changes visible to readers.
    void commit();
}

// Writer
forwardIndex.put(listingId, listingData);
. . .
// write to disk and also make it visible to readers.
forwardIndex.commit();

// Reader
// Fetch forward index data from in-memory map
Map<Long, ListingData> fwdIndex = forwardIndex.asMap();
ListingData data = fwdIndex.get(listingId);

// Use it to evaluate business rules
checkAvailability(data, searchRequest);
calculatePrice(data, searchRequest);
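21. Forward Index Implementation
[Diagram: a non-blocking in-memory HashMap backed by a DiskStore.]
// Forward Index
public class ForwardIndexStore<V> implements ForwardIndex<V> {
    private final DB<V> diskStore;
    // Assumed to implement Map<Long, V> so it can be wrapped below.
    private final Cache<V> cache;

    . . . .

    @Override
    public Map<Long, V> asMap() {
        // Readers get a read-only view of the in-memory data.
        return Collections.unmodifiableMap(cache);
    }

    @Override
    public void put(long id, V value) {
        diskStore.put(id, value);
        cache.put(id, value);
    }

    . . . .

    @Override
    public void commit() {
        // Persist first, then publish the changes to readers.
        diskStore.commit();
        cache.commit();
    }
}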
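The slides do not show how the in-memory side keeps readers unblocked. One common way to get "one writer, many non-blocking readers, visible on commit" is to stage writes privately and publish an immutable snapshot through a volatile reference. The sketch below shows that technique only; it is an assumption, not the implementation described in the talk, the class name SnapshotForwardIndex is hypothetical, and disk persistence is omitted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: single writer, lock-free readers, changes visible on commit().
final class SnapshotForwardIndex<V> {
    // Readers only ever see an immutable snapshot; volatile gives safe publication.
    private volatile Map<Long, V> published = Collections.emptyMap();

    // Staged changes, touched only by the single writer thread.
    // A null value marks a pending removal.
    private final Map<Long, V> pending = new HashMap<>();

    // Never blocks: just reads the current snapshot reference.
    Map<Long, V> asMap() {
        return published;
    }

    void put(long id, V value) {
        pending.put(id, Objects.requireNonNull(value, "value"));
    }

    void remove(long id) {
        pending.put(id, null);
    }

    // Build the next snapshot off to the side, then swap it in atomically.
    void commit() {
        Map<Long, V> next = new HashMap<>(published);
        for (Map.Entry<Long, V> e : pending.entrySet()) {
            if (e.getValue() == null) {
                next.remove(e.getKey());
            } else {
                next.put(e.getKey(), e.getValue());
            }
        }
        pending.clear();
        published = Collections.unmodifiableMap(next);
    }
}
```

Copying the whole map on every commit is only reasonable for small maps or infrequent commits; a store with many listings would more likely apply staged changes to a concurrent map or use a persistent data structure, trading snapshot isolation for cheaper commits.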
22. Ranking Problem
____________________________
Not a text search problem
Users are almost never searching for a specific item; rather, they’re looking to
“discover”
The most common component of a query is location
Highly personalized – the user is a part of the query
Optimizing for conversion (Search -> Inquiry -> Booking)
Evolution through continuous experimentation
Ranking
24. Several hundred signals are used to build
machine learning models:
Properties of the listing (reviews, location, etc.)
Behavioral signals (mined from request logs)
Image quality and clickability (computer vision)
Host behavior (response time/rate, cancellations, etc.)
Host preferences model
[Diagram: data sources are DB snapshots and logs.]
25. Life of a Query
[Diagram: query pipeline. Labels as they appear on the slide:]
Query Understanding: geocoding, configuring retrieval options, choosing ranking models
Retrieval
Populator
First Pass Scorer
Scoring dimensions: Quality, Bookability, Relevance
Second Pass Ranking
Filtering by Price and Availability
Result Generation
AirEvents
Result counts shown on the diagram: 25 results, 2000 results, 25 results
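The pipeline can be read as a two-pass ranking funnel: a cheap first-pass score narrows the retrieved candidates, the survivors are filtered, and a more expensive second pass orders the final page. The sketch below illustrates that general pattern only. The method name, the position of the filter between the two passes, and the mapping of the slide's counts to limits (for example 2000 after the first pass and 25 in the final page) are assumptions, since the flattened diagram does not make them explicit.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

// Hypothetical two-pass ranking funnel; stage order and limits are assumptions.
final class TwoPassRankingSketch {
    static <T> List<T> rank(List<T> retrieved,
                            ToDoubleFunction<T> firstPass, int firstPassLimit,
                            Predicate<T> filter,
                            ToDoubleFunction<T> secondPass, int pageSize) {
        return retrieved.stream()
                // First pass: cheap score, keep only the best candidates.
                .sorted(Comparator.comparingDouble(firstPass).reversed())
                .limit(firstPassLimit)
                // Business-rule filtering, e.g. price and availability.
                .filter(filter)
                // Second pass: costlier model, applied to far fewer items.
                .sorted(Comparator.comparingDouble(secondPass).reversed())
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}
```

A call such as `rank(candidates, cheapScore, 2000, isBookable, modelScore, 25)` would mirror the counts on the slide, with `cheapScore`, `isBookable`, and `modelScore` standing in for functions the slides do not define.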