2. Best travel experiences powered by data products
Inform decision making based on data and the insights derived from it
3. • ML applications
- Fraud detection, search ranking, etc.
• User activity
- Growth, matching, etc.
• Experimentation, monitoring, etc.
Events lead to insights.
[Diagram: events flow through the production data warehouse to produce insights]
4. Challenges (1.5 Years Ago)
• JSON events without schemas
• Over 800 event types
• Easy to break events during evolution/code changes
• Lack of monitoring
These led to:
• Too many data outages and data-loss incidents
• Lack of trust in data systems
6. Data Quality Failure (1.5 Years Ago)
ERF was unstable, and the experimentation culture was weak:
“Hi team, this is partly a PSA to let you know that ERF dashboard data hasn’t been up to date/accurate for several weeks now. Do not rely on the ERF dashboard for information about your experiment.”
8. Targeted Reliability Guarantees
• Timeliness
- Land on time; be predictable
• Completeness
- All data should land in the warehouse
• Data Quality
- Identify anomalous behavior
10. Rapid Growth in Events Data
• More users, activity, bookings, etc.
• Need lightweight techniques that are not themselves bottlenecks
[Chart: monthly events volume, January 2014 – September 2015]
11. No Ground Truth
• How many events were actually emitted?
• How many should have been emitted?
• What should the correct data contain?
• How can we catch subtle anomalies in the data?
14. Guarding Against Component Failures
Instrumentation, monitoring, and alerting on each component:
• Process health
• Count of input/output events
• Week-over-week comparison
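The week-over-week comparison above can be sketched in a few lines. The hourly bucketing, the `threshold` value, and the dict-based count source are illustrative assumptions, not the deck's actual implementation.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

def week_over_week_alert(counts_by_hour, hour_ts, threshold=0.2):
    """Alert when this hour's event count deviates from the same hour
    last week by more than `threshold` (relative change).

    counts_by_hour: dict mapping hour timestamps (epoch seconds) -> counts.
    """
    current = counts_by_hour.get(hour_ts, 0)
    baseline = counts_by_hour.get(hour_ts - SECONDS_PER_WEEK)
    if baseline is None or baseline == 0:
        return False  # no baseline to compare against
    change = abs(current - baseline) / baseline
    return change > threshold
```

A real deployment would account for seasonality beyond a single-week lag and for hours with legitimately low traffic, but the core comparison is this simple.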
17. E2E System Auditing
Hardening each component is not sufficient:
• Account for new failure modes
• Quantify aggregate event loss
• Narrow down the source of loss
• Need end-to-end, out-of-band checks on the full pipeline
18. Canary Service
• A standalone service that sends events at a known rate
• Compare events landed in the warehouse against the known rate and alert on loss
• Simple, reliable, and accurate
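A minimal sketch of the canary idea: because the canary emits at a known rate, the expected count over any window is exact, and loss is just the shortfall. The rate, window, and tolerance values here are illustrative assumptions.

```python
def canary_loss_alert(landed_count, rate_per_sec, window_secs, tolerance=0.001):
    """Alert when the warehouse received fewer canary events than expected,
    beyond a small tolerance for events still in flight.

    landed_count: canary events found in the warehouse for the window.
    rate_per_sec: the canary's known, fixed emission rate.
    """
    expected = rate_per_sec * window_secs
    lost = max(0, expected - landed_count)
    return lost / expected > tolerance
```

The strength of this check is that it needs no ground truth from production traffic: the canary's own emission rate is the ground truth.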
19. DB as Proxy for Ground Truth
• Compare DB mutations with the corresponding events emitted
• The DB serves as ground truth for events with a 1:1 mapping
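For events that map 1:1 to database mutations, the check above reduces to a set difference between mutation identifiers and landed-event identifiers. The identifier shape is an illustrative assumption.

```python
def missing_events(db_mutation_ids, landed_event_ids):
    """For events with a 1:1 mapping to DB mutations, any mutation id
    without a matching landed event indicates a lost event."""
    return set(db_mutation_ids) - set(landed_event_ids)
```

At scale this comparison would run as a warehouse join rather than in memory, but the semantics are the same.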
20. Audit Pipeline Overview
• Need to quantify event loss and ensure the SLA is not violated
• Attach a header to each event when it enters the pipeline (REST proxy, Java, and Ruby clients)
• The header contains host, process, sequence number, and UUID
• Group sequence numbers by (host, process) in the warehouse to quantify event loss and attribute it to hosts
• Extend to multi-hop sequences to attribute loss to internal components of the pipeline
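The sequence audit above can be sketched as a gap count per producer. This assumes each (host, process) emits contiguous sequence numbers, as the headers described on this slide imply; the data layout is an illustrative assumption.

```python
from collections import defaultdict

def quantify_loss(events):
    """events: iterable of (host, process, sequence) headers observed
    in the warehouse.

    Returns {(host, process): lost_count}, where loss is the number of
    sequence numbers missing between the min and max observed for that
    producer."""
    seqs = defaultdict(set)
    for host, process, seq in events:
        seqs[(host, process)].add(seq)
    loss = {}
    for key, observed in seqs.items():
        expected = max(observed) - min(observed) + 1
        loss[key] = expected - len(observed)
    return loss
```

Because loss is keyed by (host, process), a spike immediately points at the offending machine; extending the header with one sequence per hop lets the same grouping attribute loss to an internal component instead.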
24. Challenges (1.5 Years Ago)
• JSON events without schemas
• Easy to break events during evolution/code changes
• Over 800 event types
• Lack of monitoring
These led to:
• Too many data outages and data-loss incidents
• Lack of trust in data systems
26. Schema Enforcement
• Schema tech stack: Thrift
• Libraries for sending Thrift objects from different clients: Java, Ruby, JS, and mobile
• Who should define schemas: data scientists or product engineers?
• Development workflow: schema evolution, and bridging producer and consumer schemas
• Self-serve
27. Thrift Schema Repository
Why Thrift?
• Easy syntax
• Good performance in Ruby
• Ubiquitous
Advantages of a schema repo?
• A great catalyst for communication, documentation, etc.
• It ships JARs and gems
• Will developers hate you for this? No.
28. Schema Evolution
• Standard field in the event schema
• Managed explicitly
• Use semantic versioning: 1.0.0 = MODEL.REVISION.ADDITION
- MODEL: a change that breaks backward compatibility. Example: changing the type of a field.
- REVISION: a change that is backward compatible but not forward compatible. Example: adding a new field to a union type.
- ADDITION: a change that is both backward and forward compatible. Example: adding a new optional field.
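Under the MODEL.REVISION.ADDITION scheme above, whether a reader can decode data written with another schema version reduces to comparing version components. A sketch, assuming versions are strings like "1.2.3"; the function names are illustrative, not from the deck.

```python
def parse_version(version):
    """Split 'MODEL.REVISION.ADDITION' into three ints."""
    model, revision, addition = (int(part) for part in version.split("."))
    return model, revision, addition

def can_read(reader_version, writer_version):
    """A MODEL mismatch breaks compatibility entirely.  A REVISION bump
    is backward compatible but not forward compatible, so a reader older
    than the writer's REVISION cannot decode the data.  ADDITIONs are
    compatible in both directions and can be ignored."""
    r_model, r_rev, _ = parse_version(reader_version)
    w_model, w_rev, _ = parse_version(writer_version)
    if r_model != w_model:
        return False
    return r_rev >= w_rev
```

This is why managing the version as an explicit field in every event pays off: consumers can decide at read time whether a payload is safe to decode.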
33. A Bad Date Picker
• On 9/22/2015, we launched a new date-picker experiment on P1
• Half of the users received the new_datepicker treatment; the other half were control
• It was shut off by 9/29/2015, and metrics recovered
34. Diagnosis
• We realized a 14% drop in “searches with dates” after about 7 days
• The scope of the impact was unclear; we only knew that a subset of locales was affected
• The root-cause analysis depended heavily on vigilance and a bit of guesswork/luck
• Drilling down by country revealed an interesting pattern
35. Diagnosis
• Drilling down into source = P1, we saw a stronger pattern
• Something qualitatively worse was happening in IT, GB, and CA
• “Affected locales: en-GB, it, en-AU, en-CA, en-NZ, da, zh-tw, ms-my and probably some more”
• How did we know to try P1?
• How did we know which countries to slice by?
36. Curiosity
• Let’s automate this process!
• It’s hard to know which dimension combinations matter…
…so try as many of them as we reasonably can, in an intelligent way
• Drill down into dimension combinations that are specific enough to be informative, yet still contribute meaningfully to the top-level aggregate
37. Method
• Retrieve time-series data from a source (GROUP BY time, dimension)
• Detect anomalies in each dimension value’s time series
• Explore the dimension space to compare values against each other
• Prune the set of dimension values using the anomalies/exploration results
• Drill down into the remaining dimensions for the pruned values
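The steps above can be sketched as a recursive drill-down. The anomaly score (last point vs. the mean of earlier points), the pruning thresholds, and the row layout are all illustrative assumptions; the real system would use a proper anomaly detector.

```python
from collections import defaultdict

def anomaly_score(series):
    """Illustrative score: relative change of the last point versus the
    mean of the preceding points."""
    if len(series) < 2:
        return 0.0
    baseline = sum(series[:-1]) / (len(series) - 1)
    if baseline == 0:
        return 0.0
    return abs(series[-1] - baseline) / baseline

def drill_down(rows, dimensions, min_share=0.05, min_score=0.2, prefix=()):
    """rows: list of (dim_values_dict, series).  For each dimension,
    group the series by that dimension's values, keep values that are
    both anomalous and still a meaningful share of the total, and
    recursively drill into the remaining dimensions for those values."""
    total = sum(series[-1] for _, series in rows) or 1
    findings = []
    for i, dim in enumerate(dimensions):
        groups = defaultdict(lambda: [0.0] * len(rows[0][1]))
        for values, series in rows:
            bucket = groups[values[dim]]
            for t, x in enumerate(series):
                bucket[t] += x
        for value, series in groups.items():
            score = anomaly_score(series)
            share = series[-1] / total
            if score >= min_score and share >= min_share:
                path = prefix + ((dim, value),)
                findings.append((path, score))
                subset = [r for r in rows if r[0][dim] == value]
                findings += drill_down(subset, dimensions[i + 1:],
                                       min_share, min_score, path)
    return findings
```

The two thresholds encode the slide's trade-off directly: `min_score` keeps only informative (anomalous) slices, while `min_share` keeps only slices that still contribute meaningfully to the top-level aggregate, so the search space stays tractable.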
42. Hive HBase Connector
CREATE EXTERNAL TABLE `search_event_table` (
  `rowkey` string COMMENT 'from deserializer',
  `event_bytes` binary COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.airbnb.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.airbnb.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.timerange.hourly.boundary'='true',   -- for the current hour
  'hbase.columns.mapping'=':key, b:event_bytes',
  'hbase.key.pushdown'='jitney_event.search_event',
  'hbase.timestamp.min'='…',   -- arbitrary time-range start
  'hbase.timestamp.max'='…')   -- arbitrary time-range end
43. Conclusions
• Ingest over 5B events with less than 100 events/day of loss
• We can alert on data loss in real time (loss > 0.01%)
• We can quantify which machine/service led to how much loss
• We can identify even subtle anomalies in the data