2. Best travel experiences powered by data products
Inform decision making based on data and the insights derived from it
3. • ML applications
- Fraud detection, search ranking, etc.
• User activity
- Growth, matching, etc.
• Experimentation, monitoring, etc.
Events lead to insights.
[Diagram: events flow through the production data warehouse to produce insights]
4. Challenges (1.5 Years Ago)
• JSON events without schemas
• Over 800 event types
• Easy to break events during evolution/code changes
• Lack of monitoring
These led to:
• Too many data outages and data-loss incidents
• Lack of trust in data systems
6. Data Quality Failure (1.5 Years Ago)
ERF was unstable, and the experimentation culture was weak:
“Hi team, this is partly a PSA to let you know that ERF dashboard data hasn’t been up to date/accurate for several weeks now. Do not rely on the ERF dashboard for information about your experiment.”
8. Targeted Reliability Guarantees
• Timeliness
- Land on time; be predictable
• Completeness
- All data should land in the warehouse
• Data Quality
- Identify anomalous behavior
10. Rapid Growth in Events Data
• More users, activity, bookings, etc.
• Need lightweight techniques that are not themselves bottlenecks
[Chart: monthly events volume, January 2014 – September 2015]
11. No Ground Truth
• How many events were actually emitted?
• How many should have been emitted?
• What should the correct data contain?
• How can we catch subtle anomalies in the data?
14. Guarding Against Component Failures
Instrumentation, monitoring, and alerting on each component:
• Process health
• Count of input/output events
• Week-over-week comparison
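The week-over-week comparison above can be sketched in a few lines. The hourly bucketing, the `threshold` value, and the dict-based count source are illustrative assumptions, not the deck's actual implementation.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

def week_over_week_alert(counts_by_hour, hour_ts, threshold=0.2):
    """Alert when this hour's event count deviates from the same hour
    last week by more than `threshold` (relative change).

    counts_by_hour: dict mapping hour timestamps (epoch seconds) -> counts.
    """
    current = counts_by_hour.get(hour_ts, 0)
    baseline = counts_by_hour.get(hour_ts - SECONDS_PER_WEEK)
    if baseline is None or baseline == 0:
        return False  # no baseline to compare against
    change = abs(current - baseline) / baseline
    return change > threshold
```

A real deployment would account for seasonality beyond a single-week lag and for hours with legitimately low traffic, but the core comparison is this simple.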
17. E2E System Auditing
Hardening each component is not sufficient:
• Account for new failure modes
• Quantify aggregate event loss
• Narrow down the source of loss
• Need end-to-end, out-of-band checks on the full pipeline
18. Canary Service
• A standalone service that sends events at a known rate
• Compare events landed in the warehouse against the known rate and alert on loss
• Simple, reliable, and accurate
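A minimal sketch of the canary idea: because the canary emits at a known rate, the expected count over any window is exact, and loss is just the shortfall. The rate, window, and tolerance values here are illustrative assumptions.

```python
def canary_loss_alert(landed_count, rate_per_sec, window_secs, tolerance=0.001):
    """Alert when the warehouse received fewer canary events than expected,
    beyond a small tolerance for events still in flight.

    landed_count: canary events found in the warehouse for the window.
    rate_per_sec: the canary's known, fixed emission rate.
    """
    expected = rate_per_sec * window_secs
    lost = max(0, expected - landed_count)
    return lost / expected > tolerance
```

The strength of this check is that it needs no ground truth from production traffic: the canary's own emission rate is the ground truth.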
19. DB as Proxy for Ground Truth
• Compare DB mutations with the corresponding events emitted
• The DB serves as ground truth for events with a 1:1 mapping
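For events that map 1:1 to database mutations, the check above reduces to a set difference between mutation identifiers and landed-event identifiers. The identifier shape is an illustrative assumption.

```python
def missing_events(db_mutation_ids, landed_event_ids):
    """For events with a 1:1 mapping to DB mutations, any mutation id
    without a matching landed event indicates a lost event."""
    return set(db_mutation_ids) - set(landed_event_ids)
```

At scale this comparison would run as a warehouse join rather than in memory, but the semantics are the same.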
20. Audit Pipeline Overview
• Need to quantify event loss and ensure the SLA is not violated
• Attach a header to each event when it enters the pipeline (REST proxy, Java, and Ruby clients)
• The header contains host, process, sequence number, and UUID
• Group sequence numbers by (host, process) in the warehouse to quantify event loss and attribute it to hosts
• Extend to multi-hop sequences to attribute loss to internal components of the pipeline
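The sequence audit above can be sketched as a gap count per producer. This assumes each (host, process) emits contiguous sequence numbers, as the headers described on this slide imply; the data layout is an illustrative assumption.

```python
from collections import defaultdict

def quantify_loss(events):
    """events: iterable of (host, process, sequence) headers observed
    in the warehouse.

    Returns {(host, process): lost_count}, where loss is the number of
    sequence numbers missing between the min and max observed for that
    producer."""
    seqs = defaultdict(set)
    for host, process, seq in events:
        seqs[(host, process)].add(seq)
    loss = {}
    for key, observed in seqs.items():
        expected = max(observed) - min(observed) + 1
        loss[key] = expected - len(observed)
    return loss
```

Because loss is keyed by (host, process), a spike immediately points at the offending machine; extending the header with one sequence per hop lets the same grouping attribute loss to an internal component instead.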
24. Challenges (1.5 Years Ago)
• JSON events without schemas
• Easy to break events during evolution/code changes
• Over 800 event types
• Lack of monitoring
These led to:
• Too many data outages and data-loss incidents
• Lack of trust in data systems
26. Schema Enforcement
• Schema tech stack: Thrift
• Libraries for sending Thrift objects from different clients: Java, Ruby, JS, and mobile
• Who should define schemas: data scientists or product engineers?
• Development workflow: schema evolution, and bridging producer and consumer schemas
• Self-serve
27. Thrift Schema Repository
Why Thrift?
• Easy syntax
• Good performance in Ruby
• Ubiquitous
Advantages of a schema repo?
• A great catalyst for communication, documentation, etc.
• It ships JARs and gems
• Will developers hate you for this? No.
28. Schema Evolution
• Standard field in the event schema
• Managed explicitly
• Use semantic versioning: 1.0.0 = MODEL.REVISION.ADDITION
- MODEL: a change that breaks backward compatibility. Example: changing the type of a field.
- REVISION: a change that is backward compatible but not forward compatible. Example: adding a new field to a union type.
- ADDITION: a change that is both backward and forward compatible. Example: adding a new optional field.
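Under the MODEL.REVISION.ADDITION scheme above, whether a reader can decode data written with another schema version reduces to comparing version components. A sketch, assuming versions are strings like "1.2.3"; the function names are illustrative, not from the deck.

```python
def parse_version(version):
    """Split 'MODEL.REVISION.ADDITION' into three ints."""
    model, revision, addition = (int(part) for part in version.split("."))
    return model, revision, addition

def can_read(reader_version, writer_version):
    """A MODEL mismatch breaks compatibility entirely.  A REVISION bump
    is backward compatible but not forward compatible, so a reader older
    than the writer's REVISION cannot decode the data.  ADDITIONs are
    compatible in both directions and can be ignored."""
    r_model, r_rev, _ = parse_version(reader_version)
    w_model, w_rev, _ = parse_version(writer_version)
    if r_model != w_model:
        return False
    return r_rev >= w_rev
```

This is why managing the version as an explicit field in every event pays off: consumers can decide at read time whether a payload is safe to decode.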
33. A Bad Date Picker
• On 9/22/2015, we launched a new date-picker experiment on P1
• Half of the users received the new_datepicker treatment; the other half were control
• It was shut off by 9/29/2015, and metrics recovered
34. Diagnosis
• We realized a 14% drop in “searches with dates” after about 7 days
• The scope of the impact was unclear; we only knew that a subset of locales was affected
• The root-cause analysis depended heavily on vigilance and a bit of guesswork/luck
• Drilling down by country revealed an interesting pattern
35. Diagnosis
• Drilling down into source = P1, we saw a stronger pattern
• Something qualitatively worse was happening in IT, GB, and CA
• “Affected locales: en-GB, it, en-AU, en-CA, en-NZ, da, zh-tw, ms-my and probably some more”
• How did we know to try P1?
• How did we know which countries to slice by?
36. Curiosity
• Let’s automate this process!
• It’s hard to know which dimension combinations matter…
…so try as many of them as we reasonably can, in an intelligent way
• Drill down into dimension combinations that are specific enough to be informative, yet still contribute meaningfully to the top-level aggregate
37. Method
• Retrieve time-series data from a source (GROUP BY time, dimension)
• Detect anomalies in each dimension value’s time series
• Explore the dimension space to compare values against each other
• Prune the set of dimension values using the anomalies/exploration results
• Drill down into the remaining dimensions for the pruned values
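The steps above can be sketched as a recursive drill-down. The anomaly score (last point vs. the mean of earlier points), the pruning thresholds, and the row layout are all illustrative assumptions; the real system would use a proper anomaly detector.

```python
from collections import defaultdict

def anomaly_score(series):
    """Illustrative score: relative change of the last point versus the
    mean of the preceding points."""
    if len(series) < 2:
        return 0.0
    baseline = sum(series[:-1]) / (len(series) - 1)
    if baseline == 0:
        return 0.0
    return abs(series[-1] - baseline) / baseline

def drill_down(rows, dimensions, min_share=0.05, min_score=0.2, prefix=()):
    """rows: list of (dim_values_dict, series).  For each dimension,
    group the series by that dimension's values, keep values that are
    both anomalous and still a meaningful share of the total, and
    recursively drill into the remaining dimensions for those values."""
    total = sum(series[-1] for _, series in rows) or 1
    findings = []
    for i, dim in enumerate(dimensions):
        groups = defaultdict(lambda: [0.0] * len(rows[0][1]))
        for values, series in rows:
            bucket = groups[values[dim]]
            for t, x in enumerate(series):
                bucket[t] += x
        for value, series in groups.items():
            score = anomaly_score(series)
            share = series[-1] / total
            if score >= min_score and share >= min_share:
                path = prefix + ((dim, value),)
                findings.append((path, score))
                subset = [r for r in rows if r[0][dim] == value]
                findings += drill_down(subset, dimensions[i + 1:],
                                       min_share, min_score, path)
    return findings
```

The two thresholds encode the slide's trade-off directly: `min_score` keeps only informative (anomalous) slices, while `min_share` keeps only slices that still contribute meaningfully to the top-level aggregate, so the search space stays tractable.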
42. Hive HBase Connector
CREATE EXTERNAL TABLE `search_event_table` (
  `rowkey` string COMMENT 'from deserializer',
  `event_bytes` binary COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.airbnb.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.airbnb.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.timerange.hourly.boundary'='true',   -- for the current hour
  'hbase.columns.mapping'=':key, b:event_bytes',
  'hbase.key.pushdown'='jitney_event.search_event',
  'hbase.timestamp.min'='…',   -- arbitrary time-range start
  'hbase.timestamp.max'='…')   -- arbitrary time-range end
43. Conclusions
• Ingest over 5B events with less than 100 events/day of loss
• We can alert on data loss in real time (loss > 0.01%)
• We can quantify which machine/service led to how much loss
• We can identify even subtle anomalies in the data