AirStream is a real-time stream computation framework that supports Flink as one of its processing engines. It allows engineers and data scientists at Airbnb to easily leverage Flink to build real-time data pipelines and feedback loops, and multiple mission-critical applications have been built on top of it. In this talk, we start with an overview of AirStream and describe how we have designed it to leverage Flink's SQL support so that users can easily build real-time data pipelines. We go over a few production use cases, such as building a user activity profiler and building user identity mapping in real time. We also cover how we have integrated AirStream into the data infrastructure ecosystem at Airbnb through easily configurable connectors, such as Kafka and Hive, that let users leverage these components in their pipelines.
4. What is AirStream
• A framework to define and execute data pipelines
• Pipelines created by stitching together building blocks
• Pipelines defined through configuration
• Philosophy: Make simple things easy, complex things possible
5. Why Flink
• Low-latency streaming
• Full SQL support
• Stability: battle-tested
• Adoption within the industry
6. Components of a pipeline
• Source
• Process
• Sink
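A pipeline configuration stitches these three blocks together. Below is a minimal, hypothetical sketch in the style of the full sample on slide 13; the names and values are placeholders, and the kafka sink keys are an assumption mirrored from the source block:

source = [
  { name: events, type: kafka, config: { topic: …, broker: …, serde: jitney } }
]
process = [
  { name = filtered_events, type = sql,
    sql = "SELECT * FROM events WHERE field1 IS NOT NULL" }
]
sink = [
  { type: kafka, input: filtered_events }
]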
7. Components of a pipeline
SOURCES
• Structured and unstructured (e.g., Jitney)
• Static and dynamic data sources (Kafka, HDFS, etc.)
8. Components of a pipeline
JITNEY SOURCE
• Structured and versioned messages
• The majority of events in Airbnb's online services are published as Jitney messages
• Enables SQL: the Thrift schema is translated into a SQL schema
Thrift
  Message1:
    Field1: Int
    Field2: String
    Field3: Nested Struct
    Field4: Collection
    ….

SQL
  Table1: Field1 | Field2 | Field3 | …
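To make the mapping concrete, here is a hedged SQL sketch over a hypothetical Jitney event (the event name and fields are invented for illustration); nested Thrift structs surface as row-typed columns addressable with dot notation, as in the query on slide 10:

-- Hypothetical Thrift event:
--   SearchEvent { id: i64, query: string, context: struct { user_id: string, timestamp: i64 } }
-- surfaces to SQL roughly as:
--   search_event(id BIGINT, query VARCHAR, context ROW<user_id VARCHAR, `timestamp` BIGINT>)
SELECT id, query, context.user_id
FROM search_event
WHERE context.user_id IS NOT NULL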
9. Components of a pipeline
PROCESS
• Unit of logic
• SQL on structured messages
• Custom UDFs on unstructured messages
10. Components of a pipeline
SQL PROCESS ON STRUCTURED MESSAGES
Write SQL against the converted Thrift schema:

Thrift
  Message1:
    Field1: Int
    Field2: String
    Field3: Nested Struct
    Field4: Collection
    ….

SQL
  Table1: Field1 | Field2 | Field3 | …

SQL
  SELECT * FROM table1
  WHERE field1 = 2003
    AND field3.nestedval1 = 'booking' …
11. Components of a pipeline
PROCESS: CUSTOM USER DEFINED FUNCTION
• Useful for arbitrary logic not expressible through SQL
• Logic with side effects, e.g., a state machine implemented with external storage
• The source is usually kept unstructured (raw bytes) when processed by custom UDFs
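The deck does not show a UDF process block; as a hedged sketch, it might be declared like the SQL process in the sample configuration, with the SQL replaced by a class reference. The type tag and className key below are assumptions, not AirStream's actual schema:

process = [
  {
    name = sessionized_events,
    type = udf,                                    // assumed type tag for a custom process
    className = "com.example.SessionStateMachine", // hypothetical class holding the side-effecting logic
    input = topic1                                 // unstructured (binary) source
  }
]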
12. Components of a pipeline
SINKS
• Persist pipeline output
• Variety of sinks: Kafka, Jitney, HDFS, HTTP, Metrics (Datadog)
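Sinks are declared much like sources. Here is a hedged sketch of a Kafka sink block, reusing the key names from the sample configuration on slide 13; the exact sink-side keys are an assumption:

sink = [
  {
    type: kafka,            // could equally be jitney, hdfs, http, or datadog
    input: my_process_name, // output of an upstream process
    config: {
      topic: …,
      broker: …
    }
  }
]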
13. Putting it together
SAMPLE CONFIGURATION
config.checkpoint = false
source = [
{
name: topic1,
type: kafka,
config: {
topic: …,
broker: …,
serde: jitney,
eventClassName: "com.logging.…",
…
}
},
{
name: topic2,
type: kafka,
config: {
topic: …,
broker: …,
serde: jitney,
eventClassName: "com.logging…",
…
}
}
]
process = [
// event1
{
name = event1_data,
type = sql,
sql = """
SELECT ('TITLE:' || col1) as key,
( CAST(col2 AS VARCHAR) || CAST(':' AS VARCHAR) || col3) as field,
('"field4":"' || col4 || '","field5":' || CAST(col5 AS VARCHAR) ||
',"typeId":"' || typeCol || '","timestamp":' || CAST(`timestamp` AS VARCHAR))
as hash_value
FROM
( SELECT
context.some_id,
struct1.inner_struct1.context.some_id,
struct2.inner_struct2.some_type,
struct1.some_field,
context.`timestamp`
FROM topic1
WHERE struct1.inner_struct1.context.some_id = 'some_value'
AND context.some_id IS NOT NULL
) subq
""",
},
{
name = write_data_to_redis,
type = redis_update,
host = …,
port = …,
input = event1_data,
operation = "hset", // Redis operation. It can be a constant or a column
key = "key", // Redis key: column
value = "hash_value" // Redis value
field = "field"
expire = 1209600 // 2 weeks
},
]
sink = [
{
type: no-op,
input: write_data_to_redis
},
]
14. Putting it together
EVENT FLOW AT RUNTIME
• Event published into Kafka
• Event fetched by Flink pipeline
• Deserialize the Thrift-structured event into a SQL row (structured source), or retain the binary message (unstructured source)
• Execute SQL on the rows (SQL process) or run a custom UDF on the incoming message
• Outputs are sent to the next process and/or sink
16. Architecture
BENEFITS
• Ease and speed of pipeline development
• Reuse of sources and sinks
• SQL lowers the barrier to entry
• Shields users from the underlying infrastructure and its changes
• Extensible
17. Use cases
• Tracking user activity events
• Real-time feedback loops into products
• Fraud signal detection pipelines
• User device identity graph
• High frequency tracing data pipeline
19. Real-time merchandising profiler
• Gather signals from the user throughout their journey
• Real-time: immediately use that information to power the subsequent experience
• Categorization
• Personalization
Flow: Services → Kafka → AirStream (Flink) → real-time profile store
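A hedged sketch of a profiler process along this flow, assuming the profile store is the Redis sink from slide 13; the topic and field names are entirely invented:

process = [
  {
    name = profile_signal,
    type = sql,
    sql = """
      SELECT ('PROFILE:' || user_id) as key,        -- one profile-store entry per user
             signal_name as field,                  -- e.g. preferred market or trip type
             CAST(signal_value AS VARCHAR) as hash_value
      FROM search_events
      WHERE user_id IS NOT NULL
    """
  },
  {
    name = write_profile,
    type = redis_update,
    input = profile_signal,
    operation = "hset",
    key = "key",
    field = "field",
    value = "hash_value"
  }
]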
20. Future work
• Tooling to debug and troubleshoot issues
• Testability
• Expose more streaming features
21. Summary
WHAT WE ACHIEVED
• Lower barrier to entry by leveraging structured data and SQL
• Allow users to define pipelines through configuration
• Decouple pipelines from underlying physical infrastructure
• Extensibility that allows easy support for infrastructure changes