AirStream is a real-time stream computation framework that supports Flink as one of its processing engines. It allows engineers and data scientists at Airbnb to easily leverage Flink to build real-time data pipelines and feedback loops, and multiple mission-critical applications have been built on top of it. In this talk, we start with an overview of AirStream and describe how we have designed it to leverage Flink's SQL support so that users can easily build real-time data pipelines. We go over a few production use cases, such as building a user activity profiler and building user identity mapping in real time. We also cover how we have integrated AirStream into the data infrastructure ecosystem at Airbnb through easily configurable connectors, such as Kafka and Hive, that let users leverage these components in their pipelines.
4. What is AirStream
• A framework to define and execute data pipelines
• Pipelines created by stitching together building blocks
• Pipelines defined through configuration
• Philosophy: Make simple things easy, complex things possible
5. Why Flink
• Low-latency streaming
• Full SQL support
• Stability: battle-tested
• Adoption within the industry
6. Components of a pipeline
• Source
• Process
• Sink
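A pipeline configuration stitches these three blocks together. Below is a minimal, hypothetical sketch in the style of the full sample on slide 13; the names and values are placeholders, and the kafka sink keys are an assumption mirrored from the source block:

source = [
  { name: events, type: kafka, config: { topic: …, broker: …, serde: jitney } }
]
process = [
  { name = filtered_events, type = sql,
    sql = "SELECT * FROM events WHERE field1 IS NOT NULL" }
]
sink = [
  { type: kafka, input: filtered_events }
]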
7. Components of a pipeline
SOURCES
• Structured and unstructured (e.g., Jitney)
• Static and dynamic data sources (Kafka, HDFS, etc.)
8. Components of a pipeline
JITNEY SOURCE
• Structured and versioned messages
• The majority of events in Airbnb's online services are published as Jitney messages
• Enables SQL: the Thrift schema is translated into a SQL schema
Thrift
  Message1:
    Field1: Int
    Field2: String
    Field3: Nested Struct
    Field4: Collection
    ….

SQL
  Table1: Field1 | Field2 | Field3 | …
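To make the mapping concrete, here is a hedged SQL sketch over a hypothetical Jitney event (the event name and fields are invented for illustration); nested Thrift structs surface as row-typed columns addressable with dot notation, as in the query on slide 10:

-- Hypothetical Thrift event:
--   SearchEvent { id: i64, query: string, context: struct { user_id: string, timestamp: i64 } }
-- surfaces to SQL roughly as:
--   search_event(id BIGINT, query VARCHAR, context ROW<user_id VARCHAR, `timestamp` BIGINT>)
SELECT id, query, context.user_id
FROM search_event
WHERE context.user_id IS NOT NULL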
9. Components of a pipeline
PROCESS
• Unit of logic
• SQL on structured messages
• Custom UDFs on unstructured messages
10. Components of a pipeline
SQL PROCESS ON STRUCTURED MESSAGES
Write SQL against the converted Thrift schema:

Thrift
  Message1:
    Field1: Int
    Field2: String
    Field3: Nested Struct
    Field4: Collection
    ….

SQL
  Table1: Field1 | Field2 | Field3 | …

SQL
  SELECT * FROM table1
  WHERE field1 = 2003
    AND field3.nestedval1 = 'booking' …
11. Components of a pipeline
PROCESS: CUSTOM USER DEFINED FUNCTION
• Useful for arbitrary logic not expressible through SQL
• Logic with side effects, e.g., a state machine implemented with external storage
• The source is usually kept unstructured (raw bytes) when processed by custom UDFs
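The deck does not show a UDF process block; as a hedged sketch, it might be declared like the SQL process in the sample configuration, with the SQL replaced by a class reference. The type tag and className key below are assumptions, not AirStream's actual schema:

process = [
  {
    name = sessionized_events,
    type = udf,                                    // assumed type tag for a custom process
    className = "com.example.SessionStateMachine", // hypothetical class holding the side-effecting logic
    input = topic1                                 // unstructured (binary) source
  }
]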
12. Components of a pipeline
SINKS
• Persist pipeline output
• Variety of sinks: Kafka, Jitney, HDFS, HTTP, Metrics (Datadog)
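Sinks are declared much like sources. Here is a hedged sketch of a Kafka sink block, reusing the key names from the sample configuration on slide 13; the exact sink-side keys are an assumption:

sink = [
  {
    type: kafka,            // could equally be jitney, hdfs, http, or datadog
    input: my_process_name, // output of an upstream process
    config: {
      topic: …,
      broker: …
    }
  }
]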
13. Putting it together
SAMPLE CONFIGURATION
config.checkpoint = false
source = [
{
name: topic1,
type: kafka,
config: {
topic: …,
broker: …,
serde: jitney,
eventClassName: "com.logging.…",
…
}
},
{
name: topic2,
type: kafka,
config: {
topic: …,
broker: …,
serde: jitney,
eventClassName: "com.logging…",
…
}
}
]
process = [
// event1
{
name = event1_data,
type = sql,
sql = """
SELECT ('TITLE:' || col1) as key,
( CAST(col2 AS VARCHAR) || CAST(':' AS VARCHAR) || col3) as field,
('"field4":"' || col4 || '","field5":' || CAST(col5 AS VARCHAR) ||
',"typeId":"' || typeCol || '","timestamp":' || CAST(`timestamp` AS VARCHAR))
as hash_value
FROM
( SELECT
context.some_id,
struct1.inner_struct1.context.some_id,
struct2.inner_struct2.some_type,
struct1.some_field,
context.`timestamp`
FROM topic1
WHERE struct1.inner_struct1.context.some_id = 'some_value'
AND context.some_id IS NOT NULL
) subq
""",
},
{
name = write_data_to_redis,
type = redis_update,
host = …,
port = …,
input = event1_data,
operation = "hset", // Redis operation. It can be a constant or a column
key = "key", // Redis key: column
value = "hash_value" // Redis value
field = "field"
expire = 1209600 // 2 weeks
},
]
sink = [
{
type: no-op,
input: write_data_to_redis
},
]
14. Putting it together
EVENT FLOW AT RUNTIME
• Event published into Kafka
• Event fetched by Flink pipeline
• Deserialize the Thrift-structured event into a SQL row (structured source), or retain the binary message (unstructured source)
• Execute SQL on the rows (SQL process) or run a custom UDF on the incoming message
• Outputs are sent to the next process and/or sink
16. Architecture
BENEFITS
• Ease and speed of pipeline development
• Reuse of sources and sinks
• SQL lowers the barrier to entry
• Shields users from the underlying infrastructure and its changes
• Extensible
17. Use cases
• Tracking user activity events
• Real-time feedback loops into products
• Fraud signal detection pipelines
• User device identity graph
• High frequency tracing data pipeline
19. Real-time merchandising profiler
• Gather signals from the user throughout their journey
• Real-time: immediately use that information to power the subsequent experience
• Categorization
• Personalization
Flow: Services → Kafka → AirStream (Flink) → real-time profile store
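A hedged sketch of a profiler process along this flow, assuming the profile store is the Redis sink from slide 13; the topic and field names are entirely invented:

process = [
  {
    name = profile_signal,
    type = sql,
    sql = """
      SELECT ('PROFILE:' || user_id) as key,        -- one profile-store entry per user
             signal_name as field,                  -- e.g. preferred market or trip type
             CAST(signal_value AS VARCHAR) as hash_value
      FROM search_events
      WHERE user_id IS NOT NULL
    """
  },
  {
    name = write_profile,
    type = redis_update,
    input = profile_signal,
    operation = "hset",
    key = "key",
    field = "field",
    value = "hash_value"
  }
]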
20. Future work
• Tooling to debug and troubleshoot issues
• Testability
• Expose more streaming features
21. Summary
WHAT WE ACHIEVED
• Lower barrier to entry by leveraging structured data and SQL
• Allow users to define pipelines through configuration
• Decouple pipelines from underlying physical infrastructure
• Extensibility that allows easy support for infrastructure changes