Computation convergence problems at LinkedIn, and how we auto-generate Beam API code from Pig scripts under the AORA (Author Once, Run Anywhere) principle.
Blog post:
https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite
Code example:
Pig script: https://gist.github.com/khaitranq/1d06c27832f15fa52a4a7e2fa7bec340
Beam autogen code: https://gist.github.com/khaitranq/785dbb8495cd382788f3ca8200231d8
4. Online, nearline, and offline computation
● Online processing (OLTP engines)
● Near real time processing (streaming engines)
● Offline processing (batch engines)
[Diagram labels: application servers, OLTP databases, messaging systems, HDFS; data flows: tracking events, DB changes, DB dumps]
5. Convergence problems
Online - offline
● Execute OLTP query logic in batch engines
Example
● Online query: compute the public profile of a LinkedIn member from the Profile table
● Batch computation: execute the same public-profile logic on all LinkedIn members
Online - nearline
● Execute OLTP query logic in streaming engines
Example
● Online query: compute the public profile of a LinkedIn member from the Profile table
● Streaming computation: incrementally compute public profiles on database changes captured from the Profile table
Offline - nearline
● Execute the logic of batch scripts in streaming engines
Example
● Batch scripts: scripts to compute metrics from raw tracking events
● Streaming computation: deliver the metrics with the same transformation logic as the batch scripts, at low latency
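The convergence idea above can be sketched as one piece of logic reused by three execution modes. This is only an illustration, not LinkedIn's code: the field names, `PUBLIC_FIELDS`, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Optional

Row = Dict[str, object]

# Hypothetical list of fields that are safe to expose publicly.
PUBLIC_FIELDS = ("member_id", "name", "headline")


def public_profile(row: Row) -> Row:
    """The shared logic: derive the public profile from one Profile row."""
    return {field: row[field] for field in PUBLIC_FIELDS}


def online_query(profile_table: Dict[int, Row], member_id: int) -> Row:
    """Online: apply the logic to a single member on request."""
    return public_profile(profile_table[member_id])


def batch_job(profile_table: Dict[int, Row]) -> Dict[int, Row]:
    """Offline: apply the same logic to every member in the table."""
    return {mid: public_profile(row) for mid, row in profile_table.items()}


def streaming_job(changes: Iterable[Row],
                  view: Optional[Dict[int, Row]] = None) -> Dict[int, Row]:
    """Nearline: apply the same logic incrementally to captured row changes."""
    view = {} if view is None else view
    for row in changes:
        view[row["member_id"]] = public_profile(row)
    return view
```

Replaying every change through `streaming_job` should end in the same view that `batch_job` computes from the final table, which is the property the convergence work aims to get without writing the logic twice.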
7. LinkedIn Unified Metrics Platform (UMP)
A platform for engineers and data scientists to define and onboard their metrics
[Diagram labels: Raw Tracking Data, Unified Metrics Platform, Site-facing Apps, Experimentation, Reporting]
8. Example - Metrics in reporting
Number of RPC calls to HDFS namenode by command types
9. The onboarding process
# code
LOAD …
# data transformation code
STORE …
# config
Metrics:
A = SUM(A’)
B = Unique(id)
Dimensions:
C, D
[Diagram labels: Define, Declare, Onboard; User Code; Platform Generated Code; Data; Metadata; User; To App; UMP; Raptor; Downstream apps]
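As a rough illustration of what a config like `A = SUM(A’)`, `B = Unique(id)` with dimensions `C, D` asks the platform to compute, here is a minimal sketch. The input column name `A_raw` (standing in for A’) and the function name are assumptions made for this example, not part of UMP.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def compute_metrics(rows: Iterable[dict]) -> Dict[Tuple[object, object], dict]:
    """Group rows by dimensions (C, D) and compute A = SUM(A_raw), B = Unique(id)."""
    sums = defaultdict(int)   # running SUM per dimension combination
    ids = defaultdict(set)    # distinct ids per dimension combination
    for row in rows:
        key = (row["C"], row["D"])
        sums[key] += row["A_raw"]
        ids[key].add(row["id"])
    return {key: {"A": sums[key], "B": len(ids[key])} for key in sums}
```

The user supplies only the transformation and the declaration; a sketch like this corresponds to the part the platform would generate.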
11. UMP offline computation flows
Latency at least 2-3 hours
[Diagram labels: Espresso, Oracle, MySQL; HDFS tables, Dali views; User code ... User code; Metric union; Dimension decoration; Cubing, Rollup; Pinot, Presto; Azkaban execution]
Espresso: LinkedIn distributed document store
Gobblin: LinkedIn universal data ingestion framework
Dali view: LinkedIn abstraction layer on top of HDFS
Azkaban: LinkedIn batch workflow job scheduler
Pinot: LinkedIn real-time OLAP engine
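The "Cubing, Rollup" step in the offline flow can be pictured with the small sketch below. It assumes an additive measure (a sum) and uses `"*"` as a hypothetical marker for "all values"; the function names are illustrative, and the actual UMP implementation is not shown in the slides.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

ALL = "*"  # hypothetical marker meaning "aggregated over every value"


def cube(rows: List[dict], dims: Sequence[str], measure: str) -> Dict[Tuple, int]:
    """Sum the measure for every subset of the dimensions."""
    out = defaultdict(int)
    for row in rows:
        for size in range(len(dims) + 1):
            for kept in combinations(dims, size):
                key = tuple(row[d] if d in kept else ALL for d in dims)
                out[key] += row[measure]
    return dict(out)


def rollup(rows: List[dict], dims: Sequence[str], measure: str) -> Dict[Tuple, int]:
    """Sum the measure for every prefix of the dimension list (a hierarchy)."""
    out = defaultdict(int)
    for row in rows:
        for size in range(len(dims) + 1):
            key = tuple(row[d] if i < size else ALL for i, d in enumerate(dims))
            out[key] += row[measure]
    return dict(out)
```

Non-additive metrics such as unique counts cannot be combined this way from per-cell results, so a real system would need a different strategy for them.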
12. What we want for nearline flows
[Diagram labels: User code ... User code; Metric union; Dimension decoration; Pinot; Samza jobs]
Samza: LinkedIn streaming engine
13. Latency is not the only requirement
• Low latency (~ minutes)
• Easy to onboard
• Easy to maintain
14. Putting things together
Lambda architecture with a single codebase
[Diagram labels: Metrics definition (code, config); UMP nearline platform (Samza jobs); UMP offline platform (Batch jobs); HDFS; Pinot; Raptor]
16. 10,000 feet view
[Diagram: User code ... User code, Metric union, Dimension decoration -> convert (Pig to Calcite) -> Calcite relational algebra as an IR -> optimize (Calcite to Beam) -> Beam physical plan -> generate (with streaming config) -> Beam Java API code]
Check out this blog post for details:
https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite
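The convert, optimize, generate stages can be pictured with a toy pipeline over a tiny intermediate representation. Everything here is hypothetical: the one-operator-per-line input is not Pig, `Rel` is not a Calcite class, and the single rewrite rule merely stands in for what a real optimizer does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rel:
    """A toy relational node: an operator, its argument, and one input."""
    op: str
    arg: str
    input: Optional["Rel"] = None


def convert(script: List[str]) -> Optional[Rel]:
    """Front end: turn lines such as 'filter a > 1' into a chain of Rel nodes."""
    plan = None
    for line in script:
        op, arg = line.split(" ", 1)
        plan = Rel(op, arg, plan)
    return plan


def optimize(plan: Optional[Rel]) -> Optional[Rel]:
    """One rewrite rule: fuse a filter that sits directly on another filter."""
    if plan is None:
        return None
    child = optimize(plan.input)
    if plan.op == "filter" and child is not None and child.op == "filter":
        return Rel("filter", f"({child.arg}) && ({plan.arg})", child.input)
    return Rel(plan.op, plan.arg, child)


def generate(plan: Rel) -> List[str]:
    """Back end: emit one line of pseudo-code per operator, inputs first."""
    lines = generate(plan.input) if plan.input is not None else []
    lines.append(f"{plan.op}({plan.arg})")
    return lines
```

Keeping the IR between the front end and the back end is what lets one script feed more than one target engine.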
17. Pig to Calcite
User scripts:
# code
LOAD …
LOAD ...
COGROUP
...
STORE …
Pig logical plan (from GruntParser): COGROUP over LOAD, LOAD
Calcite logical plans (relational algebra, from PigRelConverter): PROJECT, FULL OUTER JOIN, AGGREGATE, AGGREGATE, TABLE SCAN, TABLE SCAN
Code will be available in Calcite 1.21
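The slide's translation of COGROUP into AGGREGATE, FULL OUTER JOIN, and PROJECT can be checked on plain Python data. This is a sketch under assumptions: rows are dicts, bags are lists, and the final projection is assumed to turn a missing side into an empty bag; it does not reproduce the converter's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cogroup(left: List[dict], right: List[dict], key: str) -> Dict[object, Tuple[list, list]]:
    """Direct COGROUP semantics: per key, one bag of rows from each input."""
    keys = {r[key] for r in left} | {r[key] for r in right}
    return {k: ([r for r in left if r[key] == k],
                [r for r in right if r[key] == k]) for k in keys}


def aggregate(rows: List[dict], key: str) -> Dict[object, list]:
    """AGGREGATE: collect the rows of one input into a bag per key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return dict(bags)


def full_outer_join(a: Dict[object, list], b: Dict[object, list]) -> Dict[object, tuple]:
    """FULL OUTER JOIN on the key; a side with no match is None."""
    return {k: (a.get(k), b.get(k)) for k in a.keys() | b.keys()}


def project(joined: Dict[object, tuple]) -> Dict[object, Tuple[list, list]]:
    """PROJECT: replace a missing side with an empty bag (an assumption here)."""
    return {k: (l or [], r or []) for k, (l, r) in joined.items()}


def cogroup_relational(left: List[dict], right: List[dict], key: str):
    """The same result expressed with the relational operators."""
    return project(full_outer_join(aggregate(left, key), aggregate(right, key)))
```

Both functions should return the same mapping for any pair of inputs, which is the equivalence the plan rewrite relies on.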
18. Calcite to Beam
Planner/optimizer
• Calcite logical plan: what to do.
• Beam physical plan: how to do it.
• Calcite Beam planner: optimizes Calcite logical plans into Beam physical plans (using the Calcite Volcano optimizer)
Code generator
• Generates Beam Java API code from the Beam physical plan and the streaming config
Mappings:
• Beam physical nodes to Beam APIs.
• Relational expressions to Java implementation code
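The two mappings can be sketched as a template lookup per physical node plus a recursive translation of expressions into Java-like text. This is an illustration only: the node names `"filter"` and `"group"`, the tuple-based expression format, and the `rec.get(...)` accessor are hypothetical, and the emitted text is not the generator's real output. `Filter.by` and `GroupByKey.create()` are existing Beam Java transforms used here as plausible targets.

```python
import json

# Relational operators mapped to Java operators (small, illustrative subset).
JAVA_OPS = {"AND": "&&", "OR": "||", ">": ">", "<": "<", "+": "+"}

# Hypothetical physical node names mapped to Beam Java API call templates.
NODE_TEMPLATES = {
    "filter": ".apply(Filter.by(rec -> {expr}))",
    "group": ".apply(GroupByKey.create())",
}


def rex_to_java(expr) -> str:
    """Translate ('field', name), ('lit', value), or (op, lhs, rhs) to Java-like text."""
    kind = expr[0]
    if kind == "field":
        return f'rec.get("{expr[1]}")'  # hypothetical record accessor
    if kind == "lit":
        return json.dumps(expr[1])      # numbers as-is, strings double-quoted
    _, lhs, rhs = expr
    return f"({rex_to_java(lhs)} {JAVA_OPS[kind]} {rex_to_java(rhs)})"


def generate(plan) -> str:
    """Emit a chained pipeline expression from a list of (node, expr) pairs."""
    lines = ["input"]
    for node, expr in plan:
        code = rex_to_java(expr) if expr is not None else ""
        lines.append(NODE_TEMPLATES[node].format(expr=code))
    return "\n    ".join(lines) + ";"
```

Separating node templates from expression translation mirrors the two mappings listed on the slide, so each can grow independently.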