Unifying Batch and Stream Data Processing with Apache Calcite and Apache Beam
Khai Tran
Big Data @LinkedIn
Agenda
1. Computation convergence problems
2. LinkedIn metrics platform
3. From offline to nearline
4. Deep dive
Computation convergence problems
Online, nearline, and offline computation
[Diagram components]
● Processing tiers: Online Processing (OLTP Engines), Near Real Time Processing (Streaming Engines), Offline Processing (Batch Engines)
● Systems: Application servers, OLTP databases, Messaging Systems, HDFS
● Data flows: Tracking events, DB changes, DB dumps
Convergence problems
Online - offline
● Execute OLTP query logic in batch engines
Example
● Online query: compute the public profile of a LinkedIn member from the Profile table
● Batch computation: execute the same public-profile logic on all LinkedIn members
Online - nearline
● Execute OLTP query logic in streaming engines
Example
● Online query: compute the public profile of a LinkedIn member from the Profile table
● Streaming computation: incrementally compute public profiles from database changes captured from the Profile table
Offline - nearline
● Execute the logic of batch scripts in streaming engines
Example
● Batch scripts: scripts that compute metrics from raw tracking events
● Streaming computation: deliver the metrics at low latency, with the same transformation logic as the batch scripts
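To make the offline - nearline case concrete, here is a minimal Python sketch, not LinkedIn's code. The `transform` function and the event fields (`page`, `views`) are assumptions made up for illustration. The point is that one shared transformation can drive both a batch pass over all events and an incremental update per event, and both arrive at the same metrics.

```python
from collections import defaultdict


def transform(event):
    """Shared transformation logic (hypothetical): raw event -> (key, value)."""
    return event["page"], event["views"]


def batch_metrics(events):
    """Offline style: process the full data set in one pass."""
    totals = defaultdict(int)
    for event in events:
        key, value = transform(event)
        totals[key] += value
    return dict(totals)


class StreamingMetrics:
    """Nearline style: keep running totals and update them per incoming event."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, event):
        key, value = transform(event)
        self.totals[key] += value
        return key, self.totals[key]


# Made-up sample events.
events = [
    {"page": "home", "views": 2},
    {"page": "jobs", "views": 1},
    {"page": "home", "views": 3},
]

stream = StreamingMetrics()
for e in events:
    stream.on_event(e)

batch_result = batch_metrics(events)
stream_result = dict(stream.totals)
```

Because both paths call the same `transform`, a change to the metric logic only has to be made once.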
Metrics platform at LinkedIn
LinkedIn Unified Metrics Platform (UMP)
A platform for engineers and data scientists to define and onboard their metrics
[Diagram: Raw Tracking Data, Unified Metrics Platform, and three consumers: Site-facing Apps, Experimentation, Reporting]
Example - Metrics in reporting
[Chart: number of RPC calls to the HDFS NameNode by command type]
The onboarding process
Steps: Define, Declare, Onboard
● User code (script):
  # code
  LOAD …
  # data transformation code
  STORE …
● Config:
  Metrics: A = SUM(A’), B = Unique(id)
  Dimensions: C, D
● UMP: platform-generated code; data and metadata delivered to downstream apps (Raptor)
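The declarative config on the onboarding slide (metrics A = SUM(A’) and B = Unique(id), dimensions C and D) can be read as a group-by specification. The sketch below is one possible interpretation in Python, not the UMP config format: the `CONFIG` layout, the column name `A_prime`, and the sample rows are all assumptions.

```python
from collections import defaultdict

# Hypothetical config mirroring the slide: two metrics, two dimensions.
CONFIG = {
    "dimensions": ["C", "D"],
    "metrics": {
        "A": ("sum", "A_prime"),  # A = SUM(A')
        "B": ("unique", "id"),    # B = Unique(id)
    },
}


def compute_metrics(rows, config):
    """Group rows by the configured dimensions and evaluate each metric per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in config["dimensions"])
        groups[key].append(row)

    result = {}
    for key, members in groups.items():
        values = {}
        for name, (kind, column) in config["metrics"].items():
            if kind == "sum":
                values[name] = sum(r[column] for r in members)
            elif kind == "unique":
                values[name] = len({r[column] for r in members})
            else:
                raise ValueError(f"unsupported metric kind: {kind}")
        result[key] = values
    return result


# Made-up sample rows.
rows = [
    {"id": 1, "A_prime": 10, "C": "us", "D": "web"},
    {"id": 1, "A_prime": 5, "C": "us", "D": "web"},
    {"id": 2, "A_prime": 7, "C": "us", "D": "web"},
    {"id": 3, "A_prime": 4, "C": "eu", "D": "app"},
]
metrics = compute_metrics(rows, CONFIG)
```

Keeping the metric definitions in data rather than code is what lets a platform generate the surrounding aggregation logic itself.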
Moving from offline to nearline
UMP offline computation flows
Latency: at least 2-3 hours
[Diagram: Azkaban execution]
● Sources: Espresso, Oracle, MySQL
● Inputs: HDFS tables, Dali views
● Processing: user code, metric union, dimension decoration, cubing and rollup
● Serving: Pinot, Presto
Espresso: LinkedIn distributed document store
Gobblin: LinkedIn universal data ingestion framework
Dali view: LinkedIn abstraction layer on top of HDFS
Azkaban: LinkedIn batch workflow job scheduler
Pinot: LinkedIn real-time OLAP engine
What we want for nearline flows
[Diagram: Samza jobs]
● Processing: user code, metric union, dimension decoration
● Serving: Pinot
Samza: LinkedIn streaming engine
Latency is not the only requirement
• Low latency (~ minutes)
• Easy to onboard
• Easy to maintain
Putting things together
Lambda architecture with a single codebase
[Diagram components]
● Metrics definition: code, config
● UMP offline platform: batch jobs
● UMP nearline platform: Samza jobs
● Data stores: HDFS, Pinot
● Downstream app: Raptor
Deep dive on offline-nearline conversion
10,000-foot view
● Convert (Pig to Calcite): user code, metric union, and dimension decoration are converted to Calcite relational algebra, used as an IR
● Optimize (Calcite to Beam): the Calcite plan is optimized into a Beam physical plan
● Generate: Beam Java API code is generated from the Beam physical plan and the streaming config
Check out this blog post for details:
https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite
Pig to Calcite
● User scripts (LOAD …, LOAD ..., COGROUP ..., STORE …) are parsed by GruntParser into a Pig logical plan (LOAD, LOAD, COGROUP).
● PigRelConverter converts the Pig logical plan into Calcite logical plans (relational algebra): a PROJECT over a FULL OUTER JOIN of two AGGREGATE nodes, each over a TABLE SCAN.
Code will be available in Calcite 1.21
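The plan shape on the Pig to Calcite slide suggests how a COGROUP can be expressed in relational terms: aggregate each input into a bag per key, full-outer-join the two results on the key, then project. The Python sketch below illustrates that equivalence on in-memory lists. It is not the Calcite implementation; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def aggregate(rows, key_index):
    """AGGREGATE step: collect the rows of one input into a bag per key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)


def full_outer_join(left, right):
    """FULL OUTER JOIN step: pair bags by key; a missing side is None."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]


def project(joined):
    """PROJECT step: emit (key, left_bag, right_bag), with None replaced by an empty bag."""
    return [(k, l or [], r or []) for k, l, r in joined]


def cogroup(a, b, a_key=0, b_key=0):
    """COGROUP expressed as PROJECT(FULL OUTER JOIN(AGGREGATE(a), AGGREGATE(b)))."""
    return project(full_outer_join(aggregate(a, a_key), aggregate(b, b_key)))


# Made-up sample relations keyed by member name.
views = [("alice", "home"), ("bob", "jobs"), ("alice", "feed")]
clicks = [("alice", "ad1"), ("carol", "ad2")]
result = cogroup(views, clicks)
```

Keys that appear in only one input still produce a row, with an empty bag on the other side, which is why the join has to be a full outer join rather than an inner join.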
Calcite to Beam
Planner/optimizer
• Calcite logical plan: what to do.
• Beam physical plan: how to do it.
• Calcite Beam planner: optimizes Calcite logical plans into Beam physical plans (using the Calcite Volcano optimizer)
Code generator
• Generates Beam Java API code from the Beam physical plan and the streaming config
Mappings:
• Beam physical nodes to Beam APIs.
• Relational expressions to Java implementation code
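The two stages described for Calcite to Beam, a planner that turns logical nodes into physical nodes and a code generator that maps physical nodes to API calls, can be sketched with a toy plan tree. This Python sketch makes several assumptions: the `RULES` table is a fixed one-to-one mapping rather than a cost-based Volcano search, the `TEMPLATES` emit pseudo-code strings rather than real Beam Java API calls, and only single-input operators are handled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A plan node: an operator name, an optional argument, and child nodes."""
    op: str
    arg: str = ""
    inputs: List["Node"] = field(default_factory=list)


# Hypothetical conversion rules: logical operator -> physical operator.
RULES = {
    "TableScan": "InputStream",
    "Filter": "BeamFilter",
    "Project": "BeamProject",
}

# Hypothetical templates: physical operator -> one generated pseudo-code statement.
TEMPLATES = {
    "InputStream": "{out} = read_stream('{arg}')",
    "BeamFilter": "{out} = filter({src}, '{arg}')",
    "BeamProject": "{out} = project({src}, '{arg}')",
}


def to_physical(node):
    """Planner step: rewrite a logical plan bottom-up into a physical plan."""
    if node.op not in RULES:
        raise ValueError(f"no rule for logical operator {node.op}")
    return Node(RULES[node.op], node.arg, [to_physical(i) for i in node.inputs])


def generate(plan):
    """Code generator step: emit one statement per physical node, inputs first."""
    lines = []

    def emit(node):
        sources = [emit(child) for child in node.inputs]
        out = f"v{len(lines)}"
        src = sources[0] if sources else ""
        lines.append(TEMPLATES[node.op].format(out=out, src=src, arg=node.arg))
        return out

    emit(plan)
    return lines


# Made-up logical plan: Project over Filter over TableScan.
logical = Node("Project", "page, views",
               [Node("Filter", "views > 0", [Node("TableScan", "events")])])
physical = to_physical(logical)
code = generate(physical)
```

Separating the two stages means the planner can change how a plan is executed without touching the templates that turn each physical node into code.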
Example - Pig script
Example - Calcite logical plan
[Plan tree nodes: Project, Aggregate, Project, Inner Join, Filter (x2), Project (x2), Table Scan (x2)]
Example - Calcite Beam Planner
[Diagram: the Calcite Beam planner converts the Calcite logical plan into a Beam physical plan]
● Calcite logical plan nodes: Project, Aggregate, Project, Inner Join, Filter (x2), Project (x2), Table Scan (x2)
● Beam physical plan nodes: Beam Project, Beam Aggregate, Beam Project, Stream Stream SelfJoin, Beam Project (x2), Beam Filter (x2), Input Stream
Example - Beam autogen code for LOAD
[Side-by-side: original Pig script, Beam API code]
Details:
● Pig script: https://gist.github.com/khaitranq/1d06c27832f15fa52a4a7e2fa7bec340
● Beam code: https://gist.github.com/khaitranq/785dbb8495cd382788f3ca8200231d84
Example - Beam autogen code for FILTER
[Side-by-side: original Pig script, Beam API code]
Example - Beam autogen code for FOREACH
[Side-by-side: original Pig script, Beam API code]
Example - Beam autogen code for JOIN (two slides)
[Side-by-side: original Pig script, Beam API code]
Example - Beam autogen code for GROUP BY (two slides)
[Side-by-side: original Pig script, Beam API code; the Beam code is annotated with three steps: Initialize, Aggregate, Return]
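The three steps annotated on the GROUP BY example (Initialize, Aggregate, Return) describe an accumulator lifecycle. The Python sketch below shows that lifecycle with a hypothetical `CountAndSum` aggregator and a small `group_by` driver. The generated Beam code itself is not captured in this transcript, so this is an illustration of the pattern, not a reproduction of it.

```python
class CountAndSum:
    """Hypothetical aggregator following the three steps named on the slide."""

    def initialize(self):
        # Initialize: start from an empty accumulator.
        return {"count": 0, "total": 0}

    def aggregate(self, acc, value):
        # Aggregate: fold one input value into the accumulator.
        acc["count"] += 1
        acc["total"] += value
        return acc

    def result(self, acc):
        # Return: turn the accumulator into the output record.
        return acc["count"], acc["total"]


def group_by(rows, key_fn, value_fn, aggregator):
    """Keep one accumulator per key and run the three-step lifecycle on it."""
    accumulators = {}
    for row in rows:
        key = key_fn(row)
        acc = accumulators.get(key)
        if acc is None:
            acc = aggregator.initialize()
        accumulators[key] = aggregator.aggregate(acc, value_fn(row))
    return {k: aggregator.result(a) for k, a in accumulators.items()}


# Made-up sample rows of (page, views).
rows = [("home", 2), ("jobs", 1), ("home", 3)]
grouped = group_by(rows, lambda r: r[0], lambda r: r[1], CountAndSum())
```

Beam's Java `Combine.CombineFn` follows a comparable accumulator lifecycle (create, add input, merge, extract output); whether the generated code uses it is not shown here.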
Example - Beam autogen code for STORE
[Side-by-side: original Pig script, Beam API code]
Example - Beam autogen code for Pig UDFs
[Side-by-side: original Pig script, Beam API code; the Beam code is annotated with three steps: Declare, Init, Use]
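The Declare, Init, Use annotations on the Pig UDF example point to a common pattern for calling a user-defined function from generated code. The Python sketch below illustrates it under assumptions: the `Upper` UDF, the `GeneratedTransform` class, and its `setup` and `process` methods are hypothetical names, and the real generated code is Beam Java that is not captured in this transcript.

```python
class Upper:
    """Hypothetical UDF: upper-cases a string value."""

    def exec(self, value):
        return value.upper()


class GeneratedTransform:
    # Declare: the generated class holds a slot for the UDF instance.
    udf = None

    def setup(self):
        # Init: instantiate the UDF once, before any record is processed.
        self.udf = Upper()

    def process(self, record):
        # Use: call the UDF for each record.
        return {**record, "name": self.udf.exec(record["name"])}


transform = GeneratedTransform()
transform.setup()
out = [transform.process(r) for r in [{"name": "ann"}, {"name": "bob"}]]
```

Creating the UDF once in a setup step, rather than per record, avoids repeated construction cost when the UDF is expensive to initialize.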
Convergences at LinkedIn (1)
[Diagram: Offline computation, Online computation, and Nearline computation around a shared Intermediate representation]
Convergences at LinkedIn (2)
Implemented
• Pig - Calcite - Beam (on Samza)
• Hive - Calcite - Presto
• Hive - Calcite - Spark
• GraphQL - Calcite - Spark
• Spark - Calcite - Beam (on Samza)
Considering
• Hive - Calcite - Pig
• Pig - Calcite - Spark
• GraphQL - Calcite - Beam (on Samza)
AORA principle:
Author Once, Run Anywhere
Thank you

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calcite and Apache Beam

  • 1. Unifying Batch and Stream Data Processing with Apache Calcite and Apache Beam. Khai Tran, Big Data @LinkedIn
  • 4. Online, nearline, and offline computation. Diagram labels: Application servers; Tracking events; DB changes; DB dumps; Messaging Systems; OLTP databases; HDFS; Online Processing (OLTP Engines); Near Real Time Processing (Streaming Engines); Offline Processing (Batch Engines)
  • 5. Convergence problems
    Online - offline
    ● Execute OLTP query logic in batch engines
    ● Example online query: compute the public profile of a LinkedIn member from the Profile table
    ● Example batch computation: execute the same public-profile logic for all LinkedIn members
    Online - nearline
    ● Execute OLTP query logic in streaming engines
    ● Example online query: compute the public profile of a LinkedIn member from the Profile table
    ● Example streaming computation: incrementally compute public profiles from database changes captured from the Profile table
    Offline - nearline
    ● Execute the logic of batch scripts in streaming engines
    ● Example batch scripts: scripts that compute metrics from raw tracking events
    ● Example streaming computation: deliver the metrics at low latency with the same transformation logic as the batch scripts
  • 6. Metrics platform at LinkedIn
  • 7. LinkedIn Unified Metrics Platform (UMP): a platform for engineers and data scientists to define and onboard their metrics. Diagram labels: Raw Tracking Data; Unified Metrics Platform; Site-facing Apps; Experimentation; Reporting
  • 8. Example - Metrics in reporting: number of RPC calls to the HDFS namenode by command type
  • 9. The onboarding process
    ● User code (Define): # code: LOAD … # transformation … STORE …
    ● Config (Declare): Metrics: A = SUM(A’), B = Unique(id); Dimensions: C, D
    ● Onboard: the user submits code and config to Raptor; UMP runs platform-generated code and delivers data and metadata to downstream apps
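The declarative half of onboarding (metrics `A = SUM(A’)`, `B = Unique(id)` over dimensions `C`, `D`) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `CONFIG` layout, the column name `A_prime`, and the function `compute_metrics` are hypothetical and are not UMP's actual config format or generated code.

```python
from collections import defaultdict

# Hypothetical declarative config, modeled on the slide's
# "A = SUM(A'), B = Unique(id)" with dimensions C and D.
CONFIG = {
    "dimensions": ["C", "D"],
    "metrics": {
        "A": ("sum", "A_prime"),
        "B": ("unique", "id"),
    },
}


def compute_metrics(rows, config):
    """Group rows by the configured dimensions and evaluate each metric."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in config["dimensions"])
        groups[key].append(row)

    result = {}
    for key, members in groups.items():
        out = {}
        for name, (kind, column) in config["metrics"].items():
            values = [m[column] for m in members]
            if kind == "sum":
                out[name] = sum(values)
            elif kind == "unique":
                out[name] = len(set(values))  # distinct count
            else:
                raise ValueError(f"unsupported metric kind: {kind}")
        result[key] = out
    return result


# Made-up sample rows, standing in for the output of the user's transformation code.
ROWS = [
    {"C": "us", "D": "web", "A_prime": 2, "id": 1},
    {"C": "us", "D": "web", "A_prime": 3, "id": 1},
    {"C": "us", "D": "app", "A_prime": 5, "id": 2},
]
```

Calling `compute_metrics(ROWS, CONFIG)` returns one entry per `(C, D)` pair, which is the kind of aggregation a platform could generate from the declared config rather than asking each user to hand-write it.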
  • 10. Moving from offline to nearline
  • 11. UMP offline computation flows. Latency: at least 2-3 hours. Diagram labels: Espresso, Oracle, MySQL; HDFS tables, Dali views; User code (one per metric); Metric union; Dimension decoration; Cubing, Rollup; Pinot, Presto; Azkaban execution
    Glossary:
    ● Espresso: LinkedIn distributed document store
    ● Gobblin: LinkedIn universal data ingestion framework
    ● Dali view: LinkedIn abstraction layer on top of HDFS
    ● Azkaban: LinkedIn batch workflow job scheduler
    ● Pinot: LinkedIn real-time OLAP engine
  • 12. What we want for nearline flows. Diagram labels: User code (one per metric); Metric union; Dimension decoration; Pinot; Samza jobs. Samza: LinkedIn streaming engine
  • 13. Latency is not the only requirement
    ● Low latency (~ minutes)
    ● Easy to onboard
    ● Easy to maintain
  • 14. Putting things together: Lambda architecture with a single codebase. Diagram labels: Raptor; Metrics definition (code, config); UMP nearline platform (Samza jobs); UMP offline platform (Batch jobs); HDFS; Pinot
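The idea of a Lambda architecture with a single codebase can be sketched in Python as one shared transformation driven by two different runners. This is a toy model under stated assumptions: `transform`, `run_batch`, and `StreamingJob` are hypothetical names, the event fields are made up, and none of this is Samza, Beam, or UMP code. The metric loosely follows the reporting example (RPC calls by command type).

```python
from collections import Counter


def transform(event):
    """Shared metric logic: map a raw event to (key, increment), or None to drop it."""
    if event.get("type") != "rpc":
        return None
    return (event["command"], 1)


def run_batch(events):
    """Offline path: apply the shared logic to a complete, bounded data set."""
    counts = Counter()
    for event in events:
        out = transform(event)
        if out is not None:
            key, inc = out
            counts[key] += inc
    return dict(counts)


class StreamingJob:
    """Nearline path: apply the same logic one event at a time, keeping running state."""

    def __init__(self):
        self.counts = Counter()

    def process(self, event):
        out = transform(event)
        if out is not None:
            key, inc = out
            self.counts[key] += inc
        return dict(self.counts)  # current snapshot of the metric
```

Because both runners call the same `transform`, the streaming snapshot after the last event matches the batch result over the same events, which is the property a single metrics definition is meant to preserve.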
  • 15. Deep dive on offline-nearline conversion
  • 16. 10,000-foot view. The offline flow (user code, metric union, dimension decoration) is converted (Pig to Calcite) into Calcite relational algebra, which serves as an IR; the IR is optimized (Calcite to Beam) into a Beam physical plan; Beam Java API code is then generated from the physical plan and the streaming config. Check out this blog post for details: https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite
  • 17. Pig to Calcite
    ● User scripts: # code: LOAD …, LOAD ..., COGROUP ..., STORE …
    ● GruntParser produces the Pig logical plan: COGROUP over two LOADs
    ● PigRelConverter produces Calcite logical plans (relational algebra): PROJECT over a FULL OUTER JOIN of two AGGREGATEs, each over a TABLE SCAN
    ● Code will be available in Calcite 21
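Why a COGROUP can be expressed as a PROJECT over a FULL OUTER JOIN of two AGGREGATEs is easier to see on a small data set. The Python sketch below models the relational operators with plain dictionaries; all function names are hypothetical, and it does not use or reproduce PigRelConverter or any Calcite API.

```python
from collections import defaultdict


def cogroup(left, right, key):
    """Pig-style COGROUP: one output entry per key, with a bag of rows from each input."""
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[key]][0].append(row)
    for row in right:
        bags[row[key]][1].append(row)
    return dict(bags)


def aggregate_collect(rows, key):
    """AGGREGATE: group by key and collect each group's rows into a bag."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def full_outer_join(left_groups, right_groups):
    """FULL OUTER JOIN on the key; a side with no match is None."""
    keys = left_groups.keys() | right_groups.keys()
    return {k: (left_groups.get(k), right_groups.get(k)) for k in keys}


def project_empty_bags(joined):
    """PROJECT: replace a missing side with an empty bag, matching COGROUP output."""
    return {k: (l or [], r or []) for k, (l, r) in joined.items()}


def cogroup_via_join(left, right, key):
    """The relational form: PROJECT(FULL OUTER JOIN(AGGREGATE(left), AGGREGATE(right)))."""
    joined = full_outer_join(aggregate_collect(left, key), aggregate_collect(right, key))
    return project_empty_bags(joined)
```

For any two inputs, `cogroup(a, b, "k")` and `cogroup_via_join(a, b, "k")` produce the same mapping from key to a pair of bags, including keys that appear on only one side.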
  • 18. Calcite to Beam
    Planner/optimizer
    ● Calcite logical plan: what to do.
    ● Beam physical plan: how to do it.
    ● Calcite Beam planner: optimizes Calcite logical plans into Beam physical plans (using the Calcite Volcano optimizer)
    Code generator
    ● Generates Beam Java API code from the Beam physical plan and the streaming config
    Mappings:
    ● Beam physical nodes to Beam APIs.
    ● Relational expressions to Java implementation code
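The two stages described above (logical plan to physical plan, then physical plan to code) can be illustrated with a deliberately simplified Python sketch. It swaps the cost-based Volcano optimizer for a plain one-to-one rule table, and it emits pseudo-statements rather than Beam Java API code. `Node`, `RULES`, `to_physical`, and `generate_code` are hypothetical names, not Calcite or Beam APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A plan node: an operator name plus its input nodes."""
    op: str
    inputs: List["Node"] = field(default_factory=list)


# Hypothetical one-to-one rules from logical operators to physical operators.
# A real cost-based planner would weigh alternative physical plans instead.
RULES = {
    "TableScan": "InputStream",
    "Filter": "BeamFilter",
    "Project": "BeamProject",
    "Aggregate": "BeamAggregate",
    "InnerJoin": "StreamStreamJoin",
}


def to_physical(node):
    """Planner stage: rewrite a logical plan bottom-up into a physical plan."""
    if node.op not in RULES:
        raise ValueError(f"no rule for logical operator: {node.op}")
    return Node(RULES[node.op], [to_physical(child) for child in node.inputs])


def generate_code(root):
    """Code generator stage: emit one pseudo-statement per node, inputs first."""
    lines = []

    def visit(node):
        args = [visit(child) for child in node.inputs]
        name = f"p{len(lines)}"
        lines.append(f"{name} = {node.op}({', '.join(args)})")
        return name

    visit(root)
    return lines
```

For example, `generate_code(to_physical(Node("Project", [Node("Filter", [Node("TableScan")])])))` yields three statements, one per physical node, with each statement referring to the variables produced by its inputs.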
  • 19. Example - Pig script
  • 20. Example - Calcite logical plan
  • 21. Example - Calcite logical plan. Diagram nodes: Project; Aggregate; Project; Inner Join; Filter, Filter; Project, Project; Table Scan, Table Scan
  • 22. Example - Calcite Beam Planner. The Calcite Beam planner turns the Calcite logical plan into a Beam physical plan.
    ● Calcite logical plan nodes: Project; Aggregate; Project; Inner Join; Filter, Filter; Project, Project; Table Scan, Table Scan
    ● Beam physical plan nodes: Beam Project; Beam Aggregate; Beam Project; Stream-Stream SelfJoin; Beam Filter, Beam Filter; Beam Project, Beam Project; Input Stream
  • 23. Example - Beam autogen code for LOAD (original Pig script alongside Beam API code). Details:
    ● Pig script: https://gist.github.com/khaitranq/1d06c27832f15fa52a4a7e2fa7bec340
    ● Beam code: https://gist.github.com/khaitranq/785dbb8495cd382788f3ca8200231d84
  • 24. Example - Beam autogen code for FILTER (original Pig script alongside Beam API code)
  • 25. Example - Beam autogen code for FOREACH (original Pig script alongside Beam API code)
  • 26. Example - Beam autogen code for JOIN (original Pig script alongside Beam API code)
  • 27. Example - Beam autogen code for JOIN (original Pig script alongside Beam API code)
  • 28. Example - Beam autogen code for GROUP BY (original Pig script alongside Beam API code)
  • 29. Example - Beam autogen code for GROUP BY (original Pig script alongside Beam API code). Annotated steps: Initialize, Aggregate, Return
  • 30. Example - Beam autogen code for STORE (original Pig script alongside Beam API code)
  • 31. Example - Beam autogen code for Pig UDFs (original Pig script alongside Beam API code). Annotated steps: Declare, Init, Use
  • 32. Convergences at LinkedIn (1). Diagram labels: Offline computation; Online computation; Nearline computation; Intermediate representation
  • 33. Convergences at LinkedIn (2)
    Implemented
    ● Pig - Calcite - Beam (on Samza)
    ● Hive - Calcite - Presto
    ● Hive - Calcite - Spark
    ● GraphQL - Calcite - Spark
    ● Spark - Calcite - Beam (on Samza)
    Considering
    ● Hive - Calcite - Pig
    ● Pig - Calcite - Spark
    ● GraphQL - Calcite - Beam (on Samza)
  • 34. AORA principle: Author Once, Run Anywhere