Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Conquering the Lambda architecture in LinkedIn metrics
platform with Apache Calcite and Apache Samza
​Khai Tran
​Staff Sof...
Agenda
● Overview of LinkedIn metrics platform
● Moving from offline to nearline
● Under the hood of the nearline architec...
Overview of LinkedIn metrics platform
Metrics @ LinkedIn
● Metrics = Measurements over tracking data
● Crucial for decision making:
○ Experimentation - test eve...
We provide:
● A trusted repository of metrics
● A self-serve platform for
sustainable lifecycle of metrics
In
production
E...
Growth of UMP Metrics
2016 20172015
6800
4680
1100
Current: 8000+ metrics
# code
LOAD …
# data
# transformation
# code
STORE …
# config
Metrics:
- A = SUM(A’)
- B = Unique(id)
Downstream:
- XLNT
-...
Moving from offline to nearline
Offline computation flows
Hourly job latency: 3-6 hours -> want realtime/nearline
......
Metric union
User code
User code
...
...
What we want for nearline flows
Metric unionUser code
User code
Samza job
Dimension
augmentation
Pinot
Latency is not the only requirement
Easy to onboard ● Minimum effort to convert existing offline into nearline
● Easy to w...
Samza jobs
Putting things together
Pinot
Batch
jobs
UMP realtime platform
UMP offline platform
HDFS
Raptor
Lambda architec...
Current support
User code in Pig ● LOAD, STORE
● FILTER, SAMPLE, SPLIT, UNION
● Simple FOREACH
● JOIN - all semantics
● GR...
Under the hood of the nearline architecture
Pig to Samza through SQL processing
Open source framework for building dynamic
data management systems. Including:
➢ SQL P...
Architecture
...
Metric union
User code
User code
Dimension
augmentation
Calcite relational
algebra as an IR
convert gener...
Pig to Calcite
# code
LOAD …
LOAD ...
COGROUP ...
STORE …
GruntParser
CO-
GROUP
LOAD LOAD
PigRelConverter
FULL
OUTERJ
OIN
...
Example
Example
Example
INNER
JOIN
FILTER FILTER
PROJECT PROJECT
PROJECT
TABLE
SCAN
TABLE
SCAN
Calcite logical plan
Planning/Optimization
➢ Calcite logical plans:
○ Relational algebra: What to do
➢ Samza physical plans:
○ Samza physical n...
Example
Stream-
Stream Self
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
INNER
JO...
Code-gen
From Samza physical plans:
➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs .
Mapp...
Example
Schema and UDF declarations
Operator mapping
Filter functions
Map functions
Produce to Kafka
...
Config-gen
Stream
Stream
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
# dataset.c...
Nearline production use case - Storylines
Top stories picked
up by editors
Feedback to editor - powered by UMP realtime
Conclusion
Samza jobs
From improved Lambda architecture...
Pinot
Batch
jobs
UMP realtime platform
UMP offline platform
HDFS
Raptor
La...
… to our bigger picture
Pig Latin
Calcite
relational
algebra
HiveQL
SparkSQL/
RDD
Presto SQL
Portable
UDFs
AORA (Author On...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza
Prochain SlideShare
Chargement dans…5
×

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

Metrics play an important role in data-driven companies like LinkedIn, where we leverage them extensively for reporting, experimentation, and in-product applications. We built an offline platform to help people define and produce metrics driven through their transformation code, mostly in Pig or Hive, and metadata-rich configurations. Many of our users would like to look at these metrics in a real-time fashion. To support this, we recently built an extension to the platform that auto-generates Samza real-time flow from existing offline transformation code with just a single command. Combining with the existing offline platform, we delivered Lambda architecture without maintaining multiple code bases.

In this talk, we will describe how we use Apache Calcite to translate our offline logic, served as the single source of truth, into both Samza code and configuration for real-time execution.

  • Identifiez-vous pour voir les commentaires

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

  1. 1. Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza ​Khai Tran ​Staff Software Engineer
  2. 2. Agenda ● Overview of LinkedIn metrics platform ● Moving from offline to nearline ● Under the hood of the nearline architecture ● Nearline production usecase ● Conclusion
  3. 3. Overview of LinkedIn metrics platform
  4. 4. Metrics @ LinkedIn ● Metrics = Measurements over tracking data ● Crucial for decision making: ○ Experimentation - test everything ○ Reporting - monitor and alert ○ In production, site-facing applications
  5. 5. We provide: ● A trusted repository of metrics ● A self-serve platform for sustainable lifecycle of metrics In production Experimentation Reporting Primary Data Unified Metrics Platform LinkedIn unified metrics platform (UMP)
  6. 6. Growth of UMP Metrics 2016 20172015 6800 4680 1100 Current: 8000+ metrics
  7. 7. # code LOAD … # data # transformation # code STORE … # config Metrics: - A = SUM(A’) - B = Unique(id) Downstream: - XLNT - Raptor UMP User Code Platform Generated Code To App To App DefineDeclare Onboard Data Metadata Onboarding process User
  8. 8. Moving from offline to nearline
  9. 9. Offline computation flows Hourly job latency: 3-6 hours -> want realtime/nearline ...... Metric union User code User code Cubing, Rollup Dimension augmentation HDFS tables Dali views Pinot, Presto Azkaban execution Espresso, Oracle, MySQL
  10. 10. ... What we want for nearline flows Metric unionUser code User code Samza job Dimension augmentation Pinot
  11. 11. Latency is not the only requirement Easy to onboard ● Minimum effort to convert existing offline into nearline ● Easy to write user code for new nearline flows Easy to maintain ● Just one version of user code - single source of truth ● Run as a service Latency ● ~5 - 30 mins
  12. 12. Samza jobs Putting things together Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  13. 13. Current support User code in Pig ● LOAD, STORE ● FILTER, SAMPLE, SPLIT, UNION ● Simple FOREACH ● JOIN - all semantics ● GROUP/COGROUP, DISTINCT ● Record/Array FLATTEN ● Java UDFs, Python UDFs ● Pig Nested FOREACH and sort/limit (in Windows) ● Hive Not yet
  14. 14. Under the hood of the nearline architecture
  15. 15. Pig to Samza through SQL processing Open source framework for building dynamic data management systems. Including: ➢ SQL Parser ➢ Relational algebra APIs ➢ Query planning engine We built UMP nearline with: ➢ Pig’s Grunt parser ➢ Calcite relational algebra ➢ Calcite query planning engine
  16. 16. Architecture ... Metric union User code User code Dimension augmentation Calcite relational algebra as an IR convert generate Samza code optimize Samza physical plan Samza configuration Pig to Calcite Calcite to Samza
  17. 17. Pig to Calcite # code LOAD … LOAD ... COGROUP ... STORE … GruntParser CO- GROUP LOAD LOAD PigRelConverter FULL OUTERJ OIN AGGRE GATE AGGRE GATE TABLE SCAN TABLE SCAN PRO- JECT User scripts Pig Logical Plan Calcite relational algebra
  18. 18. Example
  19. 19. Example
  20. 20. Example INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite logical plan
  21. 21. Planning/Optimization ➢ Calcite logical plans: ○ Relational algebra: What to do ➢ Samza physical plans: ○ Samza physical node: How to do it ➢ Calcite Samza planner: ○ Calcite logical plan -> optimized Samza physical plan
  22. 22. Example Stream- Stream Self Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite Samza planner Calcite logical plan Samza physical plan
  23. 23. Code-gen From Samza physical plans: ➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs . Mapping: ➢ Samza physical nodes -> corresponding stream APIs: ○ Samza project -> stream.map() ○ Samza filter -> stream.filter() ○ ... ➢ Relational expressions -> lamba functions: ○ Filter expressions -> filter() functions ○ Project expressions -> map() functions ○ ...
  24. 24. Example Schema and UDF declarations Operator mapping Filter functions Map functions Produce to Kafka ...
  25. 25. Config-gen Stream Stream Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream # dataset.conf app-src app-def
  26. 26. Nearline production use case - Storylines
  27. 27. Top stories picked up by editors
  28. 28. Feedback to editor - powered by UMP realtime
  29. 29. Conclusion
  30. 30. Samza jobs From improved Lambda architecture... Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  31. 31. … to our bigger picture Pig Latin Calcite relational algebra HiveQL SparkSQL/ RDD Presto SQL Portable UDFs AORA (Author Once, Run Anywhere) architecture

×