SlideShare une entreprise Scribd logo
1  sur  45
Télécharger pour lire hors ligne
Building Data Products
on Spark at Airbnb
LIYIN TANG & JINGWEI LU
Data Infrastructure at Airbnb
Event
Logs
MySQL
Dumps
Gold Cluster
HDFS
Hive
Kafka
Sqoop
Silver Cluster Spark Cluster
Spark
ReAir
Airflow Scheduling
S3
Presto Cluster
AirPal
SuperSet
Tableau
Batch Infrastructure
Yarn HDFS
Hive
Yarn
Liyin Tang and Jingwei Lu
3
Streaming at Airbnb
Liyin Tang and Jingwei Lu
4
Cluster
Spark Streaming
Airflow Scheduling
HBase
HDFS
Sources
Kafka
S3
HDFS
…
Sinks
Datadog
Kafka
Dynamo
DB
Elastic
Search
…
Lambda Architecture
Batch
AirStream
Hive
Spark SQL
Lambda Architecture
Liyin Tang and Jingwei Lu
6
Streaming
Kafka
Spark Streaming
State Storage
Combine Streaming and
Batch Processing
Sources
Liyin Tang and Jingwei Lu
8
Streaming
source: [
{
name: source_example,
type: kafka,
config: {
topic: "example_topic",
}
}
]
Batch
source: [
{
name: source_example,
type: hive,
sql: {
select * from db.table where
ds=‘2017-06-05’;
}
}
]
Computation
Liyin Tang and Jingwei Lu
9
Streaming/Batch
process: [{
name = process_example,
type = sql,
sql = """
SELECT listing_id, checkin_date, context.source as source
FROM source_example
WHERE user_id IS NOT NULL """
}]
Sinks
Liyin Tang and Jingwei Lu
10
Streaming
sink: [
{
name = sink_example
input = process_example
type = hbase_update
hbase_table_name = test_table
bulk_upload = false
}
]
Batch
sink: [
{
name = sink_example
input = process_example
type = hbase_update
hbase_table_name = test_table
bulk_upload = true
}
]
Streaming
Computation Flow
Liyin Tang and Jingwei Lu
11
Source
Process_A Process_B
Process_A1
Sink_A2 Sink_B2
Batch
Source
Process_A Process_B
Process_A1
Sink_A2 Sink_B2
Liyin Tang and Jingwei Lu
Unified API through AirStream
• Declarative job configuration
• Streaming source vs static source
• Computation operator or sink can be shared by streaming
and batch job.
• Computation flow is shared by streaming and batch
• Single driver executes in both streaming and batch mode job
12
Shared State Storage
AirStream
Shared Global State Store
Liyin Tang and Jingwei Lu
14
HBase Tables
Spark StreamingSpark StreamingSpark StreamingSpark Streaming
Spark BatchSpark BatchSpark BatchSpark Batch
•Well integrated with Hadoop eco system
•Efficient API for streaming writes and bulk uploads
•Rich API for sequential scan and point-lookups
• Merged view based on version
15
Why HBase
Unified Write API
Liyin Tang and Jingwei Lu
16
DataFrame
HBase
Region 1
Region 2
Region N
Re-partition
<Region 1, [RowKey, Value]>
<Region 2, [RowKey, Value]>
<Region N, [RowKey, Value]>
… …
Puts
HFile
BulkLoad
Rich Read API
Liyin Tang and Jingwei Lu
17
HBase Tables
Spark Streaming/Batch Jobs
Multi-Gets Prefix Scan Time Range Scan
Merged Views
Liyin Tang and Jingwei Lu
18
Row Key
R1 V200 TS200
R1 V150 TS150
R1 V01 TS01
… … … …
Time
Streaming Writes
Streaming Writes
Streaming Writes
Merged Views
Liyin Tang and Jingwei Lu
19
Row Key
R1 V200 TS200
R1 V150 TS150
R1 V01 TS01
Time
Streaming Writes
Streaming Writes
Streaming Writes
R1 V100 TS100Batch Bulk Upload
Liyin Tang and Jingwei Lu
Our Foundations
•Unify streaming with batch process
•Shared global state store
20
Use Cases
MySQL DB Snapshot
Using Binlog Replay
• Large amount of data: Multiple large mysql DBs
• Realtime-ness: minutes delay/ hours delay
• Transaction : Need to keep transaction across different tables
• Schema change: Table schema evolves
Database Snapshot
23
Move Elephant
24
Binlog Replay on Spark
20+ hr 4+ hr
AirStream Job
5 mins
15 mins
1 hr
spinal tap
seed
• Streaming and Batch shares Logic:
Binlog file reader, DDL processor,
transaction processor, DML processor.
• Merged by binlog position: <filenum,
offset>
• Idempotent: Log can be replayed
multiple times.
• Schema changes: Full schema
change history.
25
Log Parser
Transaction
Processor
Change
Processor
Schema
Processor
HBASE
Lambda Architecture
Binlog(realtime/history)
DML
DDL
XVID
Mysql Instance
Realtime Indexing
Hive
Realtime Indexing
Liyin Tang and Jingwei Lu
27
Elastic
Search
es_version
=
mutation id
AirStream
Spark Streaming
Spark Batch
Table A
Event Event Event… …
Kafka
Table B
Table C
Realtime OLAP with
Druid
Druid Ingestion
Liyin Tang and Jingwei Lu
29
Druid
AirStream
Spark
Streaming
Kafka
Dimension
Metrics
Druid Beam
Superset Powered by Druid
30
Tips
Moving Window
Computation
Long Window Computation
33
What if window is weeks,
months, or even years?
Distinct in a Large Window
34
I don’t want
approximation. What
should I do?
Distinct Count
Liyin Tang and Jingwei Lu
35
Row Key
Listing 1 Visitor 01 TS100
Listing 1 Visitor 02 TS100
Listing 1 Visitor 04 TS98
Listing 1 Visitor 03 TS99
Prefix Scan with
TimeRange
Prefix Scan with
TimeRange
Time
Moving Average
Liyin Tang and Jingwei Lu
36
Row Key
Listing 1
Total Review
Cnt: 100
TS100
Listing 1
Total Review
Cnt: 98
TS99
Listing 1
Total Review
Cnt: 01
TS01
Listing 1
Total Review
Cnt: 50
TS50
Count Difference/
Time Elapsed
Count Difference/
Time Elapsed
Time
… … …
… … …
Window 1
Window 2
Streaming Ingestion&
Realtime Interactive
Query
Realtime Ingestion and Interactive Query
Liyin Tang and Jingwei Lu
38
HBase
AirStream
Spark
Streaming
Kafka
Query
Engine
Data
Portal
Spark SQL
Hive SQL
Presto SQL
Interactive Query in SqlLab
39
Schema Enforcement
Streaming Events
Thrift-> DataFrame
Liyin Tang and Jingwei Lu
41
Thrift
Event
https://github.com/airbnb/airbnb-spark-thrift
Thrift
Class
Thrift
Object
Field
Meta
Data
Struct
Type
Field
Value
Row
DataFrame
Summary
Unify Batch and
Streaming Computation
43
Global State Store Using
HBase
44
45
We are hiring
Happy Hour:
6pm, B Restaurant&Bar, 720 Howard St, SF

Contenu connexe

Tendances

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

Tendances (20)

Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Apache Spark at Airbnb
Apache Spark at AirbnbApache Spark at Airbnb
Apache Spark at Airbnb
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 

Similaire à Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liyin Tang

Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
confluent
 
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
Thomas Weise
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 

Similaire à Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liyin Tang (20)

HBaseCon2017 Data Product at AirBnB
HBaseCon2017 Data Product at AirBnBHBaseCon2017 Data Product at AirBnB
HBaseCon2017 Data Product at AirBnB
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
 
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
 
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
 
Streaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLStreaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQL
 
Riga dev day: Lambda architecture at AWS
Riga dev day: Lambda architecture at AWSRiga dev day: Lambda architecture at AWS
Riga dev day: Lambda architecture at AWS
 
Stream Analytics with SQL on Apache Flink
 Stream Analytics with SQL on Apache Flink Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Data platform evolution
Data platform evolutionData platform evolution
Data platform evolution
 
Stream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka StreamsStream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka Streams
 
Kafka elastic search meetup 09242018
Kafka elastic search meetup 09242018Kafka elastic search meetup 09242018
Kafka elastic search meetup 09242018
 
Real-Time Event Processing
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Raleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS LambdaRaleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS Lambda
 

Plus de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 

Dernier (20)

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liyin Tang

  • 1. Building Data Products on Spark at Airbnb LIYIN TANG & JINGWEI LU
  • 3. Event Logs MySQL Dumps Gold Cluster HDFS Hive Kafka Sqoop Silver Cluster Spark Cluster Spark ReAir Airflow Scheduling S3 Presto Cluster AirPal SuperSet Tableau Batch Infrastructure Yarn HDFS Hive Yarn Liyin Tang and Jingwei Lu 3
  • 4. Streaming at Airbnb Liyin Tang and Jingwei Lu 4 Cluster Spark Streaming Airflow Scheduling HBase HDFS Sources Kafka S3 HDFS … Sinks Datadog Kafka Dynamo DB Elastic Search …
  • 6. Batch AirStream Hive Spark SQL Lambda Architecture Liyin Tang and Jingwei Lu 6 Streaming Kafka Spark Streaming State Storage
  • 8. Sources Liyin Tang and Jingwei Lu 8 Streaming source: [ { name: source_example, type: kafka, config: { topic: "example_topic", } } ] Batch source: [ { name: source_example, type: hive, sql: { select * from db.table where ds=‘2017-06-05’; } } ]
  • 9. Computation Liyin Tang and Jingwei Lu 9 Streaming/Batch process: [{ name = process_example, type = sql, sql = """ SELECT listing_id, checkin_date, context.source as source FROM source_example WHERE user_id IS NOT NULL """ }]
  • 10. Sinks Liyin Tang and Jingwei Lu 10 Streaming sink: [ { name = sink_example input = process_example type = hbase_update hbase_table_name = test_table bulk_upload = false } ] Batch sink: [ { name = sink_example input = process_example type = hbase_update hbase_table_name = test_table bulk_upload = true } ]
  • 11. Streaming Computation Flow Liyin Tang and Jingwei Lu 11 Source Process_A Process_B Process_A1 Sink_A2 Sink_B2 Batch Source Process_A Process_B Process_A1 Sink_A2 Sink_B2
  • 12. Liyin Tang and Jingwei Lu Unified API through AirStream • Declarative job configuration • Streaming source vs static source • Computation operator or sink can be shared by streaming and batch job. • Computation flow is shared by streaming and batch • Single driver executes in both streaming and batch mode job 12
  • 14. AirStream Shared Global State Store Liyin Tang and Jingwei Lu 14 HBase Tables Spark StreamingSpark StreamingSpark StreamingSpark Streaming Spark BatchSpark BatchSpark BatchSpark Batch
  • 15. •Well integrated with Hadoop eco system •Efficient API for streaming writes and bulk uploads •Rich API for sequential scan and point-lookups • Merged view based on version 15 Why HBase
  • 16. Unified Write API Liyin Tang and Jingwei Lu 16 DataFrame HBase Region 1 Region 2 Region N Re-partition <Region 1, [RowKey, Value]> <Region 2, [RowKey, Value]> <Region N, [RowKey, Value]> … … Puts HFile BulkLoad
  • 17. Rich Read API Liyin Tang and Jingwei Lu 17 HBase Tables Spark Streaming/Batch Jobs Multi-Gets Prefix Scan Time Range Scan
  • 18. Merged Views Liyin Tang and Jingwei Lu 18 Row Key R1 V200 TS200 R1 V150 TS150 R1 V01 TS01 … … … … Time Streaming Writes Streaming Writes Streaming Writes
  • 19. Merged Views Liyin Tang and Jingwei Lu 19 Row Key R1 V200 TS200 R1 V150 TS150 R1 V01 TS01 Time Streaming Writes Streaming Writes Streaming Writes R1 V100 TS100Batch Bulk Upload
  • 20. Liyin Tang and Jingwei Lu Our Foundations •Unify streaming with batch process •Shared global state store 20
  • 22. MySQL DB Snapshot Using Binlog Replay
  • 23. • Large amount of data: Multiple large mysql DBs • Realtime-ness: minutes delay/ hours delay • Transaction : Need to keep transaction across different tables • Schema change: Table schema evolves Database Snapshot 23 Move Elephant
  • 24. 24 Binlog Replay on Spark 20+ hr 4+ hr AirStream Job 5 mins 15 mins 1 hr spinal tap seed
  • 25. • Streaming and Batch shares Logic: Binlog file reader, DDL processor, transaction processor, DML processor. • Merged by binlog position: <filenum, offset> • Idempotent: Log can be replayed multiple times. • Schema changes: Full schema change history. 25 Log Parser Transaction Processor Change Processor Schema Processor HBASE Lambda Architecture Binlog(realtime/history) DML DDL XVID Mysql Instance
  • 27. Hive Realtime Indexing Liyin Tang and Jingwei Lu 27 Elastic Search es_version = mutation id AirStream Spark Streaming Spark Batch Table A Event Event Event… … Kafka Table B Table C
  • 29. Druid Ingestion Liyin Tang and Jingwei Lu 29 Druid AirStream Spark Streaming Kafka Dimension Metrics Druid Beam
  • 31. Tips
  • 33. Long Window Computation 33 What if window is weeks, months, or even years?
  • 34. Distinct in a Large Window 34 I don’t want approximation. What should I do?
  • 35. Distinct Count Liyin Tang and Jingwei Lu 35 Row Key Listing 1 Visitor 01 TS100 Listing 1 Visitor 02 TS100 Listing 1 Visitor 04 TS98 Listing 1 Visitor 03 TS99 Prefix Scan with TimeRange Prefix Scan with TimeRange Time
  • 36. Moving Average Liyin Tang and Jingwei Lu 36 Row Key Listing 1 Total Review Cnt: 100 TS100 Listing 1 Total Review Cnt: 98 TS99 Listing 1 Total Review Cnt: 01 TS01 Listing 1 Total Review Cnt: 50 TS50 Count Difference/ Time Elapsed Count Difference/ Time Elapsed Time … … … … … … Window 1 Window 2
  • 38. Realtime Ingestion and Interactive Query Liyin Tang and Jingwei Lu 38 HBase AirStream Spark Streaming Kafka Query Engine Data Portal Spark SQL Hive SQL Presto SQL
  • 39. Interactive Query in SqlLab 39
  • 41. Thrift-> DataFrame Liyin Tang and Jingwei Lu 41 Thrift Event https://github.com/airbnb/airbnb-spark-thrift Thrift Class Thrift Object Field Meta Data Struct Type Field Value Row DataFrame
  • 43. Unify Batch and Streaming Computation 43
  • 44. Global State Store Using HBase 44
  • 45. 45 We are hiring Happy Hour: 6pm, B Restaurant&Bar, 720 Howard St, SF