Databricks’ Data Pipelines:
Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline on top of Apache Spark
BS from Xiamen University
Ph.D. from The University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
Building a data pipeline is hard
• At-least-once or exactly-once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
• Unified solution, everything in the same place
• All ETL jobs run on the Databricks platform
• Platform for Data Engineers and Scientists
• Test out new Spark and Databricks features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
Classic Lambda Data Pipeline
[Diagram: services (service 0, service 1, … service x), each with a log collector, feed a Centralized Messaging System, followed by Delta ETL and Batch ETL and a Storage System.]
Databricks Data Pipeline Overview
[Diagram, built up step by step across several slides]
• Customer deployments (Customer Dep 0, Dep 1, Dep 2, …) and a Databricks deployment (Databricks Dep)
• Clusters (Cluster 0, Cluster 1, Cluster 2, …), each running services (service 0, service 1, service 2, … service x, service y) and a log-daemon
• Amazon Kinesis
• Databricks Deployment: DBFS (Databricks Filesystem) and Databricks Jobs
• Sync daemon: raw record batches (json)
• ETL jobs: tables (parquet)
• Data analysis
• Real-time analysis
Log collection (Log-daemon)
• Fault tolerance and at-least-once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
Log Daemon Architecture
[Diagram]
• Services (Service 1, Service 2, … Service x) each write active.log; log rotation produces files such as 2015-11-30-20.log and 2015-11-30-19.log
• Log Daemon: log streams (logStream1, logStream2, … logStreamX), each with a reader and a producer, plus state files
• Message Producer sends to Kinesis (topic-1, topic-2)
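The reader, producer, and state files in the diagram suggest how at-least-once delivery can work. The following is a minimal sketch under assumptions, not the actual log-daemon: the JSON state-file format, the function names, and the `send` callback are hypothetical, and log rotation is ignored. The idea it illustrates is that the read offset is persisted only after a line has been handed to the producer, so a crash causes a resend rather than a loss.

```python
import json
import os
import tempfile


def load_offset(state_path: str) -> int:
    """Return the last committed byte offset, or 0 if no state file exists."""
    try:
        with open(state_path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0


def commit_offset(state_path: str, offset: int) -> None:
    """Atomically persist the offset (write a temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, state_path)


def ship_new_lines(log_path: str, state_path: str, send) -> int:
    """Send lines appended since the last commit; return how many were sent.

    `send` is a hypothetical producer callback that raises on failure.
    The offset is committed only after `send` returns, so a crash in
    between leads to a resend (at least once), never to a lost line.
    """
    offset = load_offset(state_path)
    sent = 0
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            line = log.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line: retry later
            send(line)
            offset += len(line)
            commit_offset(state_path, offset)
            sent += 1
    return sent
```

Because a line can be sent and the daemon can stop before the commit, consumers downstream have to tolerate duplicates.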
Sync Daemon
• Read from Kinesis and Write to DBFS
• Buffer and write in batches (128 MB or 5 Mins)
• Partitioned by date
• A long running Apache Spark job
• Easy to scale up and down
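The "128 MB or 5 Mins" rule and the date partitioning can be illustrated with a small buffer. This is a sketch under assumptions rather than the Sync Daemon itself (which the slide describes as a long-running Spark job): the `BatchBuffer` class, the `write_batch` sink, and the `date=YYYY-MM-DD` partition naming are hypothetical.

```python
import time
from collections import defaultdict
from datetime import datetime

MAX_BYTES = 128 * 1024 * 1024   # size threshold from the slide (128 MB)
MAX_AGE_SECONDS = 5 * 60        # time threshold from the slide (5 minutes)


class BatchBuffer:
    """Buffers records per date partition and flushes on size or age."""

    def __init__(self, write_batch, max_bytes=MAX_BYTES,
                 max_age=MAX_AGE_SECONDS, clock=time.monotonic):
        self.write_batch = write_batch    # hypothetical sink: (partition, records)
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.clock = clock
        self.records = defaultdict(list)  # partition -> buffered records
        self.size = 0                     # total buffered bytes
        self.started = None               # arrival time of the first buffered record

    def add(self, record: bytes, event_time: datetime) -> None:
        if self.started is None:
            self.started = self.clock()
        partition = event_time.strftime("date=%Y-%m-%d")
        self.records[partition].append(record)
        self.size += len(record)
        if self.should_flush():
            self.flush()

    def should_flush(self) -> bool:
        if self.started is None:
            return False
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.started >= self.max_age
        return too_big or too_old

    def flush(self) -> None:
        # One write per date partition, then reset the buffer.
        for partition, batch in self.records.items():
            self.write_batch(partition, batch)
        self.records.clear()
        self.size = 0
        self.started = None
```

In this sketch the age check only runs when a record arrives; a real daemon would presumably also flush on a timer so that a quiet stream does not hold data indefinitely.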
ETL Jobs
[Diagram: Raw records in the Databricks Filesystem → Databricks Jobs → ETL Tables (Parquet) in the Databricks Filesystem]
• Delta job (every 10 mins): new files, current day, no dedup, append
• Batch job (daily): all files, previous day, dedup, overwrite
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as Parquet tables
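Sharing one code path between the Delta and Batch jobs can be sketched as a single function with a mode switch. This is an illustration under assumptions, not the production job: the record fields (`id`, `level`), the `transform` logic, and the in-memory `table` dictionary standing in for a Parquet table are all hypothetical. Only the append/no-dedup versus dedup/overwrite split comes from the slides.

```python
from typing import Dict, Iterable, List


def transform(raw: dict) -> dict:
    """Shared ETL logic; the fields here are hypothetical."""
    return {"id": raw["id"], "level": raw["level"].upper()}


def dedup(rows: Iterable[dict]) -> List[dict]:
    """Keep the first row seen for each id."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def run_etl(table: Dict[str, List[dict]], day: str,
            raw_records: Iterable[dict], mode: str) -> None:
    """Run the shared transform in 'delta' or 'batch' mode.

    `table` maps a date partition to its rows and stands in for a
    Parquet table on DBFS.
    """
    rows = [transform(r) for r in raw_records]
    if mode == "delta":
        # New files for the current day: append without deduplicating.
        table.setdefault(day, []).extend(rows)
    elif mode == "batch":
        # All files for the previous day: deduplicate, then overwrite.
        table[day] = dedup(rows)
    else:
        raise ValueError(f"unknown mode: {mode}")
```

The daily batch run overwrites whatever the delta runs appended for that day, which is what lets the delta job skip deduplication.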
Lessons Learned
- Partition Pruning can save a lot of time and money
Reduced query time from 2800 seconds to just 15 seconds.
Don’t partition on too many levels: it makes metadata discovery slower and more expensive.
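The effect of partition pruning can be shown without Spark. In this sketch the `partitions` mapping (date to file list) and the function name are hypothetical; it only illustrates that a filter on the partition key lets a reader skip whole directories instead of opening every file.

```python
from typing import Dict, List


def files_to_scan(partitions: Dict[str, List[str]],
                  start: str, end: str) -> List[str]:
    """Return files only from date partitions within [start, end].

    Dates are ISO strings (YYYY-MM-DD), which compare correctly as text.
    """
    selected = []
    for day in sorted(partitions):
        if start <= day <= end:
            selected.extend(partitions[day])
    return selected
```

With a table partitioned by date, a query for a single day touches one partition's files; without the partition filter, every file would have to be read and then filtered row by row.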
Lessons Learned
- High S3 costs: Lots of LIST Requests
Metadata discovery on S3 is expensive. Spark SQL tries to refresh its metadata cache even after write operations.
Running It All in Databricks - Jobs
Running It All in Databricks - Spark
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
Monitoring
Monitor logs by log level. Bug introduced on 2016-05-26 01:00:00 UTC. Fix deployed in 2 hours.
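Monitoring by log level comes down to counting records per time bucket and level. The sketch below assumes hypothetical record fields (`timestamp`, `level`) and an hourly bucket; the dashboard on the slide was presumably built with Spark, which is not shown here.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def count_by_hour_and_level(records: Iterable[dict]) -> Counter:
    """Count records per (hour, level); the record fields are hypothetical."""
    counts = Counter()
    for rec in records:
        ts = datetime.strptime(rec["timestamp"], "%Y-%m-%d %H:%M:%S")
        hour = ts.strftime("%Y-%m-%d %H:00")
        counts[(hour, rec["level"])] += 1
    return counts
```

A sudden jump in the ERROR count for one hour is the kind of signal that would point to the time a bug was introduced.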
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real time analytics
DevOps issues taken out of the equation:
- No need to manage a huge cluster
- Jobs are isolated, they don’t cannibalize each other’s resources
- Can launch any Spark version
Ongoing & Future Work
Structured Streaming
- Reduce Complexity of pipeline:
Sync Daemon + Delta + Batch Jobs => Single Streaming Job
- Reduce Latency
Availability of data in seconds instead of minutes
- Event Time Dashboards
Try Apache Spark with Databricks
http://databricks.com/try
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks Booth, 3:45-6:00pm!

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 

A Journey into Databricks' Pipelines: Journey and Lessons Learned

  • 1. Databricks’ Data Pipelines: Journey and Lessons Learned Yu Peng, Burak Yavuz 07/06/2016
  • 2. Who Are We. Yu Peng: Data Engineer at Databricks; building Databricks’ next-generation data pipeline on top of Apache Spark; BS, Xiamen University; Ph.D., The University of Hong Kong. Burak Yavuz: Software Engineer at Databricks; contributor to Spark since Spark 1.1; maintainer of Spark Packages; BS in Mechanical Engineering, Bogazici University; MS in Management Science & Engineering, Stanford University.
  • 3. Building a data pipeline is hard • At least once or exactly once semantics • Fault tolerance • Resource management • Scalability • Maintainability
  • 4. Apache® Spark™ + Databricks = Our Solution • All ETL jobs are built on top of Apache Spark • Unified solution, everything in the same place • All ETL jobs are run on Databricks platform • Platform for Data Engineers and Scientists • Test out Spark and Databricks new features Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
  • 5. Classic Lambda Data Pipeline [diagram labels: service 0 … service x, each with a log collector; Centralized Messaging System; Delta ETL; Batch ETL; Storage System]
  • 6. Databricks Data Pipeline Overview [diagram labels: Customer Dep 0, Customer Dep 1, Customer Dep 2, …, Databricks Dep; Amazon Kinesis]
  • 7. Databricks Data Pipeline Overview [diagram labels: Customer Dep 0, Customer Dep 1, Customer Dep 2, …, Databricks Dep; Cluster 0, Cluster 1, Cluster 2, …, each running services and a log-daemon; Amazon Kinesis]
  • 8. Databricks Data Pipeline Overview [diagram labels: same as slide 7]
  • 9. Databricks Data Pipeline Overview [diagram labels: as slide 7, plus a Databricks Deployment with Databricks Filesystem and Databricks Jobs]
  • 10. Databricks Data Pipeline Overview [diagram labels: as slide 9, plus Real-time analysis]
  • 11. Databricks Data Pipeline Overview [diagram labels: as slide 9, with DBFS in the Databricks Deployment, plus Sync daemon and Raw record batch (json)]
  • 12. Databricks Data Pipeline Overview [diagram labels: as slide 11, plus ETL jobs and Tables (parquet)]
  • 13. Databricks Data Pipeline Overview [diagram labels: as slide 12, plus Data analysis and Real-time analysis]
  • 14. Log collection (Log-daemon) • Fault tolerance and at-least-once semantics • Streaming • Batch • Spark History Server • Multi-tenant and config driven • Spark container
  • 15. Log Daemon Architecture [diagram labels: Service 1, Service 2, …, Service x, each with active.log, 2015-11-30-20.log, 2015-11-30-19.log and log rotation; Log Daemon with logStream1, logStream2, …, logStreamX, each with a reader and a producer; state files; Message Producer; Kinesis topic-1, topic-2]
  • 16. Sync Daemon • Read from Kinesis and write to DBFS • Buffer and write in batches (128 MB or 5 minutes) • Partitioned by date • A long-running Apache Spark job • Easy to scale up and down
  • 17. ETL Jobs [diagram labels: Raw records (Databricks Filesystem); Databricks Jobs; ETL Tables (Parquet) (Databricks Filesystem). Delta job (every 10 mins): new files, current day, no dedup, append. Batch job (daily): all files, previous day, dedup, overwrite.]
  • 18. ETL Jobs • Use the same code for Delta and Batch jobs • Run as scheduled Databricks jobs • Use spot instances and fall back to on-demand • Deliver to Databricks as parquet tables
  • 19. Lessons Learned - Partition pruning can save a lot of time and money. Reduced query time from 2800 seconds to just 15 seconds. Don’t partition too many levels, as it leads to worse metadata discovery performance and cost.
  • 20. Lessons Learned - High S3 costs: lots of LIST requests. Metadata discovery on S3 is expensive. Spark SQL tries to refresh its metadata cache even after write operations.
  • 21. Running It All in Databricks - Jobs
  • 22. Running It All in Databricks - Spark
  • 23. Data Analysis & Tools. We get the data in. What’s next? ● Monitoring ● Debugging ● Usage Analysis ● Product Design (A/B testing)
  • 24. Debugging: access to logs in a matter of seconds thanks to Apache Spark.
  • 25. Monitoring: monitor logs by log level. Bug introduced on 2016-05-26 01:00:00 UTC. Fix deployed in 2 hours.
  • 26. Usage Analysis + Product Design: SparkR + ggplot2 = Match made in heaven
  • 27. Summary. Databricks + Apache Spark create a unified platform for: - ETL - Data Warehousing - Data Analysis - Real-time analytics. Issues with DevOps out of the question: - No need to manage a huge cluster - Jobs are isolated, they don’t cannibalize each other’s resources - Can launch any Spark version
  • 28. Ongoing & Future Work: Structured Streaming - Reduce complexity of pipeline: Sync Daemon + Delta + Batch Jobs => Single Streaming Job - Reduce latency: availability of data in seconds instead of minutes - Event Time Dashboards
  • 29. Try Apache Spark with Databricks: http://databricks.com/try
  • 30. Thank you. Have questions about ETL with Spark? Join us at the Databricks Booth 3.45-6.00pm!
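
The Sync Daemon slide (16) says records are buffered and written in batches of 128 MB or 5 minutes, partitioned by date. The slides do not show the implementation, so the following is only a minimal plain-Python sketch of that flush rule. `BatchBuffer`, `partition_path`, the `date=` directory layout and the file naming are all hypothetical names chosen for illustration; the real daemon is described as a long-running Spark job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

MAX_BYTES = 128 * 1024 * 1024  # 128 MB threshold, from slide 16
MAX_AGE_SECONDS = 5 * 60       # 5 minute threshold, from slide 16


@dataclass
class BatchBuffer:
    """Hypothetical buffer: flush when it is big enough or old enough."""
    max_bytes: int = MAX_BYTES
    max_age: float = MAX_AGE_SECONDS
    records: List[bytes] = field(default_factory=list)
    size: int = 0
    opened_at: Optional[float] = None  # time the first buffered record arrived

    def add(self, record: bytes, now: float) -> None:
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)

    def should_flush(self, now: float) -> bool:
        if not self.records:
            return False
        too_big = self.size >= self.max_bytes
        too_old = (now - self.opened_at) >= self.max_age
        return too_big or too_old

    def drain(self) -> List[bytes]:
        """Return the buffered records and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch


def partition_path(root: str, ts: float, batch_id: int) -> str:
    """Build a date-partitioned output path (layout is an assumption)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{root}/date={day}/batch-{batch_id:06d}.json"
```

Passing `now` explicitly keeps the flush rule easy to test; a real daemon would read the clock itself and write each drained batch to the path for its date.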
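
The ETL Jobs slide (17) contrasts a Delta job (every 10 minutes; new files, current day, no dedup, append) with a daily Batch job (all files, previous day, dedup, overwrite). As a rough illustration of how the two modes differ, here is a plain-Python sketch that models a table as a dict of date partitions. The function names, the in-memory data model and the `id` dedup key are assumptions; the slides do not say how records are identified or deduplicated.

```python
def delta_job(table, raw_files, processed, day):
    """Append rows from files not yet processed for `day`; no dedup."""
    for name, rows in raw_files.get(day, {}).items():
        if name in processed:
            continue  # only new files are read
        table.setdefault(day, []).extend(rows)
        processed.add(name)


def batch_job(table, raw_files, day, key="id"):
    """Rebuild the whole `day` partition from all files, one row per key."""
    seen = {}
    for name in sorted(raw_files.get(day, {})):
        for row in raw_files[day][name]:
            seen.setdefault(row[key], row)  # keep the first row for each key
    table[day] = list(seen.values())  # overwrite, not append


# Two raw files for one day; record 2 was delivered twice.
raw = {
    "2016-06-06": {
        "a.json": [{"id": 1}, {"id": 2}],
        "b.json": [{"id": 2}, {"id": 3}],
    }
}
table, processed = {}, set()
delta_job(table, raw, processed, "2016-06-06")  # fast path: may hold duplicates
batch_job(table, raw, "2016-06-06")             # daily pass: deduplicated partition
```

Under this reading, the frequent Delta job keeps the current day fresh, and the Batch job later replaces the previous day's partition with a clean copy, which fits a log collector that offers at-least-once delivery.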