Designing and Building Next Generation Data Pipelines at Scale with Structured Streaming
Burak Yavuz, Databricks
October 4th 2018, London #SAISDev15
Who am I
● Software Engineer – Databricks
  - "We make your streams come true"
● Apache Spark Committer
● MS in Management Science & Engineering – Stanford University
● BS in Mechanical Engineering – Bogazici University, Istanbul
Today, we're going to ride a time machine
Let’s go back to 2014…
Evolution of Data Pipelines @ Databricks, circa MMXIV
[Diagram: multiple Deployments write logs to Amazon S3; an S3-copy batch job runs every hour*]
Data Pipeline V1
• Took 1 engineer ~1 week to implement
• Was pretty robust for the early days of Databricks
• … until we got to 30+ customers
• File listing on S3 quickly became the bottleneck
• *eventually, job duration rose to 8 hours
Fast-forward to 2016…
Evolution of Data Pipelines @ Databricks, circa 2016
[Diagram: Deployments stream continuously into Amazon Kinesis; data is compacted once a day]
Data Pipeline V2
• Scaled very well
• ETL’ing the data became fast
• Took 2 engineers ~8 months
• Query performance / experience got worse
• Lots of small files
• Compaction jobs impacted queries (FileNotFoundExceptions)
• HiveMetaStore quickly became the bottleneck
• REFRESH TABLE / MSCK REPAIR TABLE / ALTER TABLE ADD PARTITIONS
• Logic became more complicated. Pipeline was less robust
• Fixing mistakes in the data became harder
What Happened?
Examples of Data Mistakes
• A field’s unit changed from MB to GB
• Schema inference / mismatch
• An integer column in JSON started getting inferred as longs after a Spark upgrade; some Parquet files had ints, some had longs
• A different type of log got introduced to the system. All of a sudden a table
with 8 columns had 32 new columns introduced
• Garbage data caused by partial failures
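The int-vs-long mistake above can be patched over at read time by widening conflicting column types. A minimal pure-Python sketch of that idea (the `reconcile_schemas` helper and its string type names are made up for illustration, not a Spark API):

```python
def reconcile_schemas(schemas):
    """Widen conflicting column types so mixed files stay readable.

    If some Parquet files carry a column as int and others as long (e.g.
    after a Spark upgrade changed JSON inference), reading with the widest
    type avoids failing or corrupt scans.
    """
    widening = {("int", "long"): "long", ("long", "int"): "long"}
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            prev = merged.get(col, typ)
            # Same type: keep it. Known widening: widen. Otherwise fall
            # back to string as a last resort.
            merged[col] = typ if prev == typ else widening.get((prev, typ), "string")
    return merged

old_files = {"bytes_used": "int"}
new_files = {"bytes_used": "long"}
assert reconcile_schemas([old_files, new_files]) == {"bytes_used": "long"}
```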
Problems in Data Pipelines
• Correctness
• Lack of atomicity leads to pockets of duplicate data
• Bookkeeping of what to process is tedious – late / out-of-order data
• Schema Management
• Maintaining Data Hygiene – checks/corrections
• Performance
• Listing files on blob storage systems is slow
• Lots of small files hurt query performance
• HiveMetaStore experience is horrendous
– Doesn’t scale well
– Having to call MSCK REPAIR TABLE and REFRESH TABLE all the time
Enter Structured Streaming
You care about your business logic
Structured Streaming cares about incrementally running your
logic on new data over time
Correctness Problems in Data Pipelines
• Lack of atomicity leads to pockets of duplicate data
• Structured Streaming writes a manifest file to "commit" data to a file sink, so committed files appear atomically to Spark
• Bookkeeping of what to process is tedious – late / out-of-order
data
• The engine keeps track of what data is processed, and what data is new
• Watermark support allows "correct" processing of late / out-of-order data
• Schema Management
• Maintaining Data Hygiene – checks/corrections
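The manifest-commit idea can be sketched in plain Python. This is a conceptual illustration, not Spark's actual file-sink code: the `_spark_metadata` directory name matches Spark's convention, but the JSON layout here is invented.

```python
import json
import os
import tempfile

def commit_batch(table_dir, batch_id, data_files):
    """'Commit' a batch by atomically writing a manifest listing its files.

    Readers treat only files named in manifests as part of the table, so a
    job that dies after writing data files but before the manifest leaves
    no visible partial output (no pockets of duplicate data).
    """
    manifest = os.path.join(table_dir, "_spark_metadata", f"{batch_id}.json")
    os.makedirs(os.path.dirname(manifest), exist_ok=True)
    tmp = manifest + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"batch": batch_id, "files": data_files}, f)
    os.rename(tmp, manifest)  # the atomic step on POSIX filesystems

def visible_files(table_dir):
    """Return files from committed manifests only; never list the data dir."""
    meta = os.path.join(table_dir, "_spark_metadata")
    files = []
    for name in sorted(os.listdir(meta)):
        if name.endswith(".json"):
            with open(os.path.join(meta, name)) as f:
                files.extend(json.load(f)["files"])
    return files

table = tempfile.mkdtemp()
commit_batch(table, 0, ["part-0000.parquet", "part-0001.parquet"])
# A crashed batch that never wrote its manifest contributes nothing:
assert visible_files(table) == ["part-0000.parquet", "part-0001.parquet"]
```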
Performance Problems in Data Pipelines
• Listing files on blob storage systems is slow
• The manifest lists which files were written by Structured Streaming, so Spark no longer needs to list directories
• Lots of small files hurt query performance
• HiveMetaStore experience is horrendous
• Doesn’t scale well
• Having to call MSCK REPAIR TABLE and REFRESH TABLE all the time
Enter Databricks Delta
• Separates Compute from Storage
• No dependency on HiveMetaStore – Manages metadata internally
• Scales to Billions of partitions and/or files
• Supports Batch/Stream Reads/Writes
• ACID Transactions
• Compaction and Indexing
• Schema Management / Invariant Support
• Tables Auto-Update and provide Snapshot Isolation
• DELETE / UPDATE / MERGE
• Leverages data locality through DBIO Caching
• Coming Q4 2018 => Querying old versions of the table + Rollbacks
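How internally managed metadata enables both scalable state and "querying old versions" can be shown with a toy action log. This is loosely modeled on Delta's transaction log; the real log format and APIs differ.

```python
import json

def replay(log_lines, as_of_version=None):
    """Reconstruct the set of live data files from a Delta-style action log.

    Each log entry adds or removes a file. Replaying the log up to a given
    version yields a consistent snapshot, which is also how old versions of
    a table can be queried ("time travel").
    """
    live = set()
    for version, line in enumerate(log_lines):
        if as_of_version is not None and version > as_of_version:
            break
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"])
        elif "remove" in action:
            live.discard(action["remove"])
    return live

log = [
    '{"add": "part-0.parquet"}',
    '{"add": "part-1.parquet"}',
    '{"remove": "part-0.parquet"}',  # e.g. rewritten by compaction or UPDATE
]
assert replay(log) == {"part-1.parquet"}
assert replay(log, as_of_version=1) == {"part-0.parquet", "part-1.parquet"}
```

Because the snapshot is computed from the log rather than from a metastore or directory listing, the computation itself can be distributed (Delta runs it as Spark jobs), which is what lets it scale to billions of files.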
Correctness Problems in Data Pipelines
• Lack of atomicity leads to pockets of duplicate data
• Delta provides ACID transactions and Snapshot Isolation
• Bookkeeping of what to process is tedious – late / out-of-order data
• Structured Streaming handles this. Streaming into Delta provides exactly-once
semantics
• Schema Management
• Delta manages the schema of the table internally and allows “safe” (opt-in)
evolutions
• Maintaining Data Hygiene – checks/corrections
• Delta supports DELETE / UPDATE to delete/fix records
• Delta supports Invariants (NOT NULL, enum in (‘A’, ‘B’, ‘C’))
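The invariant idea can be sketched as a pre-commit check. A minimal illustration only; `check_invariants` and the predicate dictionary are made up, not Delta's API.

```python
def check_invariants(record, invariants):
    """Reject records that violate declared column invariants before commit.

    `invariants` maps column name -> predicate; a failing record aborts the
    write instead of silently polluting the table.
    """
    for column, predicate in invariants.items():
        if not predicate(record.get(column)):
            raise ValueError(f"invariant violated on column {column!r}: {record}")
    return record

invariants = {
    "user_id": lambda v: v is not None,        # NOT NULL
    "tier": lambda v: v in ("A", "B", "C"),    # enum in ('A', 'B', 'C')
}

assert check_invariants({"user_id": 1, "tier": "B"}, invariants)
try:
    check_invariants({"user_id": None, "tier": "B"}, invariants)
    rejected = False
except ValueError:
    rejected = True
assert rejected  # the bad record never reaches the table
```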
Performance Problems in Data Pipelines
• Listing files on blob storage systems is slow
• Delta doesn’t need to list files. It keeps file information in its state
• Lots of small files hurt query performance
• Delta’s OPTIMIZE method compacts data without affecting in-flight queries
• HiveMetaStore experience is horrendous
• Delta uses Spark jobs to compute its state, therefore metadata is scalable!
• Delta auto-updates tables, therefore you don’t need REFRESH TABLE /
MSCK REPAIR TABLE / ALTER TABLE ADD PARTITIONS, etc
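The compaction that OPTIMIZE performs boils down to binning small files into larger rewrites. A simplified planning sketch under assumed sizes (not Delta's actual algorithm; `compact` and `target` are illustrative names):

```python
def compact(live_files, sizes, target):
    """Plan a compaction: group small files into bins of roughly `target` bytes.

    In Delta, OPTIMIZE rewrites each bin as one large file and commits a
    single log entry that removes the small files and adds the compacted
    ones, so in-flight readers keep a consistent snapshot (no
    FileNotFoundExceptions).
    """
    small = sorted(f for f in live_files if sizes[f] < target)
    bins, current, current_size = [], [], 0
    for f in small:
        current.append(f)
        current_size += sizes[f]
        if current_size >= target:
            bins.append(current)
            current, current_size = [], 0
    if current:
        bins.append(current)  # leftover partial bin
    return bins

sizes = {"a": 10, "b": 20, "c": 90, "d": 200}
plan = compact(sizes.keys(), sizes, target=100)
assert plan == [["a", "b", "c"]]  # "d" is already large enough
```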
Delta @ Databricks
[Diagram: Raw Delta tables feed Summary Delta tables, which serve Reporting, Streaming, and Analytics; labels: Larger Size, Longer Retention]
New Requirements
• Launched on Azure
  • Can't leverage Kinesis anymore
  • Have to replicate the pipeline in many Azure regions
• GDPR
Evolution of Data Pipelines @ Databricks, event based
[Diagram: event-based sources feed Bronze, Silver, and Gold Delta tables, which serve Reporting, Streaming, and Analytics]
Event-Based File Sources
• Launched Structured Streaming connectors:
  • s3-sqs on AWS (DBR 3.5)
  • abs-aqs on Azure (DBR 5.0)
• As blobs are generated:
  • Events are published to SQS/AQS
  • Spark reads these events
  • Then reads the original files from the blob storage system
[Diagram: AWS S3 → AWS SQS; Azure Blob Storage → Event Grid → Queue Storage]
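The event-driven pattern can be sketched as a queue-drain loop. A conceptual illustration only: `drain_events` and the event/dict shapes are invented, not the s3-sqs connector's interface.

```python
from collections import deque

def drain_events(queue, read_file):
    """Consume blob-creation events from a queue and read only those files.

    This mirrors the s3-sqs / abs-aqs pattern: the storage service publishes
    one event per new blob, so the stream never has to list the bucket.
    """
    batch = []
    while queue:
        event = queue.popleft()
        batch.append(read_file(event["path"]))
    return batch

# A dict stands in for blob storage; a deque stands in for SQS/AQS.
store = {"logs/0.json": '{"id": 1}', "logs/1.json": '{"id": 2}'}
events = deque({"path": p} for p in sorted(store))
assert drain_events(events, store.__getitem__) == ['{"id": 1}', '{"id": 2}']
```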
Properties of Bronze/Silver/Gold
• Bronze tables
• No data processing
• Deduplication + JSON => Parquet conversion
• Data kept around for a couple weeks in order to fix mistakes just in case
• Silver tables
• Directly queryable tables
• PII masking/redaction
• Gold tables
• Materialized views of silver tables
• Curated tables by the Data Science team
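The Bronze/Silver/Gold split can be sketched end to end. A toy illustration under assumed record shapes (the function names, field names, and SHA-256 masking choice are all made up for the example):

```python
import hashlib

def to_bronze(raw_events):
    """Bronze: deduplicate raw events by id; no other processing."""
    seen, out = set(), []
    for e in raw_events:
        if e["id"] not in seen:
            seen.add(e["id"])
            out.append(e)
    return out

def to_silver(bronze):
    """Silver: directly queryable; mask PII (here, hash the email)."""
    return [{**e, "email": hashlib.sha256(e["email"].encode()).hexdigest()[:12]}
            for e in bronze]

def to_gold(silver):
    """Gold: a materialized summary, e.g. events per user."""
    counts = {}
    for e in silver:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

raw = [{"id": 1, "user": "u1", "email": "a@x.com"},
       {"id": 1, "user": "u1", "email": "a@x.com"},  # duplicate delivery
       {"id": 2, "user": "u1", "email": "a@x.com"}]
gold = to_gold(to_silver(to_bronze(raw)))
assert gold == {"u1": 2}
```

Keeping Bronze around for a couple of weeks means any bug in the Silver or Gold logic can be fixed by replaying from Bronze rather than re-ingesting from the source.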
Dealing with GDPR
• Delta's in-built support for DELETE and UPDATE makes data subject requests (DSRs) tractable
• Delete or update the records
• Run VACUUM after 7 days (configurable) and the old data is gone!
• Check out the blog post for more details!
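The retention mechanics behind VACUUM can be sketched as a filter over tombstoned files. A conceptual model only, not Delta's implementation; `vacuum` and its arguments are illustrative.

```python
import time

def vacuum(all_files, live_files, tombstoned_at, retention_seconds, now=None):
    """Return files that are safe to delete physically.

    After a GDPR DELETE/UPDATE rewrites data, the old files are tombstoned
    but kept on disk for the retention window (so older snapshots still
    work). VACUUM removes a file for good only once it is both no longer
    live and past retention.
    """
    now = time.time() if now is None else now
    return [f for f in all_files
            if f not in live_files
            and now - tombstoned_at.get(f, now) >= retention_seconds]

week = 7 * 24 * 3600  # the default 7-day retention, configurable
tombstones = {"old-part.parquet": 0}  # tombstoned at t=0
deletable = vacuum({"old-part.parquet", "new-part.parquet"},
                   {"new-part.parquet"}, tombstones, week, now=week + 1)
assert deletable == ["old-part.parquet"]
```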
Data Pipeline V3
• Event-based file sources avoid file listing altogether
• Scales even better than V2
• Easy to replicate across Clouds and regions
• Run Once trigger gives all benefits of Structured Streaming with
cost benefits of running batch jobs
• Delta makes GDPR easy
• Latency went from 15 seconds to 5 minutes for general case
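The Run Once trigger mentioned above amounts to "process everything new since the last checkpoint, then stop". A toy sketch with a list standing in for a source and a dict for the checkpoint (illustrative names; the real trigger is `Trigger.Once` in Structured Streaming):

```python
def run_once(source, checkpoint):
    """Process all data that arrived since the last run, then stop.

    The checkpoint remembers the committed offset, so a cheap scheduled
    batch job gets streaming's exactly-once bookkeeping without a
    cluster running 24/7.
    """
    start = checkpoint.get("offset", 0)
    batch = source[start:]
    checkpoint["offset"] = len(source)  # commit the new offset
    return batch

events, ckpt = ["e1", "e2"], {}
assert run_once(events, ckpt) == ["e1", "e2"]
events += ["e3"]
assert run_once(events, ckpt) == ["e3"]  # only new data on the next run
```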
Other Techniques
• Leverage S3 Inventory and Delta’s transaction log to unearth
value from 500 million JSON files
• Using S3 Select to reduce data size
• Using Continuous Processing to process data from Kafka and
write to Kafka at sub-millisecond latencies
• All available with Databricks Runtime!
Summary
• File listing and many small files hurt performance
  • Using Delta and event-based notification sources helps us avoid listing
  • Delta's in-built compaction and indexing alleviate the small-file problem
• HiveMetaStore can become the bottleneck for large tables
  • Delta uses Spark jobs to manage its metadata, scaling to billions of files
  • Delta auto-updates => no need to call REFRESH TABLE with Spark
  • No need to add/remove partitions; no need for MSCK REPAIR TABLE
• Partial / distributed failures can taint tables
  • Delta's ACID transactions guard us against garbage data
  • You always get a consistent (possibly stale) view of your table with Delta
• Schema Management and Data Hygiene are hard problems
  • Delta has in-built schema management that only allows safe changes
  • Invariants in Delta prevent unexpected data from polluting tables
  • The Delta architecture (Bronze-Silver-Gold tables) makes backfills and corrections easier
  • Delta's upcoming support for rollbacks will make corrections effortless
• GDPR adds extra complexity to pipelines through DSRs
  • UPDATE / DELETE support in Delta makes this easier
Further Reading
• On Structured Streaming:
  • https://databricks.com/blog/2017/08/24/anthology-of-technical-assets-on-apache-sparks-structured-streaming.html
• On Delta:
  • https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html
  • https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
  • https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html
  • https://databricks.com/blog/2018/07/19/simplify-streaming-stock-data-analysis-using-databricks-delta.html
  • https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html
Thank You
“Do you have any questions for my prepared answers?”
– Henry Kissinger
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 vues
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf par vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 vues

Designing and Building Next Generation Data Pipelines at Scale with Structured Streaming with Burak Yavuz

  • 1. Designing and Building Next Generation Data Pipelines at Scale with Structured Streaming. Burak Yavuz, October 4th 2018, London. #SAISDev15
  • 2. Who am I
    • Software Engineer, Databricks ("We make your streams come true")
    • Apache Spark Committer
    • MS in Management Science & Engineering, Stanford University
    • BS in Mechanical Engineering, Bogazici University, Istanbul
  • 3. Today, we’re going to ride a time machine
  • 4. Let’s go back to 2014…
  • 5. Evolution of Data Pipelines @ Databricks, circa MMXIV: [diagram] several Deployments feed an hourly* S3-copy batch job that lands data in Amazon S3.
  • 6. Data Pipeline V1
    • Took 1 engineer ~1 week to implement
    • Was pretty robust for the early days of Databricks
    • … until we got to 30+ customers
    • File listing on S3 quickly became the bottleneck
    • *Eventually, job duration rose to 8 hours
  • 8. Evolution of Data Pipelines @ Databricks, circa 2016: [diagram] Deployments stream continuously into Amazon Kinesis, and the data is compacted once a day.
  • 9. Data Pipeline V2
    • Scaled very well
    • ETL'ing the data became fast
    • Took 2 engineers ~8 months
    • Query performance / experience got worse
      • Lots of small files
      • Compaction jobs impacted queries (FileNotFoundExceptions)
    • The HiveMetaStore quickly became the bottleneck
      • REFRESH TABLE / MSCK REPAIR TABLE / ALTER TABLE ADD PARTITIONS
    • Logic became more complicated; the pipeline was less robust
    • Fixing mistakes in the data became harder
  • 11. Examples of Data Mistakes
    • A field's unit changed from MB to GB
    • Schema inference / mismatch: after a Spark upgrade, an integer column in JSON started getting inferred as long, so some Parquet files had ints and some had longs
    • A different type of log got introduced to the system; all of a sudden a table with 8 columns had 32 new columns
    • Garbage data caused by partial failures
  • 12. Problems in Data Pipelines
    • Correctness
      • Lack of atomicity leads to pockets of duplicate data
      • Bookkeeping of what to process is tedious: late / out-of-order data
      • Schema Management
      • Maintaining Data Hygiene: checks/corrections
    • Performance
      • Listing files on blob storage systems is slow
      • Lots of small files hurt query performance
      • The HiveMetaStore experience is horrendous
        • Doesn't scale well
        • Having to call MSCK REPAIR TABLE and REFRESH TABLE all the time
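The metastore toil called out above is standard Spark SQL; a sketch of what operators end up running over and over (table and partition names are hypothetical):

```sql
-- Force Spark to pick up files written outside its cached view of the table
REFRESH TABLE logs;

-- Discover partitions added directly on the storage layer
MSCK REPAIR TABLE logs;

-- Or register each new partition by hand
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (date = '2018-10-04');
```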
  • 13. Enter Structured Streaming: you care about your business logic; Structured Streaming cares about incrementally running your logic on new data over time.
  • 14. Correctness Problems in Data Pipelines
    • Lack of atomicity leads to pockets of duplicate data
      • Structured Streaming writes a manifest file to "commit" data to a file sink, so new files appear atomically to Spark
    • Bookkeeping of what to process is tedious: late / out-of-order data
      • The engine keeps track of what data has been processed and what data is new
      • Watermark support allows "correct" processing of late / out-of-order data
    • Schema Management
    • Maintaining Data Hygiene: checks/corrections
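As a rough sketch of what the engine handles for you (the paths, schema, and dedup keys are hypothetical; this mirrors the pattern described above, not Databricks' exact job):

```python
def start_etl_stream(spark, source_path, sink_path, checkpoint_path):
    """Incrementally convert JSON logs to Parquet.

    The checkpoint tracks which input files have been processed, and each
    micro-batch is committed atomically to the file sink via its manifest
    (_spark_metadata), so downstream readers never see partial batches.
    `spark` is an active SparkSession.
    """
    logs = (spark.readStream
            # Explicit schema avoids the inference drift described in slide 11
            .schema("timestamp TIMESTAMP, deployment STRING, body STRING")
            .json(source_path))

    # The watermark bounds how long state is kept for late / out-of-order
    # events; duplicates within that window are dropped.
    deduped = (logs
               .withWatermark("timestamp", "2 hours")
               .dropDuplicates(["deployment", "timestamp"]))

    return (deduped.writeStream
            .format("parquet")
            .option("checkpointLocation", checkpoint_path)  # processed-data bookkeeping
            .option("path", sink_path)
            .trigger(processingTime="1 minute")
            .start())
```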
  • 15. Performance Problems in Data Pipelines
    • Listing files on blob storage systems is slow
      • The manifest lists which files were written by Structured Streaming, so there is no longer any need to list
    • Lots of small files hurt query performance
    • The HiveMetaStore experience is horrendous
      • Doesn't scale well
      • Having to call MSCK REPAIR TABLE and REFRESH TABLE all the time
  • 16. Enter Databricks Delta
    • Separates compute from storage
      • No dependency on the HiveMetaStore: manages metadata internally
      • Scales to billions of partitions and/or files
    • Supports batch/stream reads/writes
      • ACID transactions
      • Compaction and indexing
      • Schema management / invariant support
      • Tables auto-update and provide snapshot isolation
      • DELETE / UPDATE / MERGE
    • Leverages data locality through DBIO caching
    • Coming Q4 2018: querying old versions of the table + rollbacks
  • 17. Correctness Problems in Data Pipelines
    • Lack of atomicity leads to pockets of duplicate data
      • Delta provides ACID transactions and snapshot isolation
    • Bookkeeping of what to process is tedious: late / out-of-order data
      • Structured Streaming handles this; streaming into Delta provides exactly-once semantics
    • Schema Management
      • Delta manages the schema of the table internally and allows "safe" (opt-in) evolutions
    • Maintaining Data Hygiene: checks/corrections
      • Delta supports DELETE / UPDATE to delete or fix records
      • Delta supports invariants (NOT NULL, enum in ('A', 'B', 'C'))
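For instance, the MB-to-GB unit mistake from slide 11 could be repaired in place with Delta's DML support (table and column names are hypothetical):

```sql
-- Fix records that were written with the wrong unit
UPDATE logs SET size = size / 1024 WHERE ingest_date < '2018-10-04';

-- Drop garbage rows left behind by a partial failure
DELETE FROM logs WHERE body IS NULL;
```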
  • 18. Performance Problems in Data Pipelines
    • Listing files on blob storage systems is slow
      • Delta doesn't need to list files; it keeps file information in its state
    • Lots of small files hurt query performance
      • Delta's OPTIMIZE method compacts data without affecting in-flight queries
    • The HiveMetaStore experience is horrendous
      • Delta uses Spark jobs to compute its state, so metadata is scalable!
      • Delta auto-updates tables, so you don't need REFRESH TABLE / MSCK REPAIR TABLE / ALTER TABLE ADD PARTITIONS, etc.
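The compaction step is a single command on Databricks Delta; since readers see a consistent snapshot, in-flight queries keep using the old files until they finish (the table and Z-ORDER column are hypothetical choices):

```sql
-- Coalesce many small files into larger ones without blocking readers
OPTIMIZE logs ZORDER BY (deployment);
```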
  • 20. New Requirements
    • Launched
    • Can't leverage Kinesis anymore
    • Have to replicate the pipeline in many Azure regions
  • 22. Evolution of Data Pipelines @ Databricks: [diagram] event-based sources feed a chain of Delta tables (Bronze Tables → Silver Tables → Gold Tables) serving Reporting and Streaming Analytics.
  • 23. Event-Based File Sources
    • Launched Structured Streaming connectors:
      • s3-sqs on AWS (DBR 3.5)
      • abs-aqs on Azure (DBR 5.0)
    • As blobs are generated:
      • Events are published to SQS/AQS
      • Spark reads these events
      • Then reads the original files from the blob storage system
    • [diagram: Azure Blob Storage → Event Grid → Queue Storage; AWS S3 → AWS SQS]
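Wiring up the s3-sqs connector looks roughly like this; the option names follow the Databricks Runtime documentation for this source, but treat the exact options, queue URL, and schema as assumptions of this sketch:

```python
def read_s3_events(spark, queue_url, schema):
    """Read new-file notifications from SQS instead of listing S3.

    S3 publishes object-created events to the SQS queue; Spark consumes
    those events and then fetches only the referenced files from S3,
    avoiding the slow file listing entirely. `spark` is a SparkSession
    on Databricks Runtime 3.5+.
    """
    return (spark.readStream
            .format("s3-sqs")                 # Databricks Runtime connector
            .option("queueUrl", queue_url)    # SQS queue receiving S3 events
            .option("fileFormat", "json")     # format of the underlying blobs
            .schema(schema)
            .load())
```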
  • 24. Properties of Bronze/Silver/Gold
    • Bronze tables
      • No data processing
      • Deduplication + JSON => Parquet conversion
      • Data kept around for a couple of weeks in order to fix mistakes, just in case
    • Silver tables
      • Directly queryable tables
      • PII masking/redaction
    • Gold tables
      • Materialized views of silver tables
      • Tables curated by the Data Science team
  • 25. Dealing with GDPR
    • Delta's in-built support for DELETE and UPDATE makes data subject requests (DSRs) tractable
      • Delete or update the records
      • Run VACUUM after 7 days (configurable) and the old data is gone!
    • Check out the blog post for more details!
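A DSR then comes down to two statements: the DELETE removes the records from the table's current version, and the VACUUM later removes the underlying files that still contain them (retention shown as 7 days = 168 hours; table and column names are hypothetical):

```sql
-- Remove the data subject's records from the table's current version
DELETE FROM user_events WHERE user_id = 'subject-42';

-- After the retention window, physically remove files no longer
-- referenced by the current table version
VACUUM user_events RETAIN 168 HOURS;
```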
  • 26. Data Pipeline V3
    • Event-based file sources avoid file listing altogether
    • Scales even better than V2
    • Easy to replicate across clouds and regions
    • The Run Once trigger gives all the benefits of Structured Streaming with the cost benefits of running batch jobs
    • Delta makes GDPR easy
    • Latency went from 15 seconds to 5 minutes for the general case
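The Run Once trigger mentioned above is essentially a one-line change: the same streaming query, with its checkpointed bookkeeping, runs as a schedulable batch job (paths are hypothetical; Delta as a streaming source/sink is assumed, per the Databricks Delta features described earlier):

```python
def run_once_etl(spark, source_path, sink_path, checkpoint_path):
    """Process everything new since the last run, then stop.

    The checkpoint makes repeated invocations incremental and exactly-once
    into the Delta sink, so this function can be cron-scheduled like a
    batch job while keeping Structured Streaming's guarantees.
    """
    return (spark.readStream
            .format("delta")
            .load(source_path)
            .writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .trigger(once=True)   # drain all available data once, then shut down
            .start(sink_path))
```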
  • 27. Other Techniques
    • Leveraging S3 Inventory and Delta's transaction log to unearth value from 500 million JSON files
    • Using S3 Select to reduce data size
    • Using Continuous Processing to process data from Kafka and write to Kafka at sub-millisecond latencies
    • All available with Databricks Runtime!
  • 28. Summary
    • File listing and many small files hurt performance
      • Using Delta and event-based notification sources helps us avoid listing
      • Delta's in-built compaction and indexing alleviate the small-file problem
    • The HiveMetaStore can become the bottleneck for large tables
    • Partial / distributed failures can taint tables
    • Schema Management and Data Hygiene are hard problems
    • GDPR adds extra complexity to pipelines through DSRs
  • 29. Summary
    • File listing and many small files hurt performance
    • The HiveMetaStore can become the bottleneck for large tables
      • Delta uses Spark jobs to manage its metadata, scaling to billions of files
      • Delta auto-updates => no need to call REFRESH TABLE with Spark
      • No need to add/remove partitions; no need for MSCK REPAIR TABLE
    • Partial / distributed failures can taint tables
    • Schema Management and Data Hygiene are hard problems
    • GDPR adds extra complexity to pipelines through DSRs
  • 30. Summary
    • File listing and many small files hurt performance
    • The HiveMetaStore can become the bottleneck for large tables
    • Partial / distributed failures can taint tables
      • Delta's ACID transactions guard us against garbage data
      • Always get a consistent (possibly stale) view of your table with Delta
    • Schema Management and Data Hygiene are hard problems
    • GDPR adds extra complexity to pipelines through DSRs
  • 31. Summary
    • File listing and many small files hurt performance
    • The HiveMetaStore can become the bottleneck for large tables
    • Partial / distributed failures can taint tables
    • Schema Management and Data Hygiene are hard problems
      • Delta has in-built schema management to only allow safe changes
      • Invariants in Delta prevent unexpected data from polluting tables
      • The Delta architecture (Bronze-Silver-Gold tables) combined with Delta makes backfills and corrections easier
      • Delta's upcoming support for rollbacks will make corrections effortless
    • GDPR adds extra complexity to pipelines through DSRs
  • 32. Summary
    • File listing and many small files hurt performance
    • The HiveMetaStore can become the bottleneck for large tables
    • Partial / distributed failures can taint tables
    • Schema Management and Data Hygiene are hard problems
    • GDPR adds extra complexity to pipelines through DSRs
      • UPDATE / DELETE support in Delta makes this easier
  • 33. Further Reading
    • On Structured Streaming:
      • https://databricks.com/blog/2017/08/24/anthology-of-technical-assets-on-apache-sparks-structured-streaming.html
    • On Delta:
      • https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html
      • https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
      • https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html
      • https://databricks.com/blog/2018/07/19/simplify-streaming-stock-data-analysis-using-databricks-delta.html
      • https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html
  • 34. Thank You “Do you have any questions for my prepared answers?” – Henry Kissinger