Next Generation Apache Spark
Structured Streaming
Karthik Ramasamy
Head of Streaming, Databricks
Project #Lightspeed
Stream Processing
DBMS / CDC, Apps, collection agents, IoT devices
Streaming data lands in message bus (e.g. Pulsar, Kafka) / Files
Window aggregation
Pattern detection
Enrichment
Routing
Streaming Transformations
Data continuously, incrementally processed as it appears
Triggers and Alerts
Real-time Analytics
Applications
Operational Applications
Explosion of streaming
Trillions of rows of data processed from thousands of sources
Manufacturing
Retail
Financial Services
Healthcare
Energy
Gaming
Technology & Software
Media & Entertainment
Fraud Detection
Personalization
Covid-19 Response
Predictive Maintenance
Smart Pricing
Player Interaction Analytics
Connected Cars, Smart Homes
Content Recommendations
Growth of Spark Structured Streaming
>150% YoY streaming job growth
Most downloaded streaming engine from Maven Central
1200+ customers using Structured Streaming on the Lakehouse
9x growth in usage in 3 years
Spark Structured Streaming
Powers thousands of your everyday life applications today
Unified Batch & Streaming APIs
Lets developers use the same business logic across batch and stream processing
Fault Tolerance & Recovery
Automatic checkpointing & failure recovery allowing for reliable operations
Performance | Throughput
Handles > 14M events/sec (1.2T events per day) for the most challenging workloads
Flexible operations
Arbitrary logic and operations on the output of a streaming query
Stateful Processing
Support for stateful aggregations and joins along with watermarks for bounded states
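
To make the stateful-processing point concrete, here is a minimal plain-Python sketch (not Spark code) of how a watermark can bound the state kept for a windowed count. The class name `WindowedCounter`, the 10-second window, and the 5-second allowed lateness are illustrative assumptions; Spark's actual semantics, such as when the watermark advances, differ in detail.

```python
from collections import defaultdict


class WindowedCounter:
    """Toy tumbling-window counter whose state is bounded by a watermark.

    Hypothetical illustration only; this is not Spark's implementation.
    """

    def __init__(self, window_s=10, allowed_lateness_s=5):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = 0
        self.counts = defaultdict(int)  # window start -> count (open state)
        self.finalized = {}             # window start -> final count
        self.dropped = 0                # events that arrived too late

    def watermark(self):
        # The watermark trails the largest event time seen so far.
        return self.max_event_time - self.allowed_lateness_s

    def process(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        start = (event_time // self.window_s) * self.window_s
        if start + self.window_s <= self.watermark():
            # The window has already been closed: the event is too late.
            self.dropped += 1
        else:
            self.counts[start] += 1
        self._evict()

    def _evict(self):
        # Finalize and drop state for windows ending at or before the watermark.
        wm = self.watermark()
        for start in sorted(self.counts):
            if start + self.window_s <= wm:
                self.finalized[start] = self.counts.pop(start)


counter = WindowedCounter()
for t in [1, 3, 12, 8, 16, 2, 27]:  # event times in seconds, out of order
    counter.process(t)
```

In this run the event at time 8 arrives out of order but is still counted, while the event at time 2 arrives after its window has been finalized and is dropped. Only the still-open window remains in `counter.counts`, which is the sense in which the watermark keeps state bounded.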
New streaming applications
Proactive Maintenance in Oil Drilling
Elevator Dispatch
Tracing Microservices
1. Consistent sub-second latency
2. Ease of expressing processing logic for complex use cases
3. Integrations with new cloud source and sink systems
Structured Streaming needs to evolve to satisfy these new requirements
Project Lightspeed
Next generation of Spark Structured Streaming
Project Lightspeed
Faster and simpler stream processing
Predictable Low Latency
Target reduction in tail latency by up to 2x
Enhanced Functionality
Advanced capabilities for processing data with new operators and easy to use APIs
Operations & Troubleshooting
Simplifying deployment, operations, monitoring, and troubleshooting
Connectors & Ecosystem
Improving ecosystem support for connectors, authentication & authorization features
Project Lightspeed - Predictable Low Latency
Faster bookkeeping - Offset management
[Figure: two timelines of micro-batch execution against external storage. Sequential: each micro-batch persists its offset ranges, is processed, and is then marked done, one step after another. Overlapped: offset ranges are persisted asynchronously while the micro-batches are processed.]
Sequential: 440 ms; Overlapped: 120 ms
73% improvement in latency for stateless pipelines
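
The arithmetic behind the slide's figure, together with a toy latency model of the two schemes, can be sketched as follows. Only the 440 ms and 120 ms figures come from the slide. The function names and the per-step durations are hypothetical and are not meant to reproduce the talk's measurement. The overlapped model assumes the asynchronous offset write is fully hidden behind processing and that the separate mark-batch-done write is gone.

```python
def improvement_pct(before_ms, after_ms):
    """Relative latency reduction, in percent."""
    return (before_ms - after_ms) / before_ms * 100


def sequential_latency(persist_ms, process_ms, mark_done_ms):
    # Every bookkeeping step blocks the micro-batch, so the costs add up.
    return persist_ms + process_ms + mark_done_ms


def overlapped_latency(persist_ms, process_ms):
    # Offsets are persisted asynchronously while the batch is processed,
    # and there is no separate "mark batch done" write (model assumption).
    return max(persist_ms, process_ms)


# Figures quoted on the slide.
slide_gain = improvement_pct(440, 120)

# Hypothetical per-step durations in milliseconds (not from the talk).
seq = sequential_latency(persist_ms=50, process_ms=100, mark_done_ms=50)
ovl = overlapped_latency(persist_ms=50, process_ms=100)
model_gain = improvement_pct(seq, ovl)
```

With the slide's numbers, `slide_gain` is about 72.7, which rounds to the quoted 73%.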
Project Lightspeed - Python as a first class citizen
agg()
count()
min()
max()
mean()
groupby()
orderby()
select()
selectExpr()
distinct()
where()
map()
mapValues()
flatMap()
flatMapValues()
csv()
json()
parquet()
orc()
schema()
text()
foreach()
foreachBatch()
Input & Output
Aggr & Grouping
awaitTermination()
exception()
explain()
status
stop()
Query Management
crossJoin()
crosstab()
join()
union()
unionAll()
Joins, etc
Filtering
createGlobalTempView()
createTempView()
drop()
drop_duplicates()
registerTempTable()
DDL Operations
window()
session_window()
Windowing
mapGroupsWithState()
flatMapGroupsWithState()
Arbitrary Stateful Processing
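
The speaker notes for this slide name exponentially weighted averages as a motivating case for arbitrary stateful processing. A minimal plain-Python sketch of that per-key state update follows. It is not the Spark API (the Python API is tracked under SPARK-39590); the function name `update_ewma`, the smoothing factor `alpha`, and the sample readings are assumptions made for illustration.

```python
def update_ewma(state, key, value, alpha=0.5):
    """Update the per-key exponentially weighted moving average in place.

    `state` maps key -> current average; the first value seeds the average.
    """
    previous = state.get(key)
    if previous is None:
        state[key] = float(value)
    else:
        state[key] = alpha * value + (1 - alpha) * previous
    return state[key]


# Hypothetical sensor readings arriving as (key, value) pairs.
events = [
    ("pump-1", 10),
    ("pump-2", 4),
    ("pump-1", 20),
    ("pump-1", 30),
    ("pump-2", 8),
]

state = {}  # stands in for an engine-managed, fault-tolerant state store
for key, value in events:
    update_ewma(state, key, value)
```

The point of the sketch is that each key carries a small piece of state from one event to the next. A streaming engine would have to keep that state durable and run the user's Python function against it.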
Project Lightspeed - Improve Debuggability
Visualize the pipeline as data flow
Provide timeline view of metrics for operators
Group operator metrics by executor
Incorporate source and sink specific metrics
and many more…
Interested in Collaboration?
SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency
Karthik Ramasamy
Head of Streaming
Thank you

Data Con LA 2022 Keynote


Editor's notes

  1. [Transition to Karthik] Over the last 6-9 months we have invested heavily in building up a strong streaming team that is going to take Structured Streaming and elevate it to the next level. We have Karthik, the CEO of Pulsar, here to present this talk. He previously built a very popular streaming engine that many of you may have used, and today we are very excited to introduce Karthik to share our vision for growing Structured Streaming to the next level.
  2. We have seen an explosion of streaming applications across all industries. In fact, data streaming is part of your everyday life and is reshaping every industry you can imagine: in finance… in retail… in healthcare… in manufacturing…
  3. We have seen an explosion of streaming applications across all industries. In fact, data streaming is part of your everyday life and is reshaping every industry you can imagine: in finance… in retail… in healthcare… in manufacturing…
  4. KARTHIK: Thank you, Ali. We are very data-driven at Databricks, and of all the metrics we have been looking at, this is the most surprising statistic I have seen here. We have not even done much on this: we developed Structured Streaming many years ago, not much investment went into it, and still the growth is 160% on a large base. This is a significant portion of our revenue. Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities. The majority of streaming workloads we saw were customers migrating their batch workloads to take advantage of the lower latency, fault tolerance, and support for incremental processing that streaming has to offer. The result is tremendous adoption by streaming customers of both open source Spark and Databricks. The graph below shows the weekly number of streaming jobs on Databricks over the past three years, which has grown from thousands to more than 3 million and is still accelerating. [Per Matei: update this. Do not use the graph; instead say that a double-digit percentage of our workflows is streaming, give a number here, and note that we see it increasing over time. X many trillions of records per day.]
  5. …and many of our customers, from enterprises to startups, have adopted and are continuing to adopt streaming in the lakehouse.
  6. Why do I believe Spark Structured Streaming is growing? Several properties of Structured Streaming have made it popular, and here are the top five.
     • Unification: The foremost advantage of Structured Streaming is that it uses the same API as batch processing, making the transition from batch to real-time processing much simpler.
     • Fault Tolerance & Recovery: Structured Streaming checkpoints state automatically at every stage of processing. When a failure occurs, it automatically recovers from the previous state. Failure recovery is very fast because it is restricted to the failed tasks, as opposed to restarting the entire streaming pipeline as in other systems. As far as I know, SS runs on spot instances, which makes streaming cost-effective.
     • Performance: Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations in the Spark SQL engine.
     • Flexible Operations: The ability to apply arbitrary logic and operations on the output of a streaming query using foreachBatch. This enables developers to perform operations like upserts and writes to multiple sinks, and to interact with external data sources. Over 40% of our users on Databricks take advantage of this feature.
     • Stateful Processing: Support for stateful aggregations and joins, along with watermarks for bounded state and late order processing. In addition, arbitrary stateful operations with [flat]mapGroupsWithState, backed by a RocksDB state store, are provided for efficient and fault-tolerant state management (as of Spark 3.2).
  7. As SS grew by leaps and bounds, developers started using it for emerging new applications such as: monitoring expensive drill bits continuously and stopping them from hitting rock surfaces; continuously monitoring elevator data for emergencies and quickly alerting dispatch; and stitching together the requests and responses from the logs of the microservices that serve a web request, for tracing and troubleshooting. These exposed some of the shortcomings of SS, such as … I think that if we can address all of these, we will be able to increase adoption and see skyrocketing growth. So,
  8. What are we doing about it?
  9. I am very excited to announce that we are launching Project Lightspeed to take SS into its next generation.
  10. Project Lightspeed advances SS across four pillars… In the next few slides, I will give a glimpse of some of the Lightspeed features.
  11. SS has several bookkeeping steps: (b) plan offset ranges, (e) mark batch done. These are forced into storage, (b) and (a), and in sequence, which increases latency. In the default trigger, eliminate (e) and overlap the execution of the micro-batch with storing the offset range asynchronously.
  12. SS pipelines can be programmed in multiple languages: Java, Scala, Python and SQL. Python is a popular choice. Python provides several APIs… But there is a gap: arbitrary stateful processing, which is needed for things like an exponentially weighted average. The key challenge with this API is executing arbitrary Python code in a JVM system.
  13. Streaming pipelines are brittle. There can be several reasons: a surge in the data to be processed, resources not adequately provisioned, or a bug in user code. SS provides tons of metrics & logs at the micro-batch level.