Data Con LA 2022 Keynote
1. Next Generation Apache Spark
Structured Streaming
Karthik Ramasamy
Head of Streaming, Databricks
Project #Lightspeed
2. Stream Processing
Sources: DBMS / CDC, apps, collection agents, IoT devices
Streaming data lands in a message bus (e.g. Pulsar, Kafka) or files
Streaming transformations: window aggregation, pattern detection, enrichment, routing
Data continuously, incrementally processed as it appears
Outputs: triggers and alerts, real-time analytics, applications, operational applications
3. Explosion of streaming
Trillions of rows of data processed from thousands of sources
Industries: manufacturing, retail, financial services, healthcare, energy, gaming, technology & software, media & entertainment
Use cases: fraud detection, personalization, Covid-19 response, predictive maintenance, smart pricing, player interaction analytics, connected cars, smart homes, content recommendations
4. Growth of Spark Structured Streaming
>150% YoY streaming job growth
Most downloaded streaming engine from Maven Central
6. Spark Structured Streaming
Powers thousands of your everyday life applications today
Unified Batch & Streaming APIs
Lets developers use the same business logic across batch and stream processing
Fault Tolerance & Recovery
Automatic checkpointing & failure recovery allowing for reliable operations
Performance | Throughput
Handles > 14M events/sec (1.2T events per day) for the most challenging workloads
Flexible operations
Arbitrary logic and operations on the output of a streaming query
Stateful Processing
Support for stateful aggregations and joins along with watermarks for bounded states
7. New streaming applications
1. Proactive maintenance in oil drilling
2. Elevator dispatch
3. Tracing microservices
Common needs: consistent sub-second latency; ease of expressing processing logic for complex use cases; integrations with new cloud source and sink systems
10. Project Lightspeed
Faster and simpler stream processing
Predictable Low Latency: target reduction in tail latency by up to 2x
Enhanced Functionality: advanced capabilities for processing data with new operators and easy-to-use APIs
Operations & Troubleshooting: simplifying deployment, operations, monitoring, and troubleshooting
Connectors & Ecosystem: improving ecosystem support for connectors, authentication & authorization features
13. Project Lightspeed - Improve Debuggability
Visualize the pipeline as data flow
Provide timeline view of metrics for operators
Group operator metrics by executor
Incorporate source and sink specific metrics
<TRANSITION TO KARTHIK>
So what happened in the last 6-9 months is that we’ve invested heavily in building up a strong streaming team that’s going to take Structured Streaming and elevate it to the next level.
We have Karthik, the CEO of Pulsar, who is going to present this talk. He built a very popular streaming engine prior to this that many of you may have used…
And today we are very excited to introduce Karthik to share our vision to grow Structured Streaming to the next level…
We have seen an explosion of streaming applications across all industries…
In fact, data streaming is part of your everyday life and is reshaping and transforming every industry you can imagine…
In finance… in retail… in healthcare… in manufacturing…
KARTHIK….
Thank you Ali
We are very data-driven at Databricks, and we’ve been looking at the metrics; of all the numbers we’ve seen, this is the most surprising statistic I’ve seen at Databricks.
And we haven’t even done much on this. In fact, we developed Structured Streaming many years ago and not much investment went into it, and still the growth is 160% on a large base. This is a significant portion of our revenue.
Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities. The majority of streaming workloads we saw were customers migrating their batch workloads to take advantage of the lower latency, fault tolerance, and support for incremental processing that streaming has to offer. The result is that we have seen tremendous adoption from streaming customers for both open source Spark and Databricks. The graph below shows the weekly number of streaming jobs on Databricks over the past three years, which has grown from thousands to 3+ million, and is still accelerating.
Per Matei - to update, not to use graph, but to say a double digit percentage of our workflows is streaming and have a number here and we see that increasing over time. X many trillions of records p/day.
…and many of our customers, from enterprises to startups, have adopted and continue to adopt streaming in the lakehouse…
Why do I believe Spark Structured Streaming is growing? Several properties of Structured Streaming have made it popular; here are the top five.
Unification - The foremost advantage of Structured Streaming is that it uses the same API as batch processing, making the transition from batch to real-time processing much simpler.
Fault Tolerance & Recovery - Structured Streaming checkpoints state automatically at every stage of processing. When a failure occurs, it automatically recovers from the previous state. Failure recovery is very fast since it is restricted to the failed tasks, as opposed to restarting the entire streaming pipeline as in other systems. Structured Streaming can also run on spot instances, making streaming cost-effective.
Performance - Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations in the Spark SQL engine.
Flexible Operations - The ability to apply arbitrary logic and operations to the output of a streaming query using foreachBatch. This enables developers to perform operations like upserts and writes to multiple sinks, as well as interactions with external data sources. Over 40% of our users on Databricks take advantage of this feature.
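The upsert pattern that foreachBatch enables can be illustrated with a toy pure-Python sketch; the dict stands in for a keyed sink such as a Delta table or database, and the function names are illustrative, not the Spark API itself:

```python
# Sketch of the upsert pattern: each micro-batch's output is merged into a
# keyed sink. Inserts happen for new keys; existing keys are overwritten.
def upsert_batch(sink, batch):
    """Merge one micro-batch of (key, value) rows into the sink."""
    for key, value in batch:
        sink[key] = value  # insert new keys, overwrite existing ones
    return sink

sink = {}
upsert_batch(sink, [("a", 1), ("b", 2)])   # initial inserts
upsert_batch(sink, [("b", 20), ("c", 3)])  # "b" is updated, "c" inserted
print(sink)
```

In Spark the same idea is expressed by passing a batch-handling function to `writeStream.foreachBatch(...)`, which receives each micro-batch as a regular DataFrame.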
Stateful Processing - Support for stateful aggregations and joins, along with watermarks for bounded state and late, out-of-order data processing. In addition, arbitrary stateful operations with [flat]mapGroupsWithState, backed by a RocksDB state store, are provided for efficient and fault-tolerant state management (as of Spark 3.2).
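Why watermarks keep state bounded can be shown with a toy pure-Python sketch of tumbling-window counting: windows that fall fully below the watermark are finalized and their state evicted, and events arriving later than that are dropped. This mirrors the concept only, not Spark's API; all names and the window/delay parameters are illustrative.

```python
# Toy watermark-bounded tumbling-window count. `window` is the window width,
# `delay` the allowed lateness; watermark = max event time seen - delay.
def tumbling_count(events, window=10, delay=5):
    state, results, max_ts = {}, {}, 0
    for ts in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - delay
        start = (ts // window) * window
        if start + window <= watermark:
            continue  # too late: window already finalized, event dropped
        state[start] = state.get(start, 0) + 1
        # finalize (emit and evict) every window fully below the watermark
        for s in [s for s in state if s + window <= watermark]:
            results[s] = state.pop(s)
    results.update(state)  # flush remaining open windows at end of stream
    return results

print(tumbling_count([1, 3, 12, 2, 25, 4]))
```

The late event `2` still lands in window 0 because that window is above the watermark when it arrives, while the later event `4` is dropped after the watermark has passed window 0; crucially, `state` never holds more than the few windows near the watermark.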
As SS grew in leaps and bounds, developers started using it for emerging new applications such as …
Monitor expensive drill bits continuously and stop them from hitting rock surfaces
Continuously monitor the data from elevators for emergencies and quickly alert dispatch
Stitch the requests and responses from logs of microservices that serve a web request for tracing and troubleshooting
These exposed some of the shortcomings of SS such as …
I think if we can address all of these, we will be able to increase adoption and see skyrocketing growth.
So,
What are we doing about it?
I am very excited to announce that we are launching Project Lightspeed to take SS into the next generation.
Project Lightspeed advances SS across four pillars…
In the next few slides, I will give a glimpse of some of the Lightspeed features
SS performs several bookkeeping operations for each micro-batch, e.g. (b) planning and persisting the offset range and (e) marking the batch done in the commit log. Both (b) and (e) are forced into durable storage, and in sequence, which increases latency.
In the default trigger, eliminate (e) and overlap the execution of the micro-batch with storing the offset range asynchronously.
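The overlap idea can be sketched in pure Python (this is an illustration of the scheduling pattern, not Spark's internals): kick off the durable offset-range write asynchronously, run the micro-batch concurrently, and only join on the write before declaring the batch complete.

```python
# Illustrative sketch: overlapping the asynchronous persist of a micro-batch's
# offset range with the micro-batch execution, instead of persisting
# synchronously before execution begins.
from concurrent.futures import ThreadPoolExecutor

class OffsetLog:
    def __init__(self):
        self.entries = []
    def persist(self, batch_id, offsets):
        # In a real engine this would be a durable write (e.g. cloud storage).
        self.entries.append((batch_id, offsets))

def run_micro_batch(batch_id, records):
    # Stand-in for executing the planned micro-batch.
    return sum(records)

def process_batch_async(batch_id, offsets, records, log, pool):
    # Start the offset-range write and the batch execution concurrently.
    persist_future = pool.submit(log.persist, batch_id, offsets)
    result = run_micro_batch(batch_id, records)
    persist_future.result()  # join before declaring the batch complete
    return result

log = OffsetLog()
with ThreadPoolExecutor(max_workers=1) as pool:
    out = process_batch_async(0, (100, 200), [1, 2, 3], log, pool)
print(out, log.entries)
```

Because the persist no longer sits on the critical path ahead of execution, the per-batch latency approaches max(write time, execution time) rather than their sum.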
SS pipelines can be programmed in multiple languages: Java, Scala, Python, and SQL. Python is a popular choice and provides several APIs, but there is a gap: arbitrary stateful processing, needed for example for exponentially weighted averages. The key challenge with this API is executing arbitrary Python code in a JVM-based system.
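The per-key update logic such an arbitrary stateful operator would run, using the exponentially weighted average example, can be sketched in pure Python; the function name and the simulated micro-batches are illustrative, not Spark API.

```python
# Minimal sketch of per-key arbitrary stateful processing: maintain an
# exponentially weighted average per key across micro-batches, the kind of
# logic [flat]mapGroupsWithState runs for each group.
def update_ewa(state, values, alpha=0.5):
    """Fold a batch of new values for one key into the running EWA."""
    ewa = state  # None means no prior state exists for this key
    for v in values:
        ewa = v if ewa is None else alpha * v + (1 - alpha) * ewa
    return ewa

# Simulate two micro-batches arriving for one key.
state = None
state = update_ewa(state, [10.0])  # first value seeds the average
state = update_ewa(state, [20.0])  # 0.5*20 + 0.5*10 = 15.0
print(state)
```

The engine's job is to persist `state` between micro-batches and hand it back on the next invocation; doing that for arbitrary Python functions inside a JVM engine is exactly the gap described above.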
Streaming pipelines are brittle. There can be several reasons: a surge in the data to be processed, inadequately provisioned resources, or a bug in user code. SS provides tons of metrics & logs at the micro-batch level.