A Deep Dive into Stateful Stream
Processing in Structured Streaming
Spark Summit 2018
5th June, San Francisco
Tathagata “TD” Das
@tathadas
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
you
should not have to
reason about streaming
you
should write simple queries
&
Spark
should continuously update the answer
Treat Streams as Unbounded Tables
data stream = unbounded input table
new data in the data stream
=
new rows appended
to an unbounded table
Anatomy of a Streaming Query
Example
Read JSON data from Kafka
Parse nested JSON
Store in structured Parquet table
Get end-to-end failure guarantees
ETL
Anatomy of a Streaming Query
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
Source
Specify where to read data from
Built-in support for Files / Kafka /
Kinesis*
Can include multiple sources of
different types using join() / union()
*Available only on Databricks Runtime
returns a Spark DataFrame
(common API for batch & streaming data)
Anatomy of a Streaming Query
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
Kafka DataFrame
key value topic partition offset timestamp
[binary] [binary] "topic" 0 345 1486087873
[binary] [binary] "topic" 3 2890 1486086721
Anatomy of a Streaming Query
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
Transformations
Cast bytes from Kafka records to a
string, parse it as a json, and
generate nested columns
100s of built-in, optimized SQL
functions like from_json
user-defined functions, lambdas,
function literals with map, flatMap…
Anatomy of a Streaming Query
Sink
Write transformed output to
external storage systems
Built-in support for Files / Kafka
Use foreach to execute arbitrary
code with the output data
Some sinks are transactional and
exactly once (e.g. files)
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
Anatomy of a Streaming Query
Processing Details
Trigger: when to process data
- Fixed interval micro-batches
- As fast as possible micro-batches
- Continuously (new in Spark 2.3)
Checkpoint location: for tracking the
progress of the query
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
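The .trigger("1 minute") above is slide shorthand; in the actual API the trigger is specified with Trigger.ProcessingTime. A minimal self-contained Scala sketch of the same query, assuming a hypothetical broker address, JSON schema, and checkpoint path:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
import spark.implicits._

// hypothetical schema of the JSON payload
val schema = new StructType().add("device", StringType).add("signal", IntegerType)

val query = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")      // assumed broker address
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "/checkpoints/etl")     // assumed path
  .start()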
DataFrames,
Datasets, SQL
Logical
Plan
Read from
Kafka
Project
device, signal
Filter
signal > 15
Write to
Parquet
Spark automatically streamifies!
Spark SQL converts batch-like query to a series of incremental
execution plans operating on new micro-batches of data
Kafka
Source
Optimized
Operator
codegen, off-heap, etc.
Parquet
Sink
Optimized
Plan
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
Series of Incremental
Execution Plans
[Diagram: at t = 1, 2, 3, … each trigger processes the new data as an incremental micro-batch]
Fault-tolerance with Checkpointing
Checkpointing
Saves processed offset info to stable storage
Saved as JSON for forward-compatibility
Allows recovery from any failure
Can resume after limited changes to your
streaming transformations (e.g. adding new
filters to drop corrupted data, etc.)
offsets saved in a write-ahead log, enabling
end-to-end exactly-once guarantees
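A small sketch of how recovery works in practice: restarting a query with the same checkpointLocation resumes from the offsets saved there (streamingDF and the paths are assumptions carried over from the earlier slides).

// first run: progress (offsets, state) is tracked under the checkpoint location
val query = streamingDF.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/etl")
  .start()

query.stop()   // or the cluster/driver fails

// restart: the same checkpoint location makes the new run resume from the
// last committed offsets instead of reprocessing the whole stream
val restarted = streamingDF.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/etl")
  .start()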
Anatomy of a Streaming Query
ETL
Raw data from Kafka available
as structured data in seconds,
ready for querying
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
2x faster
Structured Streaming reuses
the Spark SQL Optimizer
and Tungsten Engine
Performance: Benchmark
40-core throughput
[Chart: Kafka Streams ~700K records/s; Apache Flink ~33M records/s; Structured Streaming ~65M records/s]
More details in our blog post
cheaper
Stateful
Stream Processing
What is Stateless Stream Processing?
Stateless streaming queries (e.g.
ETL) process each record
independent of other records
df.select(from_json("json", schema).as("data"))
.where("data.type = 'typeA')
Spark
stateless
streaming
Every record is parsed into a structured form
and then selected (or not) by the filter
What is Stateful Stream Processing?
Stateful streaming queries
combine information from
multiple records together
.count()
Spark
stateful
streaming
state
df.select(from_json("json", schema).as("data"))
.where("data.type = 'typeA'")
Count is the streaming state and every
selected record increments the count
State is the information that
is maintained for future use
Stateful Micro-Batch Processing
State is versioned between
micro-batches while streaming
query is running
Each micro-batch reads the previous
version of the state and updates it to
a new version
Versions used for fault recovery
[Diagram: micro-batch incremental execution — at t = 1, 2, 3 each micro-batch reads new data from the source, reads and updates the versioned state, and writes to the sink]
Distributed, Fault-tolerant State
State data is distributed across executors
State stored in the executor memory
Micro-batch tasks update the state
Changes are checkpointed with version to
given checkpoint location (e.g. HDFS)
Recovery from failure is automatic
Exactly-once fault-tolerance guarantees!
[Diagram: driver sends tasks to executors; state is kept in executor memory and checkpointed to HDFS]
Philosophy
of Stateful Operations
Automatic State
Cleanup
User-defined State
Cleanup
Two types
of Stateful Operations
Automatic State
Cleanup
User-defined State
Cleanup
For SQL operations with well-defined semantics
State cleanup is automatic
with watermarking because
we precisely know when state
data is not needed any more
For user-defined, arbitrary
stateful operations
No automatic state cleanup
User has to explicitly
manage state
Automatic State
Cleanup
User-defined State
Cleanup
aggregations
deduplications
joins
streaming sessionization
with mapGroupsWithState and
flatMapGroupsWithState
Rest of this talk
Explore built-in stateful operations
How to use watermarks to control state size
How to build arbitrary stateful operations
How to monitor and debug stateful queries
Streaming Aggregation
Aggregation by key and/or time windows
Aggregation by key only
Aggregation by event time windows
Aggregation by both
Supports multiple aggregations,
user-defined functions (UDAFs)!
events
.groupBy("key")
.count()
events
.groupBy(window("timestamp","10 mins"))
.avg("value")
events
.groupBy(
col("key"),
window("timestamp","10 mins"))
.agg(avg("value"), corr("value"))
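The snippets above assume an events streaming DataFrame with key, value, and timestamp columns; a hedged sketch with the needed imports (note that corr in the Spark API takes two columns, so max is used as the second aggregate here for illustration):

import org.apache.spark.sql.functions.{avg, col, max, window}

// aggregation by key only
val byKey = events.groupBy(col("key")).count()

// aggregation by event-time window only
val byWindow = events
  .groupBy(window(col("timestamp"), "10 minutes"))
  .avg("value")

// aggregation by both key and window, with multiple aggregates
val byBoth = events
  .groupBy(col("key"), window(col("timestamp"), "10 minutes"))
  .agg(avg("value"), max("value"))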
Automatically handles Late Data
[Diagram: hourly window counts updated across successive triggers; red = state updated with late data]
Keeping state allows late data to update
counts of old windows
But size of the state increases
indefinitely if old windows are not
dropped
Watermarking
Watermark - moving threshold of
how late data is expected to be
and when to drop old state
Trails behind max event time
seen by the engine
Watermark delay = trailing gap
[Diagram: the watermark (e.g. 12:20 PM) trails the max event time seen by the engine (e.g. 12:30 PM) by a trailing gap of 10 mins; data older than the watermark is not expected]
Watermarking
Data newer than watermark may
be late, but allowed to aggregate
Data older than watermark is "too
late" and dropped
Windows older than watermark
automatically deleted to limit state
[Diagram: with a watermark delay of 10 mins, late data newer than the watermark is allowed to aggregate; data older than the watermark is too late and dropped]
Watermarking
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
Used only in stateful operations
Ignored in non-stateful streaming
queries and batch queries
Watermarking
[Diagram: processing time vs. event time — the system tracks the max observed event time (12:14); for the next trigger the watermark is updated to 12:14 - 10m = 12:04 and state for windows older than 12:04 is deleted; an event at 12:08 is late but newer than the watermark, so it is still considered in the counts; an event at 12:04 is too late, ignored in the counts, and its state dropped]
More details in my blog post
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
Watermarking
Trade-off between lateness tolerance and state size
smaller watermark delay = less late data processed, less memory consumed
larger watermark delay = more late data processed, more memory consumed
Streaming Deduplication
Streaming Deduplication
Drop duplicate records in a stream
Specify columns which uniquely
identify a record
Spark SQL will store past unique
column values as state and drop
any record that matches the state
userActions
.dropDuplicates("uniqueRecordId")
Streaming Deduplication with Watermark
Timestamp as a unique column
along with watermark allows old
values in state to be dropped
Records older than the watermark delay are
not going to get any further duplicates
Timestamp must be the same for
duplicated records
userActions
.withWatermark("timestamp")
.dropDuplicates(
"uniqueRecordId",
"timestamp")
Streaming Joins
Streaming Joins
Spark 2.0+ supports joins between streams and static datasets
Spark 2.3+ supports joins between multiple streams
Join
(ad, impression)
(ad, click)
(ad, impression, click)
Join stream of ad impressions
with another stream of their
corresponding user clicks
Example: Ad Monetization
Streaming Joins
Most of the time click events arrive after their impressions
Sometimes, due to delays, impressions can arrive after clicks
Each stream in a join
needs to buffer past
events as state for
matching with future
events of the other stream
Join
(ad, impression)
(ad, click)
(ad, impression, click)
state
state
Join
(ad, impression)
(ad, click)
(ad, impression, click)
Simple Inner Join
Inner join by ad ID column
Need to buffer all past events as
state, since a match can come on the
other stream any time in the future
To allow buffered events to be
dropped, the query needs to provide
more time constraints
impressions.join(
clicks,
expr("clickAdId = impressionAdId")
)
[Diagram: without constraints, both sides buffer unbounded (∞) state]
Inner Join + Time constraints + Watermarks
Time constraints
Let's assume
Impressions can be 2 hours late
Clicks can be 3 hours late
A click can occur within 1 hour
after the corresponding
impression
val impressionsWithWatermark = impressions
.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks
.withWatermark("clickTime", "3 hours")
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 hour
"""
))
Join
Range Join
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 hour
"""
))
Inner Join + Time constraints + Watermarks
Spark calculates
- impressions need to be
buffered for 4 hours
- clicks need to be
buffered for 2 hours
[Diagram: impressions up to 2 hrs late buffered as 4-hour state; clicks up to 3 hrs late buffered as 2-hour state]
3-hour-late click may match with
impression received 4 hours ago
2-hour-late impression may match
with click received 2 hours ago
Spark drops events older
than these thresholds
Join
Outer Join + Time constraints + Watermarks
Left and right outer joins are
allowed only with time constraints
and watermarks
Needed for correctness, Spark must
output nulls when an event cannot
get any future match
Note: null outputs are delayed as
Spark has to wait for some time to be
sure that there cannot be any match
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 hour
"""
),
joinType = "leftOuter"
)
Can be "inner" (default) /"leftOuter"/ "rightOuter"
Arbitrary Stateful
Operations
Arbitrary Stateful Operations
Many use cases require more complicated logic than SQL ops
Example: Tracking user activity on your product
Input: User actions (login, clicks, logout, …)
Output: Latest user status (online, active, inactive, …)
Solution: MapGroupsWithState
General API for per-key user-defined stateful processing
Since Spark 2.2, for Scala and Java only
MapGroupsWithState / FlatMapGroupsWithState
No automatic state cleanup or dropping of late data
Adding a watermark does not automatically manage late data or state
Explicit state cleanup by the user
More powerful + efficient than DStream's mapWithState and
updateStateByKey
MapGroupsWithState / FlatMapGroupsWithState
MapGroupsWithState - How to use?
1. Define the data structures
Input event: UserAction
State data: UserStatus
Output event: UserStatus
(can be different from state)
case class UserAction(
userId: String, action: String)
case class UserStatus(
userId: String, active: Boolean)
MapGroupsWithState
MapGroupsWithState - How to use?
2. Define function to update
state of each grouping
key using the new data
Input
Grouping key: userId
New data: new user actions
Previous state: previous status
of this user
case class UserAction(
userId: String, action: String)
case class UserStatus(
userId: String, active: Boolean)
def updateState(
userId: String,
actions: Iterator[UserAction],
state: GroupState[UserStatus]):UserStatus = {
}
MapGroupsWithState - How to use?
2. Define function to update
state of each grouping key
using the new data
Body
Get previous user status
Update user status with actions
Update state with latest user status
Return the status
def updateState(
    userId: String,
    actions: Iterator[UserAction],
    state: GroupState[UserStatus]): UserStatus = {
  val prevStatus = state.getOption.getOrElse {
    new UserStatus()
  }
  actions.foreach { action =>
    prevStatus.updateWith(action)
  }
  state.update(prevStatus)
  prevStatus
}
MapGroupsWithState - How to use?
3. Use the user-defined function
on a grouped Dataset
Works with both batch and
streaming queries
In batch query, the function is called
only once per group with no prior state
def updateState(
    userId: String,
    actions: Iterator[UserAction],
    state: GroupState[UserStatus]): UserStatus = {
  // process actions, update and return status
}
userActions
.groupByKey(_.userId)
.mapGroupsWithState(updateState)
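A consolidated sketch of steps 1-3 (the updateWith logic, the timestamp field, and the userActions source are assumptions made for illustration; a SparkSession named spark is assumed for the encoders):

import org.apache.spark.sql.streaming.GroupState
import spark.implicits._

case class UserAction(
  userId: String, action: String,
  timestamp: java.sql.Timestamp)           // event-time column, used later for watermarking
case class UserStatus(userId: String, var active: Boolean) {
  def updateWith(action: UserAction): Unit = {
    active = action.action != "logout"     // assumed status logic
  }
}

def updateState(
    userId: String,
    actions: Iterator[UserAction],
    state: GroupState[UserStatus]): UserStatus = {
  val status = state.getOption.getOrElse(UserStatus(userId, active = false))
  actions.foreach(status.updateWith)       // fold new actions into the status
  state.update(status)                     // save the updated status as the new state
  status
}

val latestStatuses = userActions           // Dataset[UserAction], e.g. parsed from Kafka
  .groupByKey(_.userId)
  .mapGroupsWithState(updateState _)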
Timeouts
Example: Mark a user as inactive when
there are no actions for 1 hour
Timeouts: When a group does not get any
event for a while, then the function is
called for that group with an empty iterator
Must specify a global timeout type, and set
per-group timeout timestamp/duration
Ignored in batch queries
userActions.withWatermark("timestamp")
.groupByKey(_.userId)
.mapGroupsWithState
(timeoutConf)(updateState)
Timeout types: EventTimeTimeout, ProcessingTimeTimeout, NoTimeout (default)
userActions
.withWatermark("timestamp")
.groupByKey(_.userId)
.mapGroupsWithState
( timeoutConf )(updateState)
Event-time Timeout - How to use?
1. Enable EventTimeTimeout in
mapGroupsWithState
2. Enable watermarking
3. Update the mapping function
Every time function is called, set
the timeout timestamp using the
max seen event timestamp +
timeout duration
Update state when timeout occurs
def updateState(...): UserStatus = {
if (!state.hasTimedOut) {
// track maxActionTimestamp while
// processing actions and updating state
state.setTimeoutTimestamp(
maxActionTimestamp, "1 hour")
} else { // handle timeout
userStatus.handleTimeout()
state.remove()
}
// return user status
}
Event-time Timeout - When?
Watermark is calculated with max event time across all groups
For a specific group, if there is no event until the watermark exceeds
the timeout timestamp,
Then
Function is called with an empty iterator, and hasTimedOut = true
Else
Function is called with new data, and timeout is disabled
Timeout timestamp needs to be explicitly set every time the function is called
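A hedged sketch of wiring this together, building on the UserAction/UserStatus classes above (the 1-hour timeout mirrors the earlier example; the watermark delay and timeout handling are illustrative):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

def updateState(
    userId: String,
    actions: Iterator[UserAction],
    state: GroupState[UserStatus]): UserStatus = {
  if (!state.hasTimedOut) {
    val status = state.getOption.getOrElse(UserStatus(userId, active = false))
    var maxActionTimestamp = 0L
    actions.foreach { a =>
      status.updateWith(a)
      maxActionTimestamp = math.max(maxActionTimestamp, a.timestamp.getTime)
    }
    state.update(status)
    // time this key out 1 hour after the latest event seen for it
    state.setTimeoutTimestamp(maxActionTimestamp, "1 hour")
    status
  } else {
    val status = state.get
    status.active = false                  // assumed timeout handling: mark inactive
    state.remove()
    status
  }
}

userActions
  .withWatermark("timestamp", "10 minutes")   // delay threshold is illustrative
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateState _)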
Processing-time Timeout
Instead of setting timeout
timestamp, function sets
timeout duration (in terms of
wall-clock-time) to wait before
timing out
Independent of watermarks
Note, query downtimes will cause
lots of timeouts after recovery
def updateState(...): UserStatus = {
if (!state.hasTimedOut) {
// handle new data
state.setTimeoutDuration("1 hour")
} else {
// handle timeout
}
return userStatus
}
userActions
.groupByKey(_.userId)
.mapGroupsWithState
(ProcessingTimeTimeout)(updateState)
FlatMapGroupsWithState
More general version where the
function can return any number
of events, possibly none at all
Example: instead of returning
user status, want to return
specific actions that are
significant based on the history
def updateState(
userId: String,
actions: Iterator[UserAction],
state: GroupState[UserStatus]):
Iterator[SpecialUserAction] = {
}
userActions
.groupByKey(_.userId)
.flatMapGroupsWithState
(outputMode, timeoutConf)
(updateState)
userActions
.groupByKey(_.userId)
.flatMapGroupsWithState
(outputMode, timeoutConf)
(updateState)
Function Output Mode
Function output mode* gives Spark insights into
the output from this opaque function
Update Mode - Output events are key-value pairs, each
output is updating the value of a key in the result table
Append Mode - Output events are independent rows
that are appended to the result table
Allows Spark SQL planner to correctly compose
flatMapGroupsWithState with other operations
*Not to be confused with output mode of the query
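A hedged sketch of passing the function output mode to flatMapGroupsWithState (the stub function body and the SpecialUserAction fields are assumptions):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class SpecialUserAction(userId: String, action: String)   // assumed output type

def updateState(
    userId: String,
    actions: Iterator[UserAction],
    state: GroupState[UserStatus]): Iterator[SpecialUserAction] = {
  // ... update state, emit zero or more significant actions ...
  Iterator.empty
}

userActions
  .groupByKey(_.userId)
  .flatMapGroupsWithState(
    OutputMode.Update, GroupStateTimeout.NoTimeout)(updateState _)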
Managing Stateful
Streaming Queries
Optimizing Query State
Set # shuffle partitions to 1-3 times number of cores
Too low = not all cores will be used → lower throughput
Too high = cost of writing state to HDFS will increase → higher latency
Total size of state per worker
Larger state leads to higher overheads of snapshotting, JVM GC pauses, etc.
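The shuffle partition count (which also determines the number of state partitions) is set through a SQL conf; a small sketch, where 32 assumes a 16-core cluster (1-3x the number of cores):

// note: for a query restarted from an existing checkpoint, the number of
// state partitions stays fixed at whatever was used in the first run
spark.conf.set("spark.sql.shuffle.partitions", "32")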
Monitoring the state of Query State
Get current state metrics using the
last progress of the query
Total number of rows in state
Total memory consumed (approx.)
Get it asynchronously through
StreamingQueryListener API
val progress = query.lastProgress
print(progress.json)
{
...
"stateOperators" : [ {
"numRowsTotal" : 660000,
"memoryUsedBytes" : 120571087
...
} ],
}
new StreamingQueryListener {
...
def onQueryProgress(
event: QueryProgressEvent)
}
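A hedged sketch of registering a listener to receive these metrics asynchronously (assumes a SparkSession named spark):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // stateOperators exposes numRowsTotal and memoryUsedBytes per stateful operator
    event.progress.stateOperators.foreach { s =>
      println(s"state rows = ${s.numRowsTotal}, state memory = ${s.memoryUsedBytes} bytes")
    }
  }
})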
Monitoring the state of Query State
Databricks Notebooks integrated with Structured Streaming
Shows size of state along with other processing metrics:
data rates, batch durations, # state keys
Debugging Query State
SQL metrics in the Spark UI (SQL
tab, DAG view) expose more
operator-specific stats
Answer questions like
- Is the memory usage skewed?
- Is removing rows slow?
- Is writing checkpoints slow?
Managing Very Large State
State data kept on JVM heap
Can have GC issues with millions of state keys per worker
Limits depend on the size and complexity of state data structures
Latency
spikes of > 20s
due to GC
Managing Very Large State with RocksDB
In Databricks Runtime, you can store state locally in RocksDB
Avoids JVM heap, no GC issues with 100 million state keys per worker
Local RocksDB snapshot files automatically checkpointed to HDFS
Same exactly-once fault-tolerant guarantees
Latency
capped
at 10s
[More info in Databricks Docs]
Continuous Processing
Millisecond-level latencies with Continuous Processing engine
Experimental release in Apache Spark 2.3
Shuffle and stateful operator support coming in future releases
More info:
Talk today! Blog post
More Info
Structured Streaming Docs
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://docs.databricks.com/spark/latest/structured-streaming/index.html
Databricks blog posts for more focused discussions
https://databricks.com/blog/category/engineering/streaming
My previous talk on the basics of Structured Streaming
https://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming
Questions?
Office hours today at 2:00 PM at
Databricks Booth
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today
databricks.com
  • 9. Anatomy of a Streaming Query spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) Transformations Cast bytes from Kafka records to a string, parse it as a json, and generate nested columns 100s of built-in, optimized SQL functions like from_json user-defined functions, lambdas, function literals with map, flatMap…
  • 10. Anatomy of a Streaming Query Sink Write transformed output to external storage systems Built-in support for Files / Kafka Use foreach to execute arbitrary code with the output data Some sinks are transactional and exactly once (e.g. files) spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/")
  • 11. Anatomy of a Streaming Query Processing Details Trigger: when to process data - Fixed interval micro-batches - As fast as possible micro-batches - Continuously (new in Spark 2.3) Checkpoint location: for tracking the progress of the query spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start()
  • 12. DataFrames, Datasets, SQL Logical Plan Read from Kafka Project device, signal Filter signal > 15 Write to Parquet Spark automatically streamifies! Spark SQL converts batch-like query to a series of incremental execution plans operating on new micro-batches of data Kafka Source Optimized Operator codegen, off- heap, etc. Parquet Sink Optimized Plan spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start() Series of Incremental Execution Plans process newdata t = 1 t = 2 t = 3 process newdata process newdata
  • 13. process newdata t = 1 t = 2 t = 3 process newdata process newdata Fault-tolerance with Checkpointing Checkpointing Saves processed offset info to stable storage Saved as JSON for forward-compatibility Allows recovery from any failure Can resume after limited changes to your streaming transformations (e.g. adding new filters to drop corrupted data, etc.) end-to-end exactly-once guarantees write ahead log
  • 14. Anatomy of a Streaming Query ETL Raw data from Kafka available as structured data in seconds, ready for querying spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start()
  • 15. Performance: Benchmark Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine, making it faster and cheaper [Chart: 40-core throughput in millions of records/s - Kafka Streams ~0.7M, Apache Flink ~33M, Structured Streaming ~65M, about 2x faster than Flink] More details in our blog post
  • 17. What is Stateless Stream Processing? Stateless streaming queries (e.g. ETL) process each record independently of other records df.select(from_json("json", schema).as("data")) .where("data.type = 'typeA'") Spark stateless streaming Every record is parsed into a structured form and then selected (or not) by the filter
  • 18. What is Stateful Stream Processing? Stateful streaming queries combine information from multiple records together df.select(from_json("json", schema).as("data")) .where("data.type = 'typeA'") .count() Spark stateful streaming state Count is the streaming state and every selected record increments the count State is the information that is maintained for future use
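To make the contrast concrete, here is a minimal sketch of such a stateful count as a complete streaming query. The broker address, topic name, JSON schema, and checkpoint path are assumptions added for illustration, not details from the deck.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("StatefulCountSketch").getOrCreate()
    import spark.implicits._

    // Assumed schema for the JSON payload
    val schema = new StructType().add("type", StringType).add("value", DoubleType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   // assumed broker
      .option("subscribe", "topic")                     // assumed topic
      .load()
      .selectExpr("cast(value as string) as json")
      .select(from_json($"json", schema).as("data"))    // stateless: parse each record
      .where("data.type = 'typeA'")                     // stateless: filter each record

    // Stateful: the running count per type is kept as state across micro-batches
    val counts = parsed.groupBy($"data.type").count()

    counts.writeStream
      .outputMode("update")                                            // emit updated counts every trigger
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/stateful-count") // assumed path
      .start()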
  • 19. Stateful Micro-Batch Processing State is versioned between micro-batches while the streaming query is running Each micro-batch reads the previous version of the state and updates it to a new version Versions are used for fault recovery [Diagram: micro-batch incremental execution at t = 1, 2, 3 - each micro-batch reads from the source, processes new data against the state, and writes to the sink]
  • 20. Distributed, Fault-tolerant State State data is distributed across executors State stored in the executor memory Micro-batch tasks update the state Changes are checkpointed with their version to the given checkpoint location (e.g. HDFS) Recovery from failure is automatic Exactly-once fault-tolerance guarantees! [Diagram: driver scheduling tasks on executors, each holding state, checkpointed to HDFS]
  • 23. Automatic State Cleanup - for SQL operations with well-defined semantics; state cleanup is automatic with watermarking because we precisely know when state data is not needed any more User-defined State Cleanup - for user-defined, arbitrary stateful operations; no automatic state cleanup, the user has to explicitly manage state
  • 24. Automatic State Cleanup: aggregations, deduplications, joins User-defined State Cleanup: streaming sessionization with mapGroupsWithState and flatMapGroupsWithState
  • 25. Rest of this talk Explore built-in stateful operations How to use watermarks to control state size How to build arbitrary stateful operations How to monitor and debug stateful queries
  • 27. Aggregation by key and/or time windows Aggregation by key only Aggregation by event time windows Aggregation by both Supports multiple aggregations, user-defined functions (UDAFs)! events .groupBy("key") .count() events .groupBy(window("timestamp","10 mins")) .avg("value") events .groupBy( col("key"), window("timestamp","10 mins")) .agg(avg("value"), corr("value"))
  • 28. Automatically handles Late Data [Figure: hourly window counts across successive triggers from 13:00 to 17:00; counts for earlier windows such as 12:00 - 13:00 keep being updated as late data arrives, with red marking state updated by late data] Keeping state allows late data to update counts of old windows But the size of the state increases indefinitely if old windows are not dropped
  • 29. Watermarking Watermark - moving threshold of how late data is expected to be and when to drop old state Trails behind the max event time seen by the engine Watermark delay = trailing gap [Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark is at 12:20 PM; data older than the watermark is not expected]
  • 30. Watermarking Data newer than the watermark may be late, but is allowed to aggregate Data older than the watermark is "too late" and dropped Windows older than the watermark are automatically deleted to limit state [Diagram: event-time axis with a watermark trailing the max event time by a delay of 10 mins; late data newer than the watermark is aggregated, data older than it is dropped]
  • 31. Watermarking parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() Used only in stateful operations Ignored in non-stateful streaming queries and batch queries [Diagram: watermark with a delay of 10 mins trailing the max event time; late data newer than the watermark is allowed to aggregate, data older than it is dropped]
  • 32. Watermarking parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() [Worked example: the system tracks the max observed event time (12:14); for the next trigger the watermark is updated to 12:14 - 10 min = 12:04 and state for windows older than 12:04 is deleted; a late record at 12:08 is newer than the watermark and is still counted, while data older than the watermark is ignored in the counts and its state dropped] More details in my blog post
  • 33. Watermarking Trade-off between lateness tolerance and state size: a smaller watermark delay means less late data is processed and less memory is consumed; a larger delay means more late data is processed and more memory is consumed
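Pulling the watermarking slides together, here is a hedged end-to-end sketch of the windowed count above as a runnable query. The Kafka source details, sink, trigger, and checkpoint path are assumptions added for completeness.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("WatermarkedCountsSketch").getOrCreate()
    import spark.implicits._

    // parsedData: a streaming DataFrame with an event-time column "timestamp";
    // here the Kafka source's own timestamp column stands in for a real event time.
    val parsedData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   // assumed
      .option("subscribe", "events")                    // assumed topic
      .load()
      .selectExpr("cast(value as string) as word", "timestamp")

    val windowedCounts = parsedData
      .withWatermark("timestamp", "10 minutes")         // tolerate data up to 10 minutes late
      .groupBy(window($"timestamp", "5 minutes"))       // 5-minute event-time windows
      .count()

    windowedCounts.writeStream
      .outputMode("update")                             // update old windows as late data arrives
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/windowed-counts")  // assumed
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()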
  • 35. Streaming Deduplication Drop duplicate records in a stream Specify columns which uniquely identify a record Spark SQL will store past unique column values as state and drop any record that matches the state userActions .dropDuplicates("uniqueRecordId")
  • 36. Streaming Deduplication with Watermark Timestamp as a unique column along with a watermark allows old values in state to be dropped Records older than the watermark delay are not going to get any further duplicates Timestamp must be the same for duplicated records userActions .withWatermark("timestamp") .dropDuplicates( "uniqueRecordId", "timestamp")
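A minimal sketch of watermarked deduplication follows; note that withWatermark takes an explicit delay, which the slide elides. The 10-minute delay, the rate test source, and the checkpoint path are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("DedupSketch").getOrCreate()

    // Assumed stand-in for the real userActions stream: the built-in "rate" test source
    // provides "timestamp" and "value" columns; "value" is renamed to act as the record ID.
    val userActions = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumnRenamed("value", "uniqueRecordId")

    val deduped = userActions
      .withWatermark("timestamp", "10 minutes")        // assumed delay; bounds how long past IDs are kept as state
      .dropDuplicates("uniqueRecordId", "timestamp")   // duplicates must carry the same timestamp

    deduped.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/dedup")  // assumed
      .start()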
  • 38. Streaming Joins Spark 2.0+ supports joins between streams and static datasets Spark 2.3+ supports joins between multiple streams Join (ad, impression) (ad, click) (ad, impression, click) Join stream of ad impressions with another stream of their corresponding user clicks Example: Ad Monetization
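The stream-static case mentioned here (supported since Spark 2.0) needs no watermark on the static side; a short sketch is below, with the metadata table path, topic, and column names all assumed for illustration.

    // Assuming spark is the active SparkSession, as in the earlier sketches.
    // Static side: an ad metadata table read as a normal batch DataFrame.
    val adMetadata = spark.read.parquet("/tables/ad_metadata")     // assumed path; must contain "adId"

    // Streaming side: ad impressions from Kafka.
    val impressions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")              // assumed
      .option("subscribe", "impressions")                          // assumed topic
      .load()
      .selectExpr("cast(value as string) as adId", "timestamp as impressionTime")

    // Each streaming impression row is enriched with the static metadata for its ad.
    val enriched = impressions.join(adMetadata, "adId")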
  • 39. Streaming Joins Most of the time click events arrive after their impressions Sometimes, due to delays, impressions can arrive after clicks Each stream in a join needs to buffer past events as state for matching with future events of the other stream Join (ad, impression) (ad, click) (ad, impression, click) state state
  • 40. Simple Inner Join Inner join by ad ID column Need to buffer all past events as state, since a match can come on the other stream any time in the future To allow buffered events to be dropped, the query needs to provide more time constraints impressions.join( clicks, expr("clickAdId = impressionAdId") ) [Diagram: join of (ad, impression) and (ad, click) into (ad, impression, click), with unbounded (∞) state buffered on both sides]
  • 41. Inner Join + Time constraints + Watermarks Time constraints - let's assume: Impressions can be 2 hours late Clicks can be 3 hours late A click can occur within 1 hour after the corresponding impression val impressionsWithWatermark = impressions .withWatermark("impressionTime", "2 hours") val clicksWithWatermark = clicks .withWatermark("clickTime", "3 hours") impressionsWithWatermark.join( clicksWithWatermark, expr(""" clickAdId = impressionAdId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour """ ))
  • 42. Inner Join + Time constraints + Watermarks impressionsWithWatermark.join( clicksWithWatermark, expr(""" clickAdId = impressionAdId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour """ )) Spark calculates - impressions need to be buffered for 4 hours (a 3-hour-late click may match with an impression received 4 hours ago) - clicks need to be buffered for 2 hours (a 2-hour-late impression may match with a click received 2 hours ago) Spark drops events older than these thresholds [Diagram: impressions up to 2 hrs late kept as 4-hour state, clicks up to 3 hrs late kept as 2-hour state]
  • 43. Outer Join + Time constraints + Watermarks Left and right outer joins are allowed only with time constraints and watermarks Needed for correctness: Spark must output nulls when an event cannot get any future match Note: null outputs are delayed, as Spark has to wait for some time to be sure that there cannot be any match impressionsWithWatermark.join( clicksWithWatermark, expr(""" clickAdId = impressionAdId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour """ ), joinType = "leftOuter" ) Can be "inner" (default) / "leftOuter" / "rightOuter"
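Below is a hedged, self-contained sketch of the impression/click join built up over the last few slides, using the left outer join type. The broker address, topic names, payload layout, and output paths are assumptions for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder.appName("StreamStreamJoinSketch").getOrCreate()

    def readTopic(topic: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   // assumed
      .option("subscribe", topic)
      .load()

    // Assumed payload layout: the Kafka value holds the ad ID, the Kafka timestamp is the event time.
    val impressions = readTopic("impressions")
      .selectExpr("cast(value as string) as impressionAdId", "timestamp as impressionTime")
    val clicks = readTopic("clicks")
      .selectExpr("cast(value as string) as clickAdId", "timestamp as clickTime")

    val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
    val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

    // Left outer: impressions with no click within 1 hour are eventually emitted with null click columns.
    val joined = impressionsWithWatermark.join(
      clicksWithWatermark,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
      """),
      "leftOuter")

    joined.writeStream
      .format("parquet")
      .option("path", "/joined/")                              // assumed output path
      .option("checkpointLocation", "/tmp/checkpoints/join")   // assumed
      .start()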
  • 45. Arbitrary Stateful Operations Many use cases require more complicated logic than SQL ops Example: Tracking user activity on your product Input: User actions (login, clicks, logout, …) Output: Latest user status (online, active, inactive, …) Solution: MapGroupsWithState General API for per-key user-defined stateful processing Since Spark 2.2, for Scala and Java only MapGroupsWithState / FlatMapGroupsWithState
  • 46. No automatic state clean up and dropping of late data Adding watermark does not automatically manage late and state data Explicit state clean up by the user More powerful + efficient than DStream's mapWithState and updateStateByKey MapGroupsWithState / FlatMapGroupsWithState
  • 47. MapGroupsWithState - How to use? 1. Define the data structures Input event: UserAction State data: UserStatus Output event: UserStatus (can be different from state) case class UserAction( userId: String, action: String) case class UserStatus( userId: String, active: Boolean) MapGroupsWithState
  • 48. MapGroupsWithState - How to use? 2. Define function to update state of each grouping key using the new data Input Grouping key: userId New data: new user actions Previous state: previous status of this user case class UserAction( userId: String, action: String) case class UserStatus( userId: String, active: Boolean) def updateState( userId: String, actions: Iterator[UserAction], state: GroupState[UserStatus]):UserStatus = { }
  • 49. MapGroupsWithState - How to use? 2. Define function to update state of each grouping key using the new data Body Get previous user status Update user status with actions Update state with latest user status Return the status def updateState( userId: String, actions: Iterator[UserAction], state: GroupState[UserStatus]):UserStatus = { } val prevStatus = state.getOption.getOrElse { new UserStatus() } actions.foreach { action => prevStatus.updateWith(action) } state.update(prevStatus) return prevStatus
  • 50. MapGroupsWithState - How to use? 3. Use the user-defined function on a grouped Dataset Works with both batch and streaming queries In batch query, the function is called only once per group with no prior state def updateState( userId: String, actions: Iterator[UserAction], state: GroupState[UserStatus]):UserStatus = { } // process actions, update and return status userActions .groupByKey(_.userId) .mapGroupsWithState(updateState)
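Stitching slides 47-50 into one piece, here is a hedged sketch of the whole pattern. The state-update rule (logout marks the user inactive), the Kafka source layout, and the checkpoint path are assumptions; the deck's updateWith helper is replaced by inline logic.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class UserAction(userId: String, action: String)
    case class UserStatus(userId: String, active: Boolean)

    // Assumed rule: "logout" marks the user inactive, any other action marks them active.
    def updateState(
        userId: String,
        actions: Iterator[UserAction],
        state: GroupState[UserStatus]): UserStatus = {
      var status = state.getOption.getOrElse(UserStatus(userId, active = false))
      actions.foreach { a => status = status.copy(active = a.action != "logout") }
      state.update(status)    // persist the latest status for the next micro-batch
      status
    }

    val spark = SparkSession.builder.appName("MapGroupsWithStateSketch").getOrCreate()
    import spark.implicits._

    // Assumed source layout: Kafka key = userId, value = action name.
    val userActions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")          // assumed
      .option("subscribe", "user-actions")                     // assumed topic
      .load()
      .selectExpr("cast(key as string) as userId", "cast(value as string) as action")
      .as[UserAction]

    val latestStatuses = userActions
      .groupByKey(_.userId)
      .mapGroupsWithState[UserStatus, UserStatus](GroupStateTimeout.NoTimeout)(updateState _)

    latestStatuses.writeStream
      .outputMode("update")                                    // mapGroupsWithState queries run in update mode
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/user-status")  // assumed
      .start()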
  • 51. Timeouts Example: Mark a user as inactive when there are no actions for 1 hour Timeouts: When a group does not get any event for a while, the function is called for that group with an empty iterator Must specify a global timeout type, and set the per-group timeout timestamp/duration Ignored in batch queries userActions.withWatermark("timestamp") .groupByKey(_.userId) .mapGroupsWithState (timeoutConf)(updateState) Timeout types: EventTimeTimeout, ProcessingTimeTimeout, NoTimeout (default)
  • 52. Event-time Timeout - How to use? 1. Enable EventTimeTimeout in mapGroupsWithState 2. Enable watermarking 3. Update the mapping function Every time the function is called, set the timeout timestamp using the max seen event timestamp + timeout duration Update state when the timeout occurs userActions .withWatermark("timestamp") .groupByKey(_.userId) .mapGroupsWithState ( EventTimeTimeout )(updateState) def updateState(...): UserStatus = { if (!state.hasTimedOut) { // track maxActionTimestamp while // processing actions and updating state state.setTimeoutTimestamp( maxActionTimestamp, "1 hour") } else { // handle timeout userStatus.handleTimeout() state.remove() } // return user status }
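Filling in the elided pieces of this slide, here is a hedged sketch of an update function with an event-time timeout, reusing the UserStatus case class from the earlier sketch. The TimedUserAction type, the 1-hour and 10-minute values, and the inactive-on-timeout behaviour are assumptions standing in for the deck's helpers.

    import java.sql.Timestamp
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    // Input events now carry an event-time column so the watermark and timeout are event-time based.
    case class TimedUserAction(userId: String, action: String, timestamp: Timestamp)

    def updateStateWithTimeout(
        userId: String,
        actions: Iterator[TimedUserAction],
        state: GroupState[UserStatus]): UserStatus = {
      if (!state.hasTimedOut) {
        var status = state.getOption.getOrElse(UserStatus(userId, active = false))
        var maxActionTimestamp = 0L
        actions.foreach { a =>
          status = status.copy(active = a.action != "logout")
          maxActionTimestamp = math.max(maxActionTimestamp, a.timestamp.getTime)
        }
        state.update(status)
        // Time out 1 hour (assumed) after the latest event seen for this user.
        state.setTimeoutTimestamp(maxActionTimestamp, "1 hour")
        status
      } else {
        // The watermark passed the timeout timestamp with no new events: mark inactive, drop state.
        val status = state.get.copy(active = false)
        state.remove()
        status
      }
    }

    // userTimedActions is assumed to be a Dataset[TimedUserAction] parsed as in the earlier sketch.
    val statuses = userTimedActions
      .withWatermark("timestamp", "10 minutes")   // a watermark is required for EventTimeTimeout
      .groupByKey(_.userId)
      .mapGroupsWithState[UserStatus, UserStatus](GroupStateTimeout.EventTimeTimeout)(
        updateStateWithTimeout _)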
  • 53. Event-time Timeout - When? Watermark is calculated with max event time across all groups For a specific group, if there is no event till watermark exceeds the timeout timestamp, Then Function is called with an empty iterator, and hasTimedOut = true Else Function is called with new data, and timeout is disabled Needs to explicitly set timeout timestamp every time
  • 54. Processing-time Timeout Instead of setting timeout timestamp, function sets timeout duration (in terms of wall-clock-time) to wait before timing out Independent of watermarks Note, query downtimes will cause lots of timeouts after recovery def updateState(...): UserStatus = { if (!state.hasTimedOut) { // handle new data state.setTimeoutDuration("1 hour") } else { // handle timeout } return userStatus } userActions .groupByKey(_.userId) .mapGroupsWithState (ProcessingTimeTimeout)(updateState)
  • 55. FlatMapGroupsWithState More general version where the function can return any number of events, possibly none at all Example: instead of returning user status, want to return specific actions that are significant based on the history def updateState( userId: String, actions: Iterator[UserAction], state: GroupState[UserStatus]): Iterator[SpecialUserAction] = { } userActions .groupByKey(_.userId) .flatMapGroupsWithState (outputMode, timeoutConf) (updateState)
  • 56. userActions .groupByKey(_.userId) .flatMapGroupsWithState (outputMode, timeoutConf) (updateState) Function Output Mode Function output mode* gives Spark insights into the output from this opaque function Update Mode - Output events are key-value pairs, each output is updating the value of a key in the result table Append Mode - Output events are independent rows that are appended to the result table Allows the Spark SQL planner to correctly compose flatMapGroupsWithState with other operations *Not to be confused with the output mode of the query
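As a hedged illustration of the pattern on the last two slides, the sketch below emits zero or more "significant" events per user, using Append as the function output mode. The significance rule (a user turning active) and the SpecialUserAction type are assumptions, reusing the assumed classes from the earlier sketches.

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
    import scala.collection.mutable.ArrayBuffer

    case class SpecialUserAction(userId: String, action: String)

    // Assumed rule: emit an event only when the user goes from inactive to active.
    def detectSignificant(
        userId: String,
        actions: Iterator[UserAction],
        state: GroupState[UserStatus]): Iterator[SpecialUserAction] = {
      var current = state.getOption.getOrElse(UserStatus(userId, active = false))
      val emitted = ArrayBuffer[SpecialUserAction]()
      actions.foreach { a =>
        val next = current.copy(active = a.action != "logout")
        if (!current.active && next.active) emitted += SpecialUserAction(userId, a.action)
        current = next
      }
      state.update(current)
      emitted.iterator
    }

    // userActions is the assumed Dataset[UserAction] from the mapGroupsWithState sketch.
    val significant = userActions
      .groupByKey(_.userId)
      .flatMapGroupsWithState[UserStatus, SpecialUserAction](
        OutputMode.Append, GroupStateTimeout.NoTimeout)(detectSignificant _)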
  • 58. Optimizing Query State Set # shuffle partitions to 1-3 times the number of cores Too low = not all cores will be used → lower throughput Too high = cost of writing state to HDFS increases → higher latency Total size of state per worker Larger state leads to higher overheads of snapshotting, JVM GC pauses, etc.
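The shuffle-partition advice maps onto a single configuration; the value below is an assumed example for a 16-core cluster, not a recommendation from the deck. Since the state is partitioned by this number and recorded in the checkpoint, it should be set before the query is first started.

    // Example only: on a cluster with 16 total cores, 1-3x cores suggests roughly 16-48 partitions.
    spark.conf.set("spark.sql.shuffle.partitions", "32")   // assumed value; set before the first start()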
  • 59. Monitoring the state of Query State Get current state metrics using the last progress of the query Total number of rows in state Total memory consumed (approx.) Get it asynchronously through StreamingQueryListener API val progress = query.lastProgress print(progress.json) { ... "stateOperators" : [ { "numRowsTotal" : 660000, "memoryUsedBytes" : 120571087 ... } ], } new StreamingQueryListener { ... def onQueryProgress( event: QueryProgressEvent) }
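Combining the two snippets on this slide, here is a short sketch of both ways to read the state metrics; query is assumed to be the StreamingQuery handle returned by start().

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Synchronous: poll the last reported progress of a running query.
    val progress = query.lastProgress
    if (progress != null) {
      progress.stateOperators.foreach { op =>
        println(s"state rows: ${op.numRowsTotal}, state memory: ${op.memoryUsedBytes} bytes")
      }
    }

    // Asynchronous: receive every progress update through a listener.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        println(event.progress.json)   // includes the stateOperators metrics shown above
      }
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
    })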
  • 60. Monitoring the state of Query State Databricks Notebooks integrated with Structured Streaming Shows size of state along with other processing metrics Data rates Batch durations # state keys
  • 61. Debugging Query State SQL metrics in the Spark UI (SQL tab, DAG view) expose more operator-specific stats Answer questions like - Is the memory usage skewed? - Is removing rows slow? - Is writing checkpoints slow?
  • 62. Managing Very Large State State data kept on JVM heap Can have GC issues with millions of state keys per worker Limits depend on the size and complexity of state data structures Latency spikes of > 20s due to GC
  • 63. Managing Very Large State with RocksDB In Databricks Runtime, you can store state locally in RocksDB Avoids the JVM heap, no GC issues with 100 million state keys per worker Local RocksDB snapshot files automatically checkpointed to HDFS Same exactly-once fault-tolerance guarantees Latency capped at 10s [More info in Databricks Docs]
  • 64. Continuous Processing Millisecond-level latencies with Continuous Processing engine Experimental release in Apache Spark 2.3 Shuffle and stateful operator support coming in future releases More info: Talk today! Blog post
  • 65. More Info Structured Streaming Docs http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html https://docs.databricks.com/spark/latest/structured-streaming/index.html Databricks blog posts for more focused discussions https://databricks.com/blog/category/engineering/streaming My previous talk on the basics of Structured Streaming https://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming
  • 66. Questions? Office hours today at 2:00 PM at Databricks Booth
  • 67. UNIFIED ANALYTICS PLATFORM Try Apache Spark in Databricks! • Collaborative cloud environment • Free version (community edition) DATABRICKS RUNTIME 3.0 • Apache Spark - optimized for the cloud • Caching and optimization layer - DBIO • Enterprise security - DBES Try for free today databricks.com