SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
Solving sessionization
problem with Apache
Spark batch and
streaming processing
Bartosz Konieczny
@waitingforcode1
About me
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode #github.com/bartosz25
#canalplus #Paris
2
3
Sessions
"user activity followed by a
closing action or a period of
inactivity"
4
5
© https://pixabay.com/users/maxmann-665103/ from https://pixabay.com
Batch architecture
6
data producer
sync consumer input logs
(DFS)
input logs
(streaming broker)
orchestrator
sessions
generator
<triggers>
previous
window raw
sessions
(DFS)
output
sessions
(DFS)
Streaming architecture
7
data producer
sessions
generator
output
sessions
(DFS)
metadata state
<uses>
checkpoint location
input logs
(streaming broker)
Batch
implementation
The code
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound)).cache()
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
9
Full outer join
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
10
processing logic
previous
window
active
sessions
new input
logs
full outer join
Watermark simulation
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
case class SessionIntermediaryState(userId:
Long, … expirationTimeMillisUtc: Long,
isActive: Boolean)
11
Save modes
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
SaveMode.Append ⇒
duplicates & invalid results
(e.g. multiplied revenue!)
SaveMode.ErrorIfExists ⇒
failures & maintenance
burden
SaveMode.Ignore ⇒ no
data & old data present in
case of reprocessing
SaveMode.Overwrite ⇒
always fresh data & easy
maintenance
12
Streaming
implementation
The code
val writeQuery = query.writeStream.outputMode(OutputMode.Update())
.option("checkpointLocation", s"s3://my-checkpoint-bucket")
.foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
})
val dataFrame = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaConfiguration.broker).option(...) .load()
val query = dataFrame.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", Visit.Schema).as("data"))
.select($"data.*").withWatermark("event_time", "3 minutes")
.groupByKey(row => row.getAs[Long]("user_id"))
.mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())
(mapStreamingLogsToSessions(sessionTimeout))
watermark - late events & state
expiration
stateful processing - sessions
generation
checkpoint - fault-tolerance
14
Checkpoint - fault-tolerance
load state
for t0
query
load offsets
to process &
write them
for t1
query
process data
write
processed
offsets
write state
checkpoint location
state store offset log commit log
val writeQuery = query.writeStream.outputMode(OutputMode.Update())
.option("checkpointLocation", s"s3://sessionization-demo/checkpoint")
.foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
})
.start()
15
Checkpoint - fault-tolerance
load state
for t1
query
load offsets
to process &
write them
for t1
query
process data
confirm
processed
offsets &
next
watermark
commit state
t2
partition-based
checkpoint location
state store offset log commit log
16
Stateful processing
update
remove
get
getput,remove
write update
finalize file
make snapshot
recover state
def mapStreamingLogsToSessions(timeoutDurationMs: Long)(key: Long, logs: Iterator[Row],
currentState: GroupState[SessionIntermediaryState]): SessionIntermediaryState = {
if (currentState.hasTimedOut) {
val expiredState = currentState.get.expire
currentState.remove()
expiredState
} else {
val newState = currentState.getOption.map(state => state.updateWithNewLogs(logs, timeoutDurationMs))
.getOrElse(SessionIntermediaryState.createNew(logs, timeoutDurationMs))
currentState.update(newState)
currentState.setTimeoutTimestamp(currentState.getCurrentWatermarkMs() + timeoutDurationMs)
currentState.get
}
}
17
Stateful processing
update
remove
get
getput,remove
- write update
- finalize file
- make snapshot
recover state
18
.mapGroupsWithState(...)
state store
TreeMap[Long,
ConcurrentHashMap[UnsafeRow,
UnsafeRow]
]
in-memory storage for the most
recent versions
1.delta
2.delta
3.snapshot
checkpoint
location
Watermark
val sessionTimeout = TimeUnit.MINUTES.toMillis(5)
val query = dataFrame.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", Visit.Schema).as("data"))
.select($"data.*")
.withWatermark("event_time", "3 minutes")
.groupByKey(row => row.getAs[Long]("user_id"))
.mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())
(Mapping.mapStreamingLogsToSessions(sessionTimeout))
19
Watermark - late events
on-time
event
late
event
20
.mapGroupsWithState(...)
Watermark - expired state
State representation [simplified]
{value, TTL configuration}
Algorithm:
1. Update all states with new data → eventually extend TTL
2. Retrieve TTL configuration for the query → here: watermark
3. Retrieve all states that expired → no new data in this query & TTL expired
4. Call mapGroupsWithState on it with hasTimedOut param = true & no new data
(Iterator.empty)
// full implementation: org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec.InputProcessor
21
Data reprocessing
Batch
reschedule your job
© https://pics.me.me/just-one-click-and-the-zoo-is-mine-8769663.png
Streaming
State store
1. Restored state is the most recent snapshot
2. Restored state is not the most recent snapshot but a snapshot exists
3. Restored state is not the most recent snapshot and a snapshot doesn't exist
27
1.delta 3.snapshot2.delta
1.delta 3.snapshot2.delta 4.delta
1.delta 3.delta2.delta 4.delta
State store configuration
spark.sql.streaming.stateStore:
→ .minDeltasForSnapshot
→ .maintenanceInterval
28
spark.sql.streaming:
→ .maxBatchesToRetainInMemory
Checkpoint configuration
spark.sql.streaming.minBatchesToRetain
29
Few takeaways
● yet another TDD acronym - Trade-Off Driven Development
○ simplicity for latency
○ simplicity for accuracy
○ scaling for latency
● AWS
○ Kinesis - short retention period = reprocessing boundary, connector
○ S3 - trade reliability for performance
○ EMR - transient cluster
○ Redshift - COPY
● Apache Spark
○ watermarks everywhere - batch simulation
○ state store configuration
○ restore mechanism
○ overwrite idempotent mode
30
Resources
● https://github.com/bartosz25/sessionization-de
mo
● https://www.waitingforcode.com/tags/spark-ai-s
ummit-europe-2019-articles
31
Thank you!Bartosz Konieczny
@waitingforcode / github.com/bartosz25 / waitingforcode.com
Canal+
@canaltechteam

Contenu connexe

Tendances

Redis Introduction
Redis IntroductionRedis Introduction
Redis IntroductionAlex Su
 
Distributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedDistributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedYugabyte
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksEDB
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
InnoDB Performance Optimisation
InnoDB Performance OptimisationInnoDB Performance Optimisation
InnoDB Performance OptimisationMydbops
 
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIAltinity Ltd
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redisTanu Siwag
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼NeoClova
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemFrederic Descamps
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBMongoDB
 
MariaDB Administrator 교육
MariaDB Administrator 교육 MariaDB Administrator 교육
MariaDB Administrator 교육 Sangmo Kim
 
Caching solutions with Redis
Caching solutions   with RedisCaching solutions   with Redis
Caching solutions with RedisGeorge Platon
 
PostgreSQL 12 New Features with Examples (English) GA
PostgreSQL 12 New Features with Examples (English) GAPostgreSQL 12 New Features with Examples (English) GA
PostgreSQL 12 New Features with Examples (English) GANoriyoshi Shinoda
 

Tendances (20)

Redis Introduction
Redis IntroductionRedis Introduction
Redis Introduction
 
Query logging with proxysql
Query logging with proxysqlQuery logging with proxysql
Query logging with proxysql
 
Distributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedDistributed SQL Databases Deconstructed
Distributed SQL Databases Deconstructed
 
redis basics
redis basicsredis basics
redis basics
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
The PostgreSQL Query Planner
The PostgreSQL Query PlannerThe PostgreSQL Query Planner
The PostgreSQL Query Planner
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
InnoDB Performance Optimisation
InnoDB Performance OptimisationInnoDB Performance Optimisation
InnoDB Performance Optimisation
 
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
 
kafka
kafkakafka
kafka
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Get to know PostgreSQL!
Get to know PostgreSQL!Get to know PostgreSQL!
Get to know PostgreSQL!
 
MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database System
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
MariaDB Administrator 교육
MariaDB Administrator 교육 MariaDB Administrator 교육
MariaDB Administrator 교육
 
Caching solutions with Redis
Caching solutions   with RedisCaching solutions   with Redis
Caching solutions with Redis
 
PostgreSQL 12 New Features with Examples (English) GA
PostgreSQL 12 New Features with Examples (English) GAPostgreSQL 12 New Features with Examples (English) GA
PostgreSQL 12 New Features with Examples (English) GA
 

Similaire à Using Apache Spark to Solve Sessionization Problem in Batch and Streaming

Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data productIgor Lozynskyi
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19confluent
 
Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...Docker, Inc.
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...Databricks
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamImre Nagi
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
A new kind of BPM with Activiti
A new kind of BPM with ActivitiA new kind of BPM with Activiti
A new kind of BPM with ActivitiAlfresco Software
 
Getting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsGetting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsAmazon Web Services
 
Reactive programming every day
Reactive programming every dayReactive programming every day
Reactive programming every dayVadym Khondar
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 
こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門tanacasino
 
Introducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event ProcessorIntroducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event ProcessorWSO2
 
Scalable Angular 2 Application Architecture
Scalable Angular 2 Application ArchitectureScalable Angular 2 Application Architecture
Scalable Angular 2 Application ArchitectureFDConf
 
VPN Access Runbook
VPN Access RunbookVPN Access Runbook
VPN Access RunbookTaha Shakeel
 
Event Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BEEvent Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BEAndrzej Ludwikowski
 

Similaire à Using Apache Spark to Solve Sessionization Problem in Batch and Streaming (20)

Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19
 
Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
A new kind of BPM with Activiti
A new kind of BPM with ActivitiA new kind of BPM with Activiti
A new kind of BPM with Activiti
 
Getting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsGetting Started with Real-Time Analytics
Getting Started with Real-Time Analytics
 
Reactive programming every day
Reactive programming every dayReactive programming every day
Reactive programming every day
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
JavaZone 2014 - goto java;
JavaZone 2014 - goto java;JavaZone 2014 - goto java;
JavaZone 2014 - goto java;
 
こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門
 
Introducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event ProcessorIntroducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event Processor
 
Scalable Angular 2 Application Architecture
Scalable Angular 2 Application ArchitectureScalable Angular 2 Application Architecture
Scalable Angular 2 Application Architecture
 
VPN Access Runbook
VPN Access RunbookVPN Access Runbook
VPN Access Runbook
 
Event Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BEEvent Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BE
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 

Dernier (20)

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 

Using Apache Spark to Solve Sessionization Problem in Batch and Streaming

  • 1. Solving sessionization problem with Apache Spark batch and streaming processing Bartosz Konieczny @waitingforcode1
  • 2. About me Bartosz Konieczny #dataEngineer #ApacheSparkEnthusiast #AWSuser #waitingforcode.com #becomedataengineer.com #@waitingforcode #github.com/bartosz25 #canalplus #Paris 2
  • 3. 3
  • 4. Sessions "user activity followed by a closing action or a period of inactivity" 4
  • 6. Batch architecture 6 data producer sync consumer input logs (DFS) input logs (streaming broker) orchestrator sessions generator <triggers> previous window raw sessions (DFS) output sessions (DFS)
  • 7. Streaming architecture 7 data producer sessions generator output sessions (DFS) metadata state <uses> checkpoint location input logs (streaming broker)
  • 9. The code val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)).cache() joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) 9
  • 10. Full outer join val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) 10 processing logic previous window active sessions new input logs full outer join
  • 11. Watermark simulation val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) case class SessionIntermediaryState(userId: Long, … expirationTimeMillisUtc: Long, isActive: Boolean) 11
  • 12. Save modes val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) SaveMode.Append ⇒ duplicates & invalid results (e.g. multiplied revenue!) SaveMode.ErrorIfExists ⇒ failures & maintenance burden SaveMode.Ignore ⇒ no data & old data present in case of reprocessing SaveMode.Overwrite ⇒ always fresh data & easy maintenance 12
  • 14. The code val writeQuery = query.writeStream.outputMode(OutputMode.Update()) .option("checkpointLocation", s"s3://my-checkpoint-bucket") .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => { BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}") }) val dataFrame = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaConfiguration.broker).option(...) .load() val query = dataFrame.selectExpr("CAST(value AS STRING)") .select(functions.from_json($"value", Visit.Schema).as("data")) .select($"data.*").withWatermark("event_time", "3 minutes") .groupByKey(row => row.getAs[Long]("user_id")) .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout()) (mapStreamingLogsToSessions(sessionTimeout)) watermark - late events & state expiration stateful processing - sessions generation checkpoint - fault-tolerance 14
  • 15. Checkpoint - fault-tolerance load state for t0 query load offsets to process & write them for t1 query process data write processed offsets write state checkpoint location state store offset log commit log val writeQuery = query.writeStream.outputMode(OutputMode.Update()) .option("checkpointLocation", s"s3://sessionization-demo/checkpoint") .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => { BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}") }) .start() 15
  • 16. Checkpoint - fault-tolerance load state for t1 query load offsets to process & write them for t1 query process data confirm processed offsets & next watermark commit state t2 partition-based checkpoint location state store offset log commit log 16
  • 17. Stateful processing update remove get getput,remove write update finalize file make snapshot recover state def mapStreamingLogsToSessions(timeoutDurationMs: Long)(key: Long, logs: Iterator[Row], currentState: GroupState[SessionIntermediaryState]): SessionIntermediaryState = { if (currentState.hasTimedOut) { val expiredState = currentState.get.expire currentState.remove() expiredState } else { val newState = currentState.getOption.map(state => state.updateWithNewLogs(logs, timeoutDurationMs)) .getOrElse(SessionIntermediaryState.createNew(logs, timeoutDurationMs)) currentState.update(newState) currentState.setTimeoutTimestamp(currentState.getCurrentWatermarkMs() + timeoutDurationMs) currentState.get } } 17
  • 18. Stateful processing update remove get getput,remove - write update - finalize file - make snapshot recover state 18 .mapGroupsWithState(...) state store TreeMap[Long, ConcurrentHashMap[UnsafeRow, UnsafeRow] ] in-memory storage for the most recent versions 1.delta 2.delta 3.snapshot checkpoint location
  • 19. Watermark val sessionTimeout = TimeUnit.MINUTES.toMillis(5) val query = dataFrame.selectExpr("CAST(value AS STRING)") .select(functions.from_json($"value", Visit.Schema).as("data")) .select($"data.*") .withWatermark("event_time", "3 minutes") .groupByKey(row => row.getAs[Long]("user_id")) .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout()) (Mapping.mapStreamingLogsToSessions(sessionTimeout)) 19
  • 20. Watermark - late events on-time event late event 20 .mapGroupsWithState(...)
  • 21. Watermark - expired state State representation [simplified] {value, TTL configuration} Algorithm: 1. Update all states with new data → eventually extend TTL 2. Retrieve TTL configuration for the query → here: watermark 3. Retrieve all states that expired → no new data in this query & TTL expired 4. Call mapGroupsWithState on it with hasTimedOut param = true & no new data (Iterator.empty) // full implementation: org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec.InputProcessor 21
  • 23. Batch
  • 24. reschedule your job © https://pics.me.me/just-one-click-and-the-zoo-is-mine-8769663.png
  • 26.
  • 27. State store 1. Restored state is the most recent snapshot 2. Restored state is not the most recent snapshot but a snapshot exists 3. Restored state is not the most recent snapshot and a snapshot doesn't exist 27 1.delta 3.snapshot2.delta 1.delta 3.snapshot2.delta 4.delta 1.delta 3.delta2.delta 4.delta
  • 28. State store configuration spark.sql.streaming.stateStore: → .minDeltasForSnapshot → .maintenanceInterval 28 spark.sql.streaming: → .maxBatchesToRetainInMemory
  • 30. Few takeaways ● yet another TDD acronym - Trade-Off Driven Development ○ simplicity for latency ○ simplicity for accuracy ○ scaling for latency ● AWS ○ Kinesis - short retention period = reprocessing boundary, connector ○ S3 - trade reliability for performance ○ EMR - transient cluster ○ Redshift - COPY ● Apache Spark ○ watermarks everywhere - batch simulation ○ state store configuration ○ restore mechanism ○ overwrite idempotent mode 30
  • 32. Thank you!Bartosz Konieczny @waitingforcode / github.com/bartosz25 / waitingforcode.com Canal+ @canaltechteam