Understanding Time in
Structured Streaming
Time and Window API
https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/com/madhukaraphatak/examples/sparktwo/streaming
● Madhukara Phatak
● Team Lead at Tellius and part-time consultant at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Evolution of Time in Stream Processing
● Introduction to Structured Streaming
● Different Time Abstractions
● Window API
● Emulating Processing Time
● Working With Ingestion Time
● Event Time Abstraction
● Watermarks
● Beyond Time Windows
Evolution of Time in Stream
Processing
Time is King
● Time plays a major role in stream processing
● Latency dictates the kind of operations users want to do
● Window time dictates the state users want to maintain in the stream processor
● Batch time dictates the rate at which users want to process
● Most of the business questions asked in stream processing are also time based
View of Time in Stream Processing
● Most early-generation stream processing systems were optimized for latency
● Latency differentiated batch processing from stream processing
● Latency informed the window time and batch time
● So many early-generation stream processing systems had only one concept of time
● That's not good enough for new-generation systems
Need for different time abstractions
● In a streaming system, there is a
○ Source - system from where events are generated, like sensors etc.
○ Ingestion System - temporary storage, like Kafka
○ Processing System - Structured Streaming
● Each of these systems has its own time
● Typically users want to use a different system's time to do analysis rather than depending on the processing system
Different Time Abstractions
Process Time
● Time is tracked using a clock run by the processing engine
● Default abstraction in most stream processing engines, like the DStream API
● Last 10 seconds means the records that arrived in the last 10 seconds at the processor
● Easy to implement in the framework but hard to reason about for application developers
Event Time
● Event time is the birth time of an event at the source
● Event time is the time embedded in the data that is coming into the system
● Last 10 seconds means all the records generated in those 10 seconds at the source
● This time is independent of the clock kept by the processing engine
● Hard to implement in the framework but easy for application developers to reason about
Ingestion Time
● Ingestion time is the time when events are ingested into the system
● This time sits in between event time and processing time
● With processing time, each machine in the cluster assigns its own timestamp to track events
● With ingestion time, the timestamp is assigned at ingestion, so all the machines in the cluster have the exact same view
● Source dependent
Introduction to Structured Streaming
Structured Streaming
● Structured Streaming is a new streaming API introduced in Spark 2.0
● In structured streaming, a stream is modeled as an infinite table, aka an infinite Dataset
● As it uses the structured Dataset abstraction, it's called the structured streaming API
● All input sources, stream transformations and output sinks are modeled as Datasets
● Stream transformations are represented using SQL and the Dataset DSL
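The "stream as an infinite Dataset" idea can be sketched with a minimal word count, assuming a local SparkSession and a socket source on localhost:9999 (both assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StreamAsInfiniteTable")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// The stream is just a DataFrame; ordinary Dataset/SQL operations apply.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Complete mode re-emits the whole updated result table every batch.
counts.writeStream
  .format("console")
  .outputMode("complete")
  .start()
  .awaitTermination()
```

Note how the aggregation is expressed exactly as it would be on a batch Dataset; only readStream/writeStream differ.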
Advantage of Stream as infinite table
● Structured data analysis is first class, not layered over an unstructured runtime
● Easy to combine with batch data, as both use the same Dataset abstraction
● Can use the full power of the SQL language to express stateful stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
Window API
Window API from Spark SQL
● Supporting multiple time abstractions in a single API is tricky
● Flink's API makes it an environment setting to specify the application's default time abstraction:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
● Spark takes the route of an explicit time column, inspired by the Spark SQL API
● So the Spark API is optimised for event time by default
Window API
● The window API in structured streaming is part of the group by operation on Dataset/Dataframe
● val windowedCount = wordsDs
      .groupBy(
        window($"processingTime", "15 seconds")
      )
● It takes three parameters
○ Time Column - name of the time column
○ Window Time - how long the window is
○ Slide Time - an optional parameter to specify the sliding time
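The three parameters can be seen side by side in a short sketch, assuming a streaming Dataset `wordsDs` with a "timestamp" column (the column name is an assumption):

```scala
import org.apache.spark.sql.functions.window

// Tumbling window: time column + window duration only.
val tumbling = wordsDs
  .groupBy(window($"timestamp", "15 seconds"))
  .count()

// Sliding window: a 15-second window evaluated every 5 seconds,
// so a record can fall into multiple overlapping windows.
val sliding = wordsDs
  .groupBy(window($"timestamp", "15 seconds", "5 seconds"))
  .count()
```

When the slide time is omitted it defaults to the window time, which is what makes the first form a tumbling window.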
Window on Processing Time
● Useful for porting existing DStream code to the structured streaming API
● By default, the window API doesn't have support for processing time
● But we can emulate processing time by adding a time column derived from the processing time
● We will use the current_timestamp() API of Spark SQL to generate the time column
● Ex: ProcessingTimeWindow
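The emulation above can be sketched in a few lines, along the lines of the ProcessingTimeWindow example (the Dataset name `wordsDs` is an assumption):

```scala
import org.apache.spark.sql.functions.{current_timestamp, window}

// Derive a time column from the engine's own clock; windowing on it
// gives processing-time semantics even though the API is event-time based.
val withProcessingTime = wordsDs
  .withColumn("processingTime", current_timestamp())

val windowedCount = withProcessingTime
  .groupBy(window($"processingTime", "15 seconds"))
  .count()
```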
Window on Ingestion Time
● The ingestion time abstraction is useful when each batch of data is captured in real time but takes a considerable amount of time to process
● Ingestion time helps us get better results than processing time, without worrying about out-of-order events as in event time
● Ingestion time support depends on the source
● In our example, we will use the socket stream, which supports the same
● Ex: IngestionTimeWindow
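A sketch of ingestion time with the socket source: with includeTimestamp set, the source attaches the arrival timestamp to each line, and we window on that column (host and port are placeholder assumptions):

```scala
import org.apache.spark.sql.functions.window

val socketDf = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)   // source assigns the ingestion timestamp
  .load()                             // schema: value STRING, timestamp TIMESTAMP

// All machines see the same timestamp, assigned once at ingestion.
val windowedCount = socketDf
  .groupBy(window($"timestamp", "15 seconds"))
  .count()
```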
Window on Event Time
Importance of Event Time
● Event time is the source of truth for all the events
● As more and more stream processing is sensitive to the time of capture, event time plays a big role
● Event time helps developers correlate the events from various sources easily
● Correlation of events within and across sources helps developers build interesting streaming applications
● For this reason, event time is the default abstraction supported in structured streaming
Challenges of Event Time
● Event time is cool, but it complicates the design of stream processing frameworks
● Time passed at the source may be different from that in the processing engine
● How do you handle out-of-order events, and how long do you wait?
● How do you correlate events from sources which are running at their own speed?
● How do you reconcile event time with processing time?
Window on Event Time
● Event time will be a column embedded in the data itself
● The default window API is built for this use case
● Windowing on event time makes sure that even if there is latency in the network, we do the processing on the actual time at the source rather than at the speed of the processing engine
● In our example, we analyse Apple stock data which embeds the tick time
● Ex: EventTimeExample
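A sketch along the lines of EventTimeExample: the tick time embedded in the stock data serves as the event time column. The schema, column names and path here are illustrative assumptions, not the exact ones from the example:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.window

val stockSchema = new StructType()
  .add("time", TimestampType)     // tick time embedded in the data
  .add("symbol", StringType)
  .add("value", DoubleType)

val stockDs = spark.readStream
  .format("csv")
  .schema(stockSchema)
  .load("/path/to/stock/data")    // placeholder path

// Windows are computed on the embedded tick time, not on arrival time,
// so network latency does not shift records between windows.
val avgByWindow = stockDs
  .groupBy(window($"time", "10 seconds"))
  .avg("value")
```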
Late events
● Whenever we use event time, the challenge is how to handle late events
● By default, the event time window in Spark keeps windows forever; that means we can handle late events forever
● This is great from an application point of view, as we never miss any event at all
● Ex: EventTimeExample
Need of Watermarks
● Keeping windows around forever is great for logic, but problematic from a resources point of view
● As each window creates state in Spark, the state keeps expanding as time passes
● This kind of state keeps using more memory and makes recovery more difficult
● So we need a mechanism to restrict how long windows are kept around
● This mechanism is known as watermarks
Watermarks
● A watermark is a threshold which defines how long we wait for late events
● Using watermarks with event time makes sure Spark drops the window state once this threshold has passed at the source
● Spark will maintain state and allow late data to update the state for a window ending at time T until (max event time seen by the engine - late threshold > T)
● Ex: WaterMarkExample
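A sketch of the idea in WaterMarkExample: withWatermark declares the lateness threshold on the event time column before the aggregation, so Spark can drop the state of windows that fall behind the watermark (the Dataset and column names are assumptions carried over from the event time sketch):

```scala
import org.apache.spark.sql.functions.window

// Wait at most 30 seconds for late events; once
// (max event time seen - 30s) passes a window's end,
// that window's state is dropped and later arrivals are ignored.
val windowedCount = stockDs
  .withWatermark("time", "30 seconds")
  .groupBy(window($"time", "10 seconds"))
  .count()
```

Note that the watermark must be declared on the same column that is used in the window, and before the groupBy, for the state cleanup to take effect.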
Beyond Time Windows
Need of timeless windows
● Most streaming applications use time as the criterion for most of their analysis
● But there are use cases in streaming where the state is not bounded by time
● In these scenarios, we need a mechanism where we can define a window using the non-time part of the data
● In the DStream API, this was tricky; with structured streaming we can define it easily
Sessionization
● A session is a period of time that captures the different interactions of a user with an application
● In an online portal, a session normally starts when the user logs into the application and is torn down when the user logs out, or it expires when there is no activity for some time
● A session is not a purely time-based interaction, as different sessions can last for different durations
Session Window
● A session window is a window which allows us to group different records from the stream for a specific session
● The window starts when the session starts and is evaluated when the session ends
● The window also supports tracking multiple sessions at the same time
● Session windows are often used to analyze user behavior across multiple interactions bounded by a session
Implementing Session Window
Custom State Management
● There is no direct API to define non-time-based windows in Structured Streaming
● As a window is internally represented using state, we need to use custom state management for non-time windows
● In structured streaming, the mapGroupsWithState API allows developers to do custom state management
● This API behaves similarly to mapWithState from the DStream API
Modeling User Session
● case class Session(sessionId: String, value: Double, endSignal: Option[String])
● sessionId uniquely identifies the given session
● value is the data that is captured for the given session
● endSignal is the explicit signal from the application that the session has ended
● This endSignal can be a log out event or completion of a transaction etc.
● Timeout will not be part of the record
State Management Models
● Whenever we do custom state management, we need to define two different models
● One keeps around SessionInfo, which tracks the overall session state:
case class SessionInfo(totalSum: Double)
● The SessionUpdate model communicates updates for each batch:
case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)
State Management
● We group records by sessionId
● We use the mapGroupsWithState API to go through each record of the batch belonging to a specific session id
● For each group, we check from the data whether it has expired or not
● If expired, we use state.remove to drop the state
● If not expired, we call state.update to update the state with the new data
● Ex: SessionisationExample
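The steps above can be sketched as follows, reusing the Session, SessionInfo and SessionUpdate models from the previous slides (the Dataset name `sessionDs` and the exact bookkeeping are assumptions, not the SessionisationExample code verbatim):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

val updates = sessionDs                       // Dataset[Session]
  .groupByKey(_.sessionId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout()) {
    (id: String, events: Iterator[Session], state: GroupState[SessionInfo]) =>
      val batch = events.toSeq
      // accumulate this batch's values on top of any existing state
      val newSum = state.getOption.map(_.totalSum).getOrElse(0.0) +
        batch.map(_.value).sum
      if (batch.exists(_.endSignal.isDefined)) {
        state.remove()                        // session ended: drop state
        SessionUpdate(id, newSum, expired = true)
      } else {
        state.update(SessionInfo(newSum))     // session live: update state
        SessionUpdate(id, newSum, expired = false)
      }
  }
```

Because expiry is driven by endSignal in the data rather than by the clock, this window is bounded by the session, not by time.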
References
● http://blog.madhukaraphatak.com/categories/introduction-structured-streaming/
● https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
● https://flink.apache.org/news/2016/05/24/stream-sql.html
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 

Process Time
● Time is tracked using a clock run by the processing engine
● Default abstraction in most stream processing engines, such as the DStream API
● “Last 10 seconds” means the records that arrived at the processing engine in the last 10 seconds
● Easy to implement in the framework, but hard for application developers to reason about
Event Time
● Event time is the birth time of an event at the source
● Event time is embedded in the data that is coming into the system
● “Last 10 seconds” means all the records generated in those 10 seconds at the source
● This time is independent of the clock kept by the processing engine
● Hard to implement in the framework, but easy for application developers to reason about
Ingestion Time
● Ingestion time is the time when events are ingested into the system
● This time sits in between event time and processing time
● With processing time, each machine in the cluster uses its own clock to timestamp events
● With ingestion time, the timestamp is assigned at ingestion, so all the machines in the cluster have exactly the same view
● Source dependent
Structured Streaming
● Structured Streaming is a new streaming API introduced in Spark 2.0
● In Structured Streaming, a stream is modeled as an infinite table, aka an infinite Dataset
● As we are using a structured abstraction, it’s called the Structured Streaming API
● All input sources, stream transformations and output sinks are modeled as Datasets
● Stream transformations are represented using SQL and the Dataset DSL
Advantage of Stream as Infinite Table
● Structured data analysis is first class, not layered over an unstructured runtime
● Easy to combine with batch data, as both use the same Dataset abstraction
● Can use the full power of the SQL language to express stateful stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
Window API from Spark SQL
● Supporting multiple time abstractions in a single API is tricky
● Flink makes it an environment setting to specify the application’s default time abstraction:
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
● Spark takes the route of an explicit time column, inspired by the Spark SQL API
● So the Spark API is optimised for event time by default
Window API
● The window API in Structured Streaming is part of the group by operation on a Dataset/DataFrame:
  val windowedCount = wordsDs
    .groupBy(
      window($"processingTime", "15 seconds")
    ).count()
● It takes three parameters
  ○ Time Column - the name of the time column
  ○ Window Time - how long the window is
  ○ Slide Time - an optional parameter to specify the sliding time
Window on Processing Time
● Useful for porting existing DStream code to the Structured Streaming API
● By default, the window API doesn’t support processing time
● But we can emulate processing time by adding a time column derived from the processing time
● We will be using the current_timestamp() function of Spark SQL to generate the time column
● Ex: ProcessingTimeWindow
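The ProcessingTimeWindow example lives in the linked repository; a minimal sketch of the same idea could look as follows. The socket host/port and the processingTime column name are illustrative choices, not the talk’s exact code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, window}

object ProcessingTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]")
      .appName("ProcessingTimeWindow").getOrCreate()
    import spark.implicits._

    // Read lines from a socket; the data carries no timestamp of its own.
    val socketDs = spark.readStream
      .format("socket")
      .option("host", "localhost").option("port", 50050)
      .load().as[String]

    // Emulate processing time by attaching the engine's clock as a column.
    val withProcessingTime =
      socketDs.withColumn("processingTime", current_timestamp())

    // 15-second tumbling windows over the emulated processing-time column.
    val windowedCount = withProcessingTime
      .groupBy(window($"processingTime", "15 seconds"))
      .count()

    windowedCount.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
```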
Window on Ingestion Time
● The ingestion time abstraction is useful when each batch of data is captured in real time but takes a considerable amount of time to process
● Ingestion time helps us get better results than processing time, without worrying about out of order events as in event time
● Ingestion time support depends on the source
● In our example, we will use the socket stream, which supports it
● Ex: IngestionTimeWindow
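A sketch of the IngestionTimeWindow idea, assuming the socket source’s includeTimestamp option (host/port are illustrative): with that option, the source stamps each line on arrival, and we window on that column instead of the engine’s clock.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object IngestionTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]")
      .appName("IngestionTimeWindow").getOrCreate()
    import spark.implicits._

    // With includeTimestamp, the socket source emits (value, timestamp),
    // where timestamp is assigned at ingestion, not at processing.
    val socketDf = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .option("includeTimestamp", true)
      .load()

    // Window on the ingestion timestamp column.
    val windowedCount = socketDf
      .groupBy(window($"timestamp", "15 seconds"))
      .count()

    windowedCount.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
```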
Importance of Event Time
● Event time is the source of truth for all the events
● As more and more stream processing is sensitive to the time of capture, event time plays a big role
● Event time helps developers correlate events from various sources easily
● Correlation of events within and across sources helps developers build interesting streaming applications
● For this reason, event time is the default abstraction supported in Structured Streaming
Challenges of Event Time
● Event time is cool, but it complicates the design of stream processing frameworks
● Time passed at the source may differ from time passed in the processing engine
● How do we handle out of order events, and how long do we wait?
● How do we correlate events from sources which are running at their own speed?
● How do we reconcile event time with processing time?
Window on Event Time
● The event time will be a column embedded in the data itself
● The default window API is built for exactly this use case
● Windowing on event time makes sure that even when there is latency in the network, we process based on the actual time at the source rather than on the speed of the processing engine
● In our example, we analyse Apple stock data which embeds the tick time
● Ex: EventTimeExample
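A sketch of the EventTimeExample idea, assuming stock records arrive as "epochMillis,symbol,value" lines over a socket; the parsing, column names, and host/port are illustrative. The key point is that the window is computed over the tick time embedded in the data, not the engine’s clock.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object EventTimeWindowSketch {
  // A typed record whose `time` field is the event time from the source.
  case class Stock(time: Timestamp, symbol: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]")
      .appName("EventTimeWindow").getOrCreate()
    import spark.implicits._

    val socketDs = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 50050)
      .load().as[String]

    // Parse each line; the event time comes from the data itself.
    val stockDs = socketDs.map { line =>
      val parts = line.split(",")
      Stock(new Timestamp(parts(0).toLong), parts(1), parts(2).toDouble)
    }

    // 10-second windows over the embedded event-time column.
    val windowedMax = stockDs
      .groupBy(window($"time", "10 seconds"))
      .max("value")

    windowedMax.writeStream.format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
```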
Late Events
● Whenever we use event time, the challenge is how to handle late events
● By default, the event time window in Spark keeps windows around forever; that means we can handle late events forever
● From an application point of view it is great to make sure that we never miss any event at all
● Ex: EventTimeExample
Need of Watermarks
● Keeping windows around forever is great for the logic, but problematic from a resources point of view
● As each window creates state in Spark, the state keeps expanding as time passes
● This state keeps using more memory and makes recovery more difficult
● So we need a mechanism to restrict how long we keep windows around
● This mechanism is known as watermarks
Watermarks
● A watermark is a threshold which defines how long we wait for late events
● Using watermarks with event time makes sure Spark drops the window state once this threshold has passed at the source
● Spark will maintain state and allow late data to update the state for a window ending at time T until (max event time seen by the engine - late threshold) > T
● Ex: WaterMarkExample
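A minimal watermark sketch, assuming "epochMillis,value" lines over a socket (column names, durations, and host/port are illustrative): withWatermark must be declared on the same column the event-time window uses.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]")
      .appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    val socketDs = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 50050)
      .load().as[String]

    // Parse "epochMillis,value" into an event-time column plus a value.
    val events = socketDs.map { line =>
      val parts = line.split(",")
      (new Timestamp(parts(0).toLong), parts(1).toDouble)
    }.toDF("eventTime", "value")

    // Drop a window's state once the max event time seen exceeds the
    // window end by more than 30 seconds; later records are ignored.
    val windowedSum = events
      .withWatermark("eventTime", "30 seconds")
      .groupBy(window($"eventTime", "10 seconds"))
      .sum("value")

    // State cleanup via watermark takes effect in update/append mode.
    windowedSum.writeStream.format("console")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}
```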
Need of Timeless Windows
● Most streaming applications use time as the criterion for most of their analysis
● But there are use cases in streaming where the state is not bounded by time
● In these scenarios, we need a mechanism where we can define a window using the non-time part of the data
● In the DStream API this was tricky, but with Structured Streaming we can define it easily
Sessionization
● A session is often a period of time that captures the different interactions with an application from a user
● In an online portal, a session normally starts when the user logs into the application and is torn down when the user logs out, or it expires when there is no activity for some time
● A session is not a purely time based interaction, as different sessions can go on for different lengths of time
Session Window
● A session window is a window which allows us to group different records from the stream for a specific session
● The window starts when the session starts and is evaluated when the session ends
● The window also supports tracking multiple sessions at the same time
● Session windows are often used to analyze user behavior across multiple interactions bounded by a session
Custom State Management
● There is no direct API to define non time based windows in Structured Streaming
● As windows are internally represented using state, we need to use custom state management to do non time windows
● In Structured Streaming, the mapGroupsWithState API allows developers to do custom state management
● This API behaves similarly to mapWithState from the DStream API
Modeling User Session
● case class Session(sessionId: String, value: Double, endSignal: Option[String])
● sessionId uniquely identifies the given session
● value is the data that is captured for the given session
● endSignal is the explicit signal from the application for the end of the session
● This endSignal can be a log out event or the completion of a transaction, etc.
● Timeout will not be part of the record
State Management Models
● Whenever we do custom state management, we need to define two different models
● One keeps around SessionInfo, which tracks the overall sum:
  case class SessionInfo(totalSum: Double)
● The SessionUpdate model communicates the updates calculated for each batch:
  case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)
State Management
● We group records by sessionId
● We use the mapGroupsWithState API to go through each record from the batch belonging to a specific session id
● For each group, we check whether it has expired based on the data
● If expired, we call state.remove to drop the state
● If not expired, we call state.update to update the state with new data
● Ex: SessionisationExample
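The per-session logic described above can be factored as a pure function that a mapGroupsWithState call would invoke for each group: fold the batch’s records into the running SessionInfo, and report whether an end signal appeared. The case class names follow the slides; the fold itself is a sketch, not the repository’s exact code.

```scala
case class Session(sessionId: String, value: Double, endSignal: Option[String])
case class SessionInfo(totalSum: Double)
case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)

// Pure per-batch update for one session: sum the batch's values into the
// previous state (if any) and mark the session expired when any record
// carries an explicit end signal, such as a logout event.
def updateSession(id: String,
                  events: Seq[Session],
                  state: Option[SessionInfo]): SessionUpdate = {
  val batchSum = events.map(_.value).sum
  val total = state.map(_.totalSum).getOrElse(0.0) + batchSum
  val ended = events.exists(_.endSignal.isDefined)
  SessionUpdate(id, total, expired = ended)
}
```

Inside mapGroupsWithState, the wrapper would call state.remove() when the returned update is expired, and state.update(SessionInfo(total)) otherwise, matching the steps listed above.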