Data streaming has become the need of the hour. But do we really know how streaming works? What are its benefits? Where and how should data be streamed in a big data architecture? How can streamed data be processed efficiently? What challenges arise when moving from batch processing to stream processing? What are stateful and stateless stream processing, and which one should you choose, and when? Let us address all these questions.
1. Let's get to know Streaming: A developer's point of view
Presented By: Anuj & Jashan
2. Our Agenda
01 Streaming: What, Why, Benefits
02 Different Architectures
03 Challenges in Stream Processing
04 Types of Stream Processing: Stateless & Stateful
05 Stateful Stream Processing: Elaborated
3. What is Data Streaming?
● A continuous flow of data is called data streaming.
● Example: the surge in IoT devices has caused far more data to be gathered.
● Data gathered in real time can be processed to produce real-time results.
● Stream processing is the practice of acting on a series of data at the time the data is created.
4. Why Data Streaming?
● Provides insights faster.
● Handles never-ending streams of events.
● Makes it easy to inspect data from multiple streams simultaneously.
● Stream processing can work with much less hardware than batch processing.
● Encourages designing the data processing engine with infinite data sets in mind.
5. Streaming: Benefits
● Much less hardware required.
● Real-time fraud and anomaly detection.
● Internet of Things (IoT) real-time analytics.
● Real-time personalization, marketing, and advertising.
6. Evolution of Stream Processing
● Beginning: Fortran / C. Simple processing started.
● 1970: SQL and RDBMS. Databases invented.
● Early 2000s: Batch processing. Bulk processing and Big Data, e.g. MapReduce.
● 2015: Streaming or micro-batching. Stream processing started showing promise.
● Current: Streaming SQL. Unified processing.
13. Stream Processing Challenges
● Stream joins: joining data from two streams.
● Aggregations: aggregation operations for SQL.
● Late data: data received at a later time than the actual event time.
● Deduplication: removing duplicate data in the stream.
● Fast incoming data: the stream processor is not up to speed.
● Fault tolerance
14. Solution in Streaming
● Stream joins: managing state in streaming, watermarking.
● Aggregations: apply grouping (window) and watermarking.
● Late data: managing state in streaming, window, watermarking.
● Deduplication: managing state in streaming with watermarking.
● Fast incoming data: backpressure.
● Fault tolerance: checkpointing.
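To illustrate the backpressure entry above, here is a minimal Python sketch using only the standard library: a bounded buffer makes a fast producer wait for the consumer instead of letting the buffer grow without limit. The names `run_pipeline` and `max_buffered` and the doubling step are assumptions made for this example; they do not come from any particular streaming engine.

```python
import queue
import threading

def run_pipeline(events, max_buffered=2):
    """Push events through a bounded buffer; return (results, peak buffer size)."""
    buffer = queue.Queue(maxsize=max_buffered)  # bounded: put() blocks when full
    processed = []
    peak = 0
    done = object()  # sentinel that marks the end of the stream

    def consumer():
        while True:
            item = buffer.get()
            if item is done:
                break
            processed.append(item * 2)  # stand-in for real processing work

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        buffer.put(event)  # blocks while the buffer is full: this is the backpressure
        peak = max(peak, buffer.qsize())
    buffer.put(done)
    worker.join()
    return processed, peak

processed, peak = run_pipeline(range(10), max_buffered=2)
```

Because `put()` blocks, the producer can never run more than `max_buffered` events ahead of the consumer.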
18. Stateless Stream Processing
● What: straightforward streaming in which no state needs to be maintained.
● Where: operations performed on each individual message/event, such as filter and select.
● When: the result does not depend on previous events.
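A minimal Python sketch of a stateless step: each event is filtered and projected on its own, and nothing is remembered between events. The event fields (`user`, `amount`, `country`) and the threshold are hypothetical, chosen only for illustration.

```python
def stateless_pipeline(events, min_amount=100):
    """Filter and project each event independently; no state is kept."""
    for event in events:
        if event["amount"] > min_amount:  # filter
            # select: keep only the fields of interest
            yield {"user": event["user"], "amount": event["amount"]}

events = [
    {"user": "u1", "amount": 50, "country": "IN"},
    {"user": "u2", "amount": 150, "country": "US"},
    {"user": "u1", "amount": 300, "country": "IN"},
]
large_payments = list(stateless_pipeline(events))
```

Processing the events in any order, or on separate machines, gives the same per-event result, which is what makes stateless steps easy to scale.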
19. Stateful Stream Processing
● What: streaming that maintains state in order to perform aggregations, deduplication, and joins.
● Where: operations such as groupBy and count.
● When: the result depends on previous events.
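By contrast, a minimal Python sketch of a stateful step: a running count per key, where each output depends on every earlier event with the same key. The `user` field is a hypothetical key used only for this example.

```python
from collections import defaultdict

def running_counts(events):
    """Emit (key, count so far) for each event; the dict is the operator's state."""
    counts = defaultdict(int)  # state that survives across events
    for event in events:
        counts[event["user"]] += 1
        yield event["user"], counts[event["user"]]

events = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}, {"user": "u1"}]
results = list(running_counts(events))
```

The `counts` dictionary is exactly the kind of state that a streaming engine has to store, bound in size, and recover after a failure.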
22. Windowing
● 01 Windowing by processing time vs event time: both window types can be computed with respect to either processing time or event time.
● 02 Fixed window (tumbling window): the simplest and most straightforward window.
● 03 Sliding window (hopping window): slightly more complex than the fixed window; it involves two parameters, the window and the slide.
24. Windowing by Processing Time vs Event Time
Processing Time Window
● A processing-time window is based on the wall-clock time at which events are processed.
● All late events are placed into the current window.
● Out-of-order events are not reordered.
Event Time Window
● An event-time window is based on the time at which each event was produced.
● Each event is placed in the window it belongs to.
● Out-of-order events are reordered.
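The difference can be sketched in a few lines of Python by grouping the same events once by arrival time and once by production time. The field names `event_time` and `processing_time`, the 10-second window, and the sample events are assumptions made for this illustration.

```python
from collections import defaultdict

WINDOW = 10  # window length in seconds (illustrative)

def window_start(ts, size=WINDOW):
    """Start of the fixed window that contains timestamp ts."""
    return ts - ts % size

def group_by(events, time_field):
    """Group event ids into windows keyed by the chosen notion of time."""
    windows = defaultdict(list)
    for event in events:
        windows[window_start(event[time_field])].append(event["id"])
    return dict(windows)

events = [
    {"id": "a", "event_time": 1, "processing_time": 2},
    {"id": "b", "event_time": 12, "processing_time": 13},
    {"id": "c", "event_time": 8, "processing_time": 14},  # late and out of order
]
by_processing = group_by(events, "processing_time")
by_event = group_by(events, "event_time")
```

Event `c` was produced at second 8 but arrived at second 14. By processing time it lands in the current window starting at 10; by event time it goes back to the window starting at 0, where it belongs.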
25. Fixed Window (Tumbling Window)
Fixed/tumbling: time is partitioned into same-length, non-overlapping chunks. Each event belongs to exactly one window.
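A minimal Python sketch of tumbling-window assignment, assuming integer timestamps and half-open windows `[start, end)`; the function names are illustrative.

```python
from collections import Counter

def tumbling_window(ts, size):
    """Return the single [start, end) window of the given size that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def count_per_window(timestamps, size):
    """Count events per tumbling window."""
    return Counter(tumbling_window(ts, size) for ts in timestamps)

counts = count_per_window([1, 4, 5, 12], size=5)
```

Integer division maps every timestamp to exactly one window start, which is why no event is ever counted twice.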
27. Sliding Window (Hopping Window)
Sliding: windows have a fixed length, but their start times are separated by a time interval (the step) that can be smaller than the window length. Typically the window interval (length) is a multiple of the step. Each event then belongs to [window interval]/[window step] windows.
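A minimal Python sketch of sliding-window assignment, assuming integer timestamps, half-open windows `[start, end)`, and window starts aligned to multiples of the step; the function name is illustrative.

```python
def sliding_windows(ts, length, step):
    """Return every [start, end) window of the given length that contains ts.

    Window starts are multiples of step, so start values can be negative
    for timestamps close to zero.
    """
    windows = []
    start = (ts // step) * step  # latest window start that is <= ts
    while start > ts - length:   # window still covers ts
        windows.append((start, start + length))
        start -= step
    return sorted(windows)

windows = sliding_windows(12, length=10, step=5)
```

With a length of 10 and a step of 5, each event falls into 10 / 5 = 2 windows, matching the ratio stated on the slide.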
30. Watermarking
● Data newer than the watermark may be late, but is still allowed into the aggregation.
● Windows older than the watermark are automatically deleted to limit the amount of intermediate state.
● Trade-off: handling more late data means keeping more state; reducing the state means tolerating less lateness.
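The mechanism can be sketched in plain Python. This assumes the watermark is defined as the largest event time seen so far minus an allowed lateness, which is a common definition but not one stated on the slide. The class and attribute names are hypothetical.

```python
from collections import defaultdict

class WatermarkedCounter:
    """Count events per tumbling window, dropping state behind the watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> count (the state)
        self.finalized = {}                   # results of closed windows
        self.dropped = 0                      # events that arrived too late

    @property
    def watermark(self):
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time):
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.dropped += 1  # the window is already closed: too late
            return
        self.open_windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Close every window that now lies entirely behind the watermark.
        closed = [s for s in self.open_windows
                  if s + self.window_size <= self.watermark]
        for s in closed:
            self.finalized[s] = self.open_windows.pop(s)

counter = WatermarkedCounter(window_size=10, allowed_lateness=5)
for t in [1, 12, 3, 27, 4]:
    counter.add(t)
```

The event at time 3 is late but arrives while the watermark is at 7, so it is still counted. The event at time 4 arrives after the watermark has moved to 22, so it is dropped. A larger `allowed_lateness` would accept it at the cost of keeping window state longer.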
32. Stateful Streaming: Deduplication
● Drop duplicate records in a stream.
● Specify the columns that uniquely identify a record.
● The state stores the unique keys seen in the stream, and any record whose key matches the state is dropped.
33. Stateful Streaming: Deduplication
● Too large a key set in the deduplication state makes the streaming job unstable.
● Solution: drop the state after a specified period.
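A minimal Python sketch of deduplication with state that expires, assuming expiry is driven by event time: a key is forgotten once it is older than the newest event by more than `retention`. The class name and the retention value are illustrative only.

```python
class ExpiringDeduplicator:
    """Drop duplicate keys, forgetting keys older than a retention period."""

    def __init__(self, retention):
        self.retention = retention
        self.seen = {}  # key -> event time at which it was first seen (the state)
        self.max_event_time = float("-inf")

    def process(self, key, event_time):
        """Return True if the record is new, False if it is a duplicate."""
        self.max_event_time = max(self.max_event_time, event_time)
        cutoff = self.max_event_time - self.retention
        # Drop state older than the cutoff so the key set stays bounded.
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        if key in self.seen:
            return False
        self.seen[key] = event_time
        return True

dedup = ExpiringDeduplicator(retention=10)
results = [
    dedup.process("a", 1),   # new
    dedup.process("a", 2),   # duplicate
    dedup.process("b", 5),   # new
    dedup.process("c", 20),  # new; "a" and "b" expire
    dedup.process("a", 21),  # accepted again because its state was dropped
]
```

The last call shows the trade-off: once the state for a key has been dropped, a very late duplicate is no longer detected.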
34. Stateful Streaming: Joins
● Each stream should buffer its events in state so that they can be matched with future events from the other stream.
36. Stateful Streaming: Joins
● Impressions can be 2 hours late
● Clicks can be 3 hours late
● Clicks can occur within 1 hour after the
corresponding impression
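These three constraints determine how long each side has to be buffered. An impression at time t can match clicks up to t + 1 hour, and those clicks may be 3 hours late, so impressions are kept for up to 4 hours of event time. Clicks only need to wait for impressions that are up to 2 hours late. A minimal Python sketch under those assumptions follows; the `ad_id` and `time` fields and the class name are hypothetical.

```python
HOUR = 3600  # seconds

IMPRESSION_LATENESS = 2 * HOUR  # impressions can be 2 hours late
CLICK_LATENESS = 3 * HOUR       # clicks can be 3 hours late
MAX_CLICK_DELAY = 1 * HOUR      # click occurs within 1 hour after the impression

def is_match(impression, click):
    return (impression["ad_id"] == click["ad_id"]
            and impression["time"] <= click["time"]
            <= impression["time"] + MAX_CLICK_DELAY)

class ImpressionClickJoin:
    """Symmetric stream-stream join that buffers both sides in state."""

    def __init__(self):
        self.impressions = []  # buffered impressions (state)
        self.clicks = []       # buffered clicks (state)
        self.max_impression_time = float("-inf")
        self.max_click_time = float("-inf")

    def on_impression(self, impression):
        self.max_impression_time = max(self.max_impression_time, impression["time"])
        matches = [(impression, c) for c in self.clicks if is_match(impression, c)]
        self.impressions.append(impression)
        self._evict()
        return matches

    def on_click(self, click):
        self.max_click_time = max(self.max_click_time, click["time"])
        matches = [(i, click) for i in self.impressions if is_match(i, click)]
        self.clicks.append(click)
        self._evict()
        return matches

    def _evict(self):
        click_watermark = self.max_click_time - CLICK_LATENESS
        impression_watermark = self.max_impression_time - IMPRESSION_LATENESS
        # Keep an impression while a click within its 1-hour range can still arrive.
        self.impressions = [i for i in self.impressions
                            if i["time"] + MAX_CLICK_DELAY >= click_watermark]
        # Keep a click while an earlier impression can still arrive.
        self.clicks = [c for c in self.clicks if c["time"] >= impression_watermark]

join = ImpressionClickJoin()
first = join.on_impression({"ad_id": 1, "time": 0})
second = join.on_click({"ad_id": 1, "time": 1800})
third = join.on_click({"ad_id": 1, "time": 2 * HOUR})
fourth = join.on_click({"ad_id": 2, "time": 5 * HOUR})
```

The click after 30 minutes matches the impression, while the click after 2 hours does not. Once a click with event time 5 hours arrives, the impression at time 0 is more than 4 hours behind and is evicted from state.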
37. Some Use Cases of Streaming
● Algorithmic trading, stock market surveillance
● Smart patient care
● Monitoring a production line
● Supply chain optimization
● Intrusion, surveillance, and fraud detection (e.g. Uber)
● Most smart device applications: smart car, smart home, etc.
● Smart grid (e.g. load prediction and outlier plug detection; see Smart grids, 4 billion events, throughput in the range of 100Ks)
● Traffic monitoring, geofencing, vehicle and wildlife tracking (e.g. TFL London Transport Management System)
● Sports analytics: augmenting sports with real-time analytics (e.g. work we did with a real football game, Overlaying real time analytics on Football Broadcasts)
● Context-aware promotions and advertising
● Computer system and network monitoring
● Predictive maintenance (e.g. Machine Learning Techniques for Predictive Maintenance)
● Geospatial data processing