SlideShare une entreprise Scribd logo
1  sur  32
Aljoscha Krettek, Software Engineer
Stream Processing for the
Practitioner: Blueprints for Common
Stream Processing Use Cases with
Apache Flink®
© 2018 data Artisans2
What is stream processing and
why‘s it useful?
© 2018 data Artisans3
Use Case: “Suspicious Behaviour“ Detection
Some service or
app
• Dropbox, Google Suite, Box
• Sharing, accessing, and modifying
data produces events that we
can/want to analyse
• Banking
• We monitor all transactions, know
the user data
© 2018 data Artisans4
Use Case: “Suspicious Behaviour“ Detection
Some service or
app
• Spits out alerts when suspicious stuf
is happening
• “More than 10 failed login
attempts”
• ”Sharing more than 100 files
within 1 Hour”
• “ImpossibleTravel”/”Magic
carpet travel”
• “Continuously increasing
withdrawal amounts”
“Magic” rules
engine
© 2018 data Artisans5
Use Case: “Suspicious Behaviour“ Detection
• What if my workload becomes way too high?  I’ll use a distributed batch
processor. (Hadoop MapReduce, Spark, and the like)
• This runs nightly? Isn’t the latency very high then? I don’t want to wait a day
for my alerts 😳
Ok, let me just use a single machine for this!
I need alerts in real time  Apache Flink® is a real-time, distributed,
stateful, and fault-tolerant stream processor
© 2018 data Artisans6
Detour: Thinking in Flows
TransformFilter
Data
Ingress
Apply
Rules
Persist to
database
Send
Alerts
source
sink
State
user operations/user code
flow of data / events
© 2018 data Artisans7
Why flows and operations?
• Operations can be composed and are reusable
• Makes it easy to think about the flow of data
• Allows a system (Flink) to take these operations and execute them on
different machines
• A system can execute the same operation multiple times on several
machines to deal with high workloads
© 2018 data Artisans8
Apache Flink® distributed stateful stream processor
Machine 1
Machine 2
Flink has nice APIs for writing these!
Note how we have the
“expensive” stateful
operation twice
© 2018 data Artisans9
Apache Flink® distributed stateful stream processor
• Questions that a good stream processing systems needs to have answers
for:
‒What happens when machines fail or when user operations fail?
‒What happens if I need to move my stateful operations/flows/jobs?
‒What happens if I need to change the schema of the state that
operations keep?
‒How can I update framework code while keeping my program state?
‒Same for user code?
© 2018 data Artisans10
Common stream processing
blueprints
© 2018 data Artisans11
Blueprint: Aggregation of timestamped data
• Use cases
‒Give me the number of tweet impressions per tweet for every
hour/day/…
‒Calculate the average temperature over 10 minute intervals for each
sensor in my warehouse
‒Aggregate user interaction data for my website to display on my internal
dashboards
© 2018 data Artisans12
Blueprint: Aggregation of timestamped data
windowed
aggregation
source sink
state: contents of all the in-flight windows
© 2018 data Artisans13
Blueprint: Aggregation of timestamped data
• Do I want to window by event-time or processing time?
• If using event-time, how do I know when my window is “done”?
• What happens if data arrives out of order with respect to their timestamp?
• If using event-time, when is data considered late?
• What should happen with late data?
Some things to look out for.
© 2018 data Artisans14
Blueprint: Aggregation of timestamped data
• Windowing API
• Timestamp assigners/watermark extractors for defining event-time and
defining “readiness”
• Allowed lateness for defining when data is late
• Side output of late data as a special flow path
Flink features to look at.
© 2018 data Artisans15
Blueprint: Aggregation of timestamped data
windowed
aggregation
kinesis
write to
Elastic
alert real
humanslate data
allowed lateness: 10 min
extract timestamps/watermarks
side output
© 2018 data Artisans16
Blueprint: Enriching data with “side input”
• Use cases
‒Enrich user events with known user data
‒Add geolocation information to geotagged events
© 2018 data Artisans17
Blueprint: Enriching data with “side input”
filter enrich
© 2018 data Artisans18
Blueprint: Enriching data with “side input”
filter enrich
Naïve approach
synchronous access to
external data store for
every element
© 2018 data Artisans19
Blueprint: Enriching data with “side input”
filter enrich
Slightly better approach
asynchronous access to
external data store for
every element
© 2018 data Artisans20
Blueprint: Enriching data with “side input”
filter enrich
“Next-level” approach
keep the enrichment
data in Flink state itselfchangelog
input
© 2018 data Artisans21
Blueprint: Enriching data with “side input”
• Regular user functions/operations
• Async I/O operation for more efficient data store accesses
• Two-input operations and stateful operations
Flink features to look at.
© 2018 data Artisans22
Blueprint: Dynamic processing
• Use cases
‒Update of processing rules via DSL, think dynamic fraud-detection
rules/policies
‒Live-update of machine learning models
© 2018 data Artisans23
Blueprint: Dynamic processing
pre-
processing
dynamic
processing
rules
input
broadcast
stream
broadcast
state
© 2018 data Artisans24
Blueprint: Dynamic processing
• ProcessFunction
• Broadcast streams and broadcast state
Flink features to look at.
© 2018 data Artisans25
Closing
© 2018 data Artisans26
Learnings
• When you want immediate results you probably need a stream processor
• Getting state, fault-tolerance, and event-time right is tough, check what
your stream processor has as answers for the though questions
• Start thinking in terms of data flows and reusable operations that you can
mix and match
• Flink has your use cases covered
Thank you!
aljoscha@apache.org
@dataArtisans
@ApacheFlink
We are hiring!
data-artisans.com/careers
© 2018 data Artisans29
About Data Artisans
Original creators of
Apache Flink®
Open Source Apache Flink
+ dA Application Manager
© 2018 data Artisans30
dA platform
data-artisans.com/download
© 2018 data Artisans31
Powered by Apache Flink
© 2018 data Artisans32
Download the free book
info.data-artisans.com/book

Contenu connexe

Tendances

Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpJoseph Arriola
 
Spark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun ConnollySpark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun ConnollySpark Summit
 
Snaplogic Live: Big Data in Motion
Snaplogic Live: Big Data in MotionSnaplogic Live: Big Data in Motion
Snaplogic Live: Big Data in MotionSnapLogic
 
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Databricks
 
Real-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkReal-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkSingleStore
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
 
2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modelingyalisassoon
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)Albert Wong
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataVoltDB
 
Driving the On-Demand Economy with Spark and Predictive Analytics
Driving the On-Demand Economy with Spark and Predictive AnalyticsDriving the On-Demand Economy with Spark and Predictive Analytics
Driving the On-Demand Economy with Spark and Predictive AnalyticsSingleStore
 
ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData BerlinShimon Tolts
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event datayalisassoon
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlSpark Summit
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
Real-Time Geospatial Intelligence at Scale
Real-Time Geospatial Intelligence at Scale Real-Time Geospatial Intelligence at Scale
Real-Time Geospatial Intelligence at Scale SingleStore
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTSingleStore
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsITProceed
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark Summit
 

Tendances (20)

Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Spark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun ConnollySpark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun Connolly
 
Snaplogic Live: Big Data in Motion
Snaplogic Live: Big Data in MotionSnaplogic Live: Big Data in Motion
Snaplogic Live: Big Data in Motion
 
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
 
Real-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkReal-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and Spark
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
 
2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time Data
 
Ford
FordFord
Ford
 
Driving the On-Demand Economy with Spark and Predictive Analytics
Driving the On-Demand Economy with Spark and Predictive AnalyticsDriving the On-Demand Economy with Spark and Predictive Analytics
Driving the On-Demand Economy with Spark and Predictive Analytics
 
ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData Berlin
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event data
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Real-Time Geospatial Intelligence at Scale
Real-Time Geospatial Intelligence at Scale Real-Time Geospatial Intelligence at Scale
Real-Time Geospatial Intelligence at Scale
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoT
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony Baer
 

Similaire à Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

Advanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAljoscha Krettek
 
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGBig Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGMatt Stubbs
 
The Past, Present, and Future of Apache Flink®
The Past, Present, and Future of Apache Flink®The Past, Present, and Future of Apache Flink®
The Past, Present, and Future of Apache Flink®Aljoscha Krettek
 
The Past, Present, and Future of Apache Flink
The Past, Present, and Future of Apache FlinkThe Past, Present, and Future of Apache Flink
The Past, Present, and Future of Apache FlinkAljoscha Krettek
 
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Big Data Spain
 
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...Kai Wähner
 
Real Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalReal Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalVMware Tanzu Korea
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentCA | Automic Software
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service ManagementIBM Danmark
 
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...Grid Dynamics
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
Потоковая обработка больших данных
Потоковая обработка больших данныхПотоковая обработка больших данных
Потоковая обработка больших данныхCEE-SEC(R)
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
 
In-Stream Processing Service Blueprint, Reference architecture for real-time ...
In-Stream Processing Service Blueprint, Reference architecture for real-time ...In-Stream Processing Service Blueprint, Reference architecture for real-time ...
In-Stream Processing Service Blueprint, Reference architecture for real-time ...Grid Dynamics
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMatillion
 

Similaire à Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink (20)

Advanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applications
 
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGBig Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
 
The Past, Present, and Future of Apache Flink®
The Past, Present, and Future of Apache Flink®The Past, Present, and Future of Apache Flink®
The Past, Present, and Future of Apache Flink®
 
The Past, Present, and Future of Apache Flink
The Past, Present, and Future of Apache FlinkThe Past, Present, and Future of Apache Flink
The Past, Present, and Future of Apache Flink
 
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...
 
Real Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalReal Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from Pivotal
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop Agent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service Management
 
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Потоковая обработка больших данных
Потоковая обработка больших данныхПотоковая обработка больших данных
Потоковая обработка больших данных
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
In-Stream Processing Service Blueprint, Reference architecture for real-time ...
In-Stream Processing Service Blueprint, Reference architecture for real-time ...In-Stream Processing Service Blueprint, Reference architecture for real-time ...
In-Stream Processing Service Blueprint, Reference architecture for real-time ...
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 

Plus de Aljoscha Krettek

Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Aljoscha Krettek
 
The Evolution of (Open Source) Data Processing
The Evolution of (Open Source) Data ProcessingThe Evolution of (Open Source) Data Processing
The Evolution of (Open Source) Data ProcessingAljoscha Krettek
 
(Past), Present, and Future of Apache Flink
(Past), Present, and Future of Apache Flink(Past), Present, and Future of Apache Flink
(Past), Present, and Future of Apache FlinkAljoscha Krettek
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkAljoscha Krettek
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache FlinkAljoscha Krettek
 
Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Aljoscha Krettek
 
Apache Flink - A Stream Processing Engine
Apache Flink - A Stream Processing EngineApache Flink - A Stream Processing Engine
Apache Flink - A Stream Processing EngineAljoscha Krettek
 
Adventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and WindowsAdventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and WindowsAljoscha Krettek
 
Flink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesFlink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesAljoscha Krettek
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Aljoscha Krettek
 

Plus de Aljoscha Krettek (12)

Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
 
The Evolution of (Open Source) Data Processing
The Evolution of (Open Source) Data ProcessingThe Evolution of (Open Source) Data Processing
The Evolution of (Open Source) Data Processing
 
(Past), Present, and Future of Apache Flink
(Past), Present, and Future of Apache Flink(Past), Present, and Future of Apache Flink
(Past), Present, and Future of Apache Flink
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on Flink
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache Flink
 
Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)
 
Apache Flink - A Stream Processing Engine
Apache Flink - A Stream Processing EngineApache Flink - A Stream Processing Engine
Apache Flink - A Stream Processing Engine
 
Adventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and WindowsAdventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and Windows
 
Flink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesFlink 0.10 - Upcoming Features
Flink 0.10 - Upcoming Features
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)
 
Apache Flink Hands-On
Apache Flink Hands-OnApache Flink Hands-On
Apache Flink Hands-On
 

Dernier

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Dernier (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

  • 1. Aljoscha Krettek, Software Engineer Stream Processing for the Practitioner: Blueprints for Common Stream Processing Use Cases with Apache Flink®
  • 2. © 2018 data Artisans2 What is stream processing and why‘s it useful?
  • 3. © 2018 data Artisans3 Use Case: “Suspicious Behaviour“ Detection Some service or app • Dropbox, Google Suite, Box • Sharing, accessing, and modifying data produces events that we can/want to analyse • Banking • We monitor all transactions, know the user data
  • 4. © 2018 data Artisans4 Use Case: “Suspicious Behaviour“ Detection Some service or app • Spits out alerts when suspicious stuf is happening • “More than 10 failed login attempts” • ”Sharing more than 100 files within 1 Hour” • “ImpossibleTravel”/”Magic carpet travel” • “Continuously increasing withdrawal amounts” “Magic” rules engine
  • 5. © 2018 data Artisans5 Use Case: “Suspicious Behaviour“ Detection • What if my workload becomes way too high?  I’ll use a distributed batch processor. (Hadoop MapReduce, Spark, and the like) • This runs nightly? Isn’t the latency very high then? I don’t want to wait a day for my alerts 😳 Ok, let me just use a single machine for this! I need alerts in real time  Apache Flink® is a real-time, distributed, stateful, and fault-tolerant stream processor
  • 6. © 2018 data Artisans6 Detour: Thinking in Flows TransformFilter Data Ingress Apply Rules Persist to database Send Alerts source sink State user operations/user code flow of data / events
  • 7. © 2018 data Artisans7 Why flows and operations? • Operations can be composed and are reusable • Makes it easy to think about the flow of data • Allows a system (Flink) to take these operations and execute them on different machines • A system can execute the same operation multiple times on several machines to deal with high workloads
  • 8. © 2018 data Artisans8 Apache Flink® distributed stateful stream processor Machine 1 Machine 2 Flink has nice APIs for writing these! Note how we have the “expensive” stateful operation twice
  • 9. © 2018 data Artisans9 Apache Flink® distributed stateful stream processor • Questions that a good stream processing systems needs to have answers for: ‒What happens when machines fail or when user operations fail? ‒What happens if I need to move my stateful operations/flows/jobs? ‒What happens if I need to change the schema of the state that operations keep? ‒How can I update framework code while keeping my program state? ‒Same for user code?
  • 10. © 2018 data Artisans10 Common stream processing blueprints
  • 11. © 2018 data Artisans11 Blueprint: Aggregation of timestamped data • Use cases ‒Give me the number of tweet impressions per tweet for every hour/day/… ‒Calculate the average temperature over 10 minute intervals for each sensor in my warehouse ‒Aggregate user interaction data for my website to display on my internal dashboards
  • 12. © 2018 data Artisans12 Blueprint: Aggregation of timestamped data windowed aggregation source sink state: contents of all the in-flight windows
  • 13. © 2018 data Artisans13 Blueprint: Aggregation of timestamped data • Do I want to window by event-time or processing time? • If using event-time, how do I know when my window is “done”? • What happens if data arrives out of order with respect to their timestamp? • If using event-time, when is data considered late? • What should happen with late data? Some things to look out for.
  • 14. © 2018 data Artisans14 Blueprint: Aggregation of timestamped data • Windowing API • Timestamp assigners/watermark extractors for defining event-time and defining “readiness” • Allowed lateness for defining when data is late • Side output of late data as a special flow path Flink features to look at.
  • 15. © 2018 data Artisans15 Blueprint: Aggregation of timestamped data windowed aggregation kinesis write to Elastic alert real humanslate data allowed lateness: 10 min extract timestamps/watermarks side output
  • 16. © 2018 data Artisans16 Blueprint: Enriching data with “side input” • Use cases ‒Enrich user events with known user data ‒Add geolocation information to geotagged events
  • 17. © 2018 data Artisans17 Blueprint: Enriching data with “side input” filter enrich
  • 18. © 2018 data Artisans18 Blueprint: Enriching data with “side input” filter enrich Naïve approach synchronous access to external data store for every element
  • 19. © 2018 data Artisans19 Blueprint: Enriching data with “side input” filter enrich Slightly better approach asynchronous access to external data store for every element
  • 20. © 2018 data Artisans20 Blueprint: Enriching data with “side input” filter enrich “Next-level” approach keep the enrichment data in Flink state itselfchangelog input
  • 21. © 2018 data Artisans21 Blueprint: Enriching data with “side input” • Regular user functions/operations • Async I/O operation for more efficient data store accesses • Two-input operations and stateful operations Flink features to look at.
  • 22. © 2018 data Artisans22 Blueprint: Dynamic processing • Use cases ‒Update of processing rules via DSL, think dynamic fraud-detection rules/policies ‒Live-update of machine learning models
  • 23. © 2018 data Artisans23 Blueprint: Dynamic processing pre- processing dynamic processing rules input broadcast stream broadcast state
  • 24. © 2018 data Artisans24 Blueprint: Dynamic processing • ProcessFunction • Broadcast streams and broadcast state Flink features to look at.
  • 25. © 2018 data Artisans25 Closing
  • 26. © 2018 data Artisans26 Learnings • When you want immediate results you probably need a stream processor • Getting state, fault-tolerance, and event-time right is tough, check what your stream processor has as answers for the though questions • Start thinking in terms of data flows and reusable operations that you can mix and match • Flink has your use cases covered
  • 27.
  • 29. © 2018 data Artisans29 About Data Artisans Original creators of Apache Flink® Open Source Apache Flink + dA Application Manager
  • 30. © 2018 data Artisans30 dA platform data-artisans.com/download
  • 31. © 2018 data Artisans31 Powered by Apache Flink
  • 32. © 2018 data Artisans32 Download the free book info.data-artisans.com/book

Notes de l'éditeur

  1. - Operations can be developed by different teams because of separation of concerns
  2. Explain how Flink has API for starting multiple I/O requests, keeping track of order and making sure that things are fault-tolerant.
  3. Explain how Flink has API for starting multiple I/O requests, keeping track of order and making sure that things are fault-tolerant.
  4. (Keep this slide up during the Q&A part of your talk. Having this up in the final 5-10 minutes of the session gives the audience something useful to look at.)
  5. • data Artisans was founded by the original creators of Apache Flink • We provide dA Platform, a complete stream processing infrastructure with open-source Apache Flink
  6. • Also included is the Application Manager, which turns dA Platform into a self-service platform for stateful stream processing applications. • dA Platform is generally available, and you can download a free trial today!
  7. • These companies are among many users of Apache Flink, and during this conference you’ll meet folks from some of these companies as well as others using Flink. • If your company would like to be represented on the “Powered by Apache Flink” page, email me.
  8. (Optional slide – may not be appropriate for advanced audience. Helps us capture leads.)