From stream data management
To distributed dataflows
And beyond...
Vasiliki (vasia) Kalavri
(vkalavri@bu.edu)
Stream processing is an established technology in the
data analytics stack of the modern business
Traffic light adjustment in real time
Alibaba City Brain analyzes
vehicle locations to:

• clear paths for emergency
response vehicles

• provide scheduling information
for public transport

• recommend alternative routes
Read more: https://edition.cnn.com/2019/01/15/tech/alibaba-city-brain-hangzhou/index.html
Fault-detection for NASA’s Deep
Space Network
NASA’s DSN Complex Event Processing
analyzes real-time network data, predicted
antenna pointing parameters, and physical
hardware logs to:

• ingest, filter, store, and visualize all of the
DSN's monitor and control data

• ensure the successful DSN tracking,
ranging, and communication integrity of
dozens of concurrent deep-space missions
Read more: https://www.confluent.io/kafka-summit-san-francisco-2019/mission-critical-real-time-fault-detection-for-nasas-deep-space-network-using-apache-kafka/
• How did we get here?
• Are we there yet?
• What lies ahead?
SIGMOD ’92
[… A new class of queries, continuous queries, are similar to
conventional database queries, except that they are issued once and
henceforth run “continually” over the database …]
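The idea of a query that is issued once and then keeps running can be sketched with a Python generator. This is only an illustration of the concept in the quote, not Tapestry's design; the function name `continuous_query` and the record layout are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator


def continuous_query(
    stream: Iterable[dict], predicate: Callable[[dict], bool]
) -> Iterator[dict]:
    """Issued once; lazily yields each matching record as it arrives."""
    for record in stream:
        if predicate(record):
            yield record


def hot_sensors(stream: Iterable[dict], threshold: float) -> Iterator[str]:
    """Example standing query: names of sensors reporting above a threshold."""
    for record in continuous_query(stream, lambda r: r["temp"] > threshold):
        yield record["sensor"]
```

Because the generator pulls from `stream` lazily, the same query object keeps producing results for as long as new records keep arriving.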
Data Stream Management Systems
Timeline (figure; axis marks 1992, 2000, 2002, 2004, 2013, 2020): Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
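DSMS architecture (figure): input streams S1, S2, …, Sr arrive at an Input Manager. A Query Execution Engine, together with a Scheduler, a QoS Monitor, and a Load Shedder, evaluates ad-hoc or continuous queries Q1, Q2, …, Qm. Synopsis Maintenance keeps one synopsis per stream (Synopsis for S1, …, Synopsis for Sr), which yields fast approximate answers.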
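A per-stream synopsis that gives fast approximate answers can be illustrated with a Count-Min sketch. The slides do not name a specific synopsis, so this technique is chosen here for illustration only; the class name and the `width`/`depth` defaults are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency synopsis; estimates never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One independent hash per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Memory stays at `width * depth` counters no matter how many events pass through, which is the trade a synopsis makes: bounded space and constant-time updates in exchange for a possibly overestimated answer.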
Data Stream Management Systems
Timeline (figure; axis marks 1992, 2000, 2002, 2004, 2013, 2020): Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope, MapReduce, S4, Storm
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
λ-architecture (figure): an Input Manager feeds input data to two layers. The speed layer is a “best-effort” low-latency stream processor that returns fast approximate results. The batch layer is a MapReduce / batch processing system over persistent storage that returns slow exact results. Applications read from both layers.
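How an application might combine the two layers can be sketched as follows. This is a minimal illustration under assumptions: the batch view is taken to cover every event up to the last batch run, the speed view only the events since then, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def batch_view(stored_events: Iterable[dict]) -> Counter:
    """Batch layer: exact per-key counts recomputed over persistent storage."""
    return Counter(e["key"] for e in stored_events)


def speed_view(recent_events: Iterable[dict]) -> Counter:
    """Speed layer: counts over events that arrived after the last batch run.

    A best-effort processor may drop events under load, so this part of the
    answer is approximate.
    """
    return Counter(e["key"] for e in recent_events)


def serve(key: str, batch: Counter, speed: Counter) -> int:
    """Application query: merge the slow exact view with the fast recent one."""
    return batch[key] + speed[key]
```

The cost of this design is that the same logic lives in two systems and must be kept consistent between them.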
Timeline (figure; axis marks 1992, 2000, 2002, 2004, 2013, 2015, 2020)
Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
MapReduce, S4, Storm
Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, MillWheel, Google Dataflow, Timely Dataflow
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
Stream processing doesn’t necessarily need to
be approximate and lossy
DDS architecture (figure): a client submits (Q, config) through the Streaming APIs to a Coordinator, which schedules tasks on Workers, triggers checkpoints, and receives status. Each Worker runs several Tasks with a state store (put/get) and writes checkpoints to a Distributed File System. Workers read from event logs and sockets, exchange data over TCP, and send output to applications and sinks.
Are we there yet?
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
data parallelism
exactly-once fault-tolerance
state management
general-purpose languages
iterations UDFs
SIGMOD Record ’05
1. Process events online without storing them [slide annotation: persistently store events and state]
2. Support a high-level language (SQL-like) [slide annotation: Java, Scala, Python, and SQL-like]
3. Handle missing, out-of-order, delayed data [slide annotation: with tunable latency trade-offs]
4. Guarantee deterministic (on replay) and correct results (on recovery) [slide annotation: and exactly-once]
5. Combine batch and stream processing [slide annotation: batch is a special case of streaming]
6. Ensure availability despite failures [slide annotation: and exactly-once state updates]
7. Support distribution and automatic elasticity
8. Offer low latency [slide annotation: high throughput and “acceptable” latency]
Some of my recent and ongoing work
Re-configurable Stream Processing (figure): an instrumented stream processor sends performance metrics to a Profiler and an Analyzer; the resulting decision invokes a re-configuration of the job. Targets: automatic scaling, adaptive scheduling, straggler mitigation, query optimization.
Automatic elasticity and reconfiguration
• heuristic policies, e.g. if CPU > 80% => scale
• stop-and-restart migration and reconfiguration

• Accuracy: no over/under-provisioning
• Stability: no oscillations
• Performance: fast convergence
• Safe migration: correct results
Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows (OSDI ’18). 

Megaphone: Latency-conscious state migration for distributed streaming dataflows (VLDB’19).
github.com/strymon-system/ds2
github.com/strymon-system/megaphone
Figure: dataflow src → o1 → o2. o1 cannot keep up, so src is waiting for output and o2 is waiting for input.
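The difference between a CPU threshold heuristic and a rate-based scaling decision can be sketched as follows. This is a simplified illustration and not the DS2 implementation: it assumes a single operator, a known target input rate, and a measurement of how long an instance spent doing useful work (excluding time waiting for input or output). All names are hypothetical.

```python
import math


def threshold_policy(cpu_utilization: float, parallelism: int, limit: float = 0.8) -> int:
    """Heuristic from the slide: add one instance whenever CPU exceeds the limit."""
    return parallelism + 1 if cpu_utilization > limit else parallelism


def rate_based_parallelism(target_rate: float, processed: int, useful_seconds: float) -> int:
    """Instances needed so that aggregate capacity covers the target rate.

    processed / useful_seconds is the per-instance rate observed while the
    instance was actually working, so waiting time does not distort it.
    """
    per_instance_rate = processed / useful_seconds
    return max(1, math.ceil(target_rate / per_instance_rate))
```

The threshold policy can only move one step at a time and has no notion of how far the job is from its target. The rate-based calculation jumps to the required parallelism directly, which is what the accuracy and fast-convergence goals ask for.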
Performance analysis of
streaming dataflows is itself a
challenging streaming
computation with strict latency
requirements
Snailtrail: Generalizing critical paths for online analysis of distributed dataflows (NSDI’18).
github.com/li1/snailtrail
SIGMOD Record ’05
1. Process events online without storing them [slide annotation: persistently store events and state]
2. Support a high-level language (SQL-like) [slide annotation: Java, Scala, Python, and SQL-like]
3. Handle missing, out-of-order, delayed data [slide annotation: with tunable latency trade-offs]
4. Guarantee deterministic (on replay) and correct results (on recovery) [slide annotation: and exactly-once]
5. Combine batch and stream processing [slide annotation: batch is a special case of streaming]
6. Ensure availability despite failures [slide annotation: and exactly-once state updates]
7. Support distribution and automatic elasticity [slide annotation: accurate, stable, latency-aware]
8. Offer low latency [slide annotation: high throughput and “acceptable” latency]
In open-source software, reliability, production readiness, and community can be more important than raw performance
Figure: latency CDF for Apache Flink on Nexmark Q4, in-memory state vs. RocksDB state (x-axis: latency (ms), 0 to 10000; y-axis: CDF, 0.0 to 1.0). Annotation: serde at every access.
write-heavy, large state
read-modify-write (RMW) of a single value
globally configured store
Type-aware, flexible state
management provides up to an order
of magnitude latency improvement
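Why a store that serializes on every access hurts a read-modify-write workload can be sketched with two toy stores. This is an illustration only: both classes are hypothetical, `pickle` stands in for whatever serializer a real backend uses, and no actual Flink or RocksDB API is shown.

```python
import pickle


class InMemoryStore:
    """Keeps live Python objects; reads and writes touch them directly."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class SerializingStore:
    """Keeps bytes, as a byte-oriented key-value store would; every hit pays serde."""

    def __init__(self):
        self._data = {}
        self.serde_calls = 0

    def get(self, key, default=None):
        raw = self._data.get(key)
        if raw is None:
            return default
        self.serde_calls += 1  # deserialize on read
        return pickle.loads(raw)

    def put(self, key, value):
        self.serde_calls += 1  # serialize on write
        self._data[key] = pickle.dumps(value)


def count_events(store, keys):
    """Read-modify-write of a single value per event."""
    for key in keys:
        store.put(key, store.get(key, 0) + 1)
```

With the serializing store, each update to an existing key costs one deserialization and one serialization, even though only a single integer changes. A store chosen per state type and access pattern could avoid that cost.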
We need configurable streaming backends
New streaming state benchmarks
Beyond…
Model serving
1. Model stored externally: the stream processor sends the input stream to a Model Server over RPC and emits predictions.
2. Model stored in managed state: an operator (op) inside the stream processor holds the model and emits predictions.
• Model management and versioning
• Exactly-once guarantees?
• Latency trade-offs unclear
• What kind of state store to use?
Stateful serverless (FaaS)
• Automatic scaling
• Function orchestration
• Support for transactions
Figure: external requests, events, and function triggers invoke functions (f, λ), which produce output.
Apache Flink Stateful Functions: https://statefun.io
Stateful Functions as a Service in Action (VLDB’19)
Graph streaming & online training
Figure: workloads placed by data rate (low to high) against analytics complexity (low to high). Labels: Machine Learning, Data Mining, Streaming CEP, Relational analytics, Graph processing, Complex streaming data analytics.
Streaming Graph Partitioning: An Experimental Study (VLDB’18).
Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism (arxiv.org/abs/1912.12740).
Graph state management
Data-parallel graph synopses
Languages & operator semantics
Adaptive graph partitioning
Timeline (figure; axis marks 1992, 2000, 2002, 2004, 2013, 2015, 2020)
Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
MapReduce, S4, Storm
Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, MillWheel, Google Dataflow, Timely Dataflow
Beyond: ML, Graphs, FaaS, Edge, Modern hardware
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
data parallelism
exactly-once fault-tolerance
state management
general-purpose languages
iterations, UDFs

Contenu connexe

Tendances

Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appNeil Avery
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
Cloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsCloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsNeil Avery
 
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentMaking Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentHostedbyConfluent
 
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...HostedbyConfluent
 
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...StreamNative
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache KafkaDataStax
 
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Kai Wähner
 
Real-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsReal-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsDevin Bost
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...confluent
 
Neo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache KafkaNeo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache Kafkajexp
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
The Event Mesh: real-time, event-driven, responsive APIs and beyond
The Event Mesh: real-time, event-driven, responsive APIs and beyondThe Event Mesh: real-time, event-driven, responsive APIs and beyond
The Event Mesh: real-time, event-driven, responsive APIs and beyondSolace
 
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlow
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlowIoT Sensor Analytics with Kafka, ksqlDB and TensorFlow
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlowKai Wähner
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...HostedbyConfluent
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)Kai Wähner
 
Concepts and Patterns for Streaming Services with Kafka
Concepts and Patterns for Streaming Services with KafkaConcepts and Patterns for Streaming Services with Kafka
Concepts and Patterns for Streaming Services with KafkaQAware GmbH
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka
Best Practices for Streaming IoT Data with MQTT and Apache KafkaBest Practices for Streaming IoT Data with MQTT and Apache Kafka
Best Practices for Streaming IoT Data with MQTT and Apache KafkaKai Wähner
 

Tendances (20)

Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Cloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsCloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-events
 
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentMaking Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
 
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex...
 
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
 
Real-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsReal-World Pulsar Architectural Patterns
Real-World Pulsar Architectural Patterns
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
 
Neo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache KafkaNeo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache Kafka
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
The Event Mesh: real-time, event-driven, responsive APIs and beyond
The Event Mesh: real-time, event-driven, responsive APIs and beyondThe Event Mesh: real-time, event-driven, responsive APIs and beyond
The Event Mesh: real-time, event-driven, responsive APIs and beyond
 
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlow
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlowIoT Sensor Analytics with Kafka, ksqlDB and TensorFlow
IoT Sensor Analytics with Kafka, ksqlDB and TensorFlow
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup)
 
Concepts and Patterns for Streaming Services with Kafka
Concepts and Patterns for Streaming Services with KafkaConcepts and Patterns for Streaming Services with Kafka
Concepts and Patterns for Streaming Services with Kafka
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka
Best Practices for Streaming IoT Data with MQTT and Apache KafkaBest Practices for Streaming IoT Data with MQTT and Apache Kafka
Best Practices for Streaming IoT Data with MQTT and Apache Kafka
 

Similaire à From data stream management to distributed dataflows and beyond

Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Eugene
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology OverviewDan Lynn
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics Franco Ucci
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
[WSO2Con USA 2018] The Rise of Streaming SQL
[WSO2Con USA 2018] The Rise of Streaming SQL[WSO2Con USA 2018] The Rise of Streaming SQL
[WSO2Con USA 2018] The Rise of Streaming SQLWSO2
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?confluent
 
Circonus: Design failures - A Case Study
Circonus: Design failures - A Case StudyCirconus: Design failures - A Case Study
Circonus: Design failures - A Case StudyHeinrich Hartmann
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker seriesMonal Daxini
 
Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020b0ris_1
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a ServiceSteven Wu
 
Via Varejo taking data from legacy to a new world at Brazil Black Friday (Mar...
Via Varejo taking data from legacy to a new world at Brazil Black Friday (Mar...Via Varejo taking data from legacy to a new world at Brazil Black Friday (Mar...
Via Varejo taking data from legacy to a new world at Brazil Black Friday (Mar...confluent
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Safe Peak Technical Ppt W Product Publish
Safe Peak Technical Ppt W Product   PublishSafe Peak Technical Ppt W Product   Publish
Safe Peak Technical Ppt W Product Publishsqlserver.co.il
 

From data stream management to distributed dataflows and beyond

  • 1. From stream data management to distributed dataflows and beyond... Vasiliki (Vasia) Kalavri (vkalavri@bu.edu)
  • 2. Stream processing is an established technology in the data analytics stack of the modern business
  • 8. Traffic light adjustment in real time. Alibaba City Brain analyzes vehicle locations to: clear paths for emergency response vehicles; provide scheduling information for public transport; recommend alternative routes. Read more: https://edition.cnn.com/2019/01/15/tech/alibaba-city-brain-hangzhou/index.html
  • 9. Fault-detection for NASA’s Deep Space Network. NASA’s DSN Complex Event Processing analyzes real-time network data, predicted antenna pointing parameters, and physical hardware logs to: ingest, filter, store, and visualize all of the DSN's monitor and control data; ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions. Read more: https://www.confluent.io/kafka-summit-san-francisco-2019/mission-critical-real-time-fault-detection-for-nasas-deep-space-network-using-apache-kafka/
  • 10. • How did we get here? • Are we there yet? • What lies ahead?
  • 13. SIGMOD ’92: [… A new class of queries, continuous queries, are similar to conventional database queries, except that they are issued once and henceforth run “continually” over the database …]
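The continuous-query idea quoted above (issued once, then run "continually") can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the names `ContinuousQuery` and `AppendOnlyTable` are hypothetical, the table is assumed to be append-only, and nothing here is taken from Tapestry or any other system on the timeline.

```python
from typing import Callable, List


class ContinuousQuery:
    """A predicate that is issued once and evaluated on every later append."""

    def __init__(self, predicate: Callable[[dict], bool]) -> None:
        self.predicate = predicate
        self.results: List[dict] = []

    def on_append(self, record: dict) -> None:
        # Only the new record is inspected; earlier results are never recomputed.
        if self.predicate(record):
            self.results.append(record)


class AppendOnlyTable:
    """Hypothetical append-only table that notifies registered queries."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.queries: List[ContinuousQuery] = []

    def register(self, query: ContinuousQuery) -> ContinuousQuery:
        self.queries.append(query)
        return query

    def append(self, record: dict) -> None:
        self.rows.append(record)
        for query in self.queries:
            query.on_append(record)


table = AppendOnlyTable()
# The query is registered once, before any data arrives.
urgent = table.register(ContinuousQuery(lambda r: r["priority"] == "urgent"))
table.append({"id": 1, "priority": "low"})
table.append({"id": 2, "priority": "urgent"})
table.append({"id": 3, "priority": "urgent"})
urgent_ids = [r["id"] for r in urgent.results]
```

The result set grows as data arrives, in contrast to a conventional query that is evaluated once over the rows present at the time it is issued.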
  • 17. DSMS architecture (figure). Components: Input Manager, Scheduler, QoS Monitor, Load Shedder, Query Execution Engine, and Synopsis Maintenance (synopses for S1 … Sr). Inputs: streams S1, S2, …, Sr and ad-hoc or continuous queries Q1, Q2, …, Qm. Output: fast approximate answers.
  • 18-22. Data Stream Management Systems (timeline figure, built up over five slides; axis years 1992, 2000, 2002, 2004, 2013, 2020). Systems shown: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope, followed by MapReduce, Storm, and S4. Topics listed: operator semantics, event time & progress representations, synopses & sketches, load management, high availability, scheduling.
  • 23. λ-architecture (figure). An Input Manager feeds input data to two layers. Speed layer: a “best-effort” low-latency stream processor that gives fast approximate results. Batch layer: persistent storage plus a MapReduce / batch processing system that gives slow exact results. Applications consume the results of both layers.
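A rough Python sketch of the λ-architecture split described above: an exact but slow batch view combined with a fast, approximate speed view. The event shape, the `cutoff` parameter, and the sampling used to make the speed layer lossy are all assumptions made for illustration; they are not details from the slides.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, key); a hypothetical event shape


def batch_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: exact counts over all stored events before the cutoff."""
    return Counter(key for ts, key in events if ts < cutoff)


def speed_view(events: Iterable[Event], cutoff: int, sample_every: int = 2) -> Counter:
    """Speed layer: best-effort counts for events at or after the cutoff.

    To mimic a lossy low-latency path, only every `sample_every`-th recent
    event is processed and its weight is scaled up, so counts are estimates.
    """
    recent = [key for ts, key in events if ts >= cutoff]
    estimate: Counter = Counter()
    for key in recent[::sample_every]:
        estimate[key] += sample_every
    return estimate


def query(key: str, batch: Counter, speed: Counter) -> int:
    """Application-side merge of the two views."""
    return batch[key] + speed[key]


log = [(1, "a"), (2, "b"), (3, "a"), (10, "a"), (11, "a"), (12, "b"), (13, "a")]
cutoff = 10  # the batch layer has only processed events older than this
batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)
approx_a = query("a", batch, speed)    # fast, approximate answer
exact_a = batch_view(log, 14)["a"]     # exact answer once the batch layer catches up
```

The application sees an approximate answer immediately and the exact one only after the batch layer has reprocessed the persisted data, which is the trade-off the figure labels as fast approximate versus slow exact results.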
  • 24-25. Timeline (figure, axis extended with 2015). Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope; then MapReduce, Storm, and S4. Distributed Dataflow Systems: Spark Streaming, Naiad, Flink, Millwheel, Google Dataflow, Timely Dataflow, Samza. Topics listed: operator semantics, event time & progress representations, synopses & sketches, load management, high availability, scheduling.
  • 26. Stream processing doesn’t necessarily need to be approximate and lossy
  • 27. DDS architecture (figure). A client submits (Q, config) through the streaming APIs. Coordinator: schedule, trigger checkpoint, status. Workers: each runs several Tasks, each Task with a task state store. Distributed File System: put/get checkpoint. Inputs: event logs, socket. Connections: TCP. Output goes to applications and sinks.
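The checkpoint labels in the architecture above (trigger checkpoint, put/get checkpoint, task state store) can be made concrete with a small Python sketch of checkpoint-and-replay recovery. It is a single-task toy under stated assumptions: `CountingTask` and the `store` dictionary are hypothetical stand-ins for a stateful task and the distributed file system, and the input is assumed to be a replayable log.

```python
import copy
from typing import Dict, List


class CountingTask:
    """Hypothetical stateful task: counts words read from a replayable log."""

    def __init__(self) -> None:
        self.offset = 0                  # next log position to read
        self.counts: Dict[str, int] = {}

    def process(self, log: List[str], upto: int) -> None:
        while self.offset < upto:
            word = log[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def snapshot(self) -> dict:
        # State and input position are captured together.
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    def restore(self, snap: dict) -> None:
        self.offset = snap["offset"]
        self.counts = copy.deepcopy(snap["counts"])


log = ["a", "b", "a", "c", "a", "b"]
store: Dict[str, dict] = {}              # stands in for the distributed file system

task = CountingTask()
task.process(log, upto=3)
store["ckpt-1"] = task.snapshot()        # coordinator triggers a checkpoint (put)
task.process(log, upto=5)                # progress made after the checkpoint...

recovered = CountingTask()               # ...is lost when the worker fails
recovered.restore(store["ckpt-1"])       # get the last checkpoint
recovered.process(log, upto=len(log))    # replay from the checkpointed offset
```

Because the offset is stored with the counts, every log entry is reflected in the recovered state exactly once, even though entries 3 and 4 were processed twice overall.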
  • 28-30. Same timeline (figure, built up over three slides). Topics from Data Stream Management Systems: operator semantics, event time & progress representations, synopses & sketches, load management, high availability, scheduling. Topics added with Distributed Dataflow Systems (Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow): data parallelism, exactly-once fault-tolerance, state management, general-purpose languages, iterations, UDFs. Are we there yet?
  • 32. 1. Process events online without storing them 18 SIGMOD Record ’05
  • 33. 1. Process events online without storing them 18 SIGMOD Record ’05 persistently store events and state
  • 34. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 18 SIGMOD Record ’05 persistently store events and state
  • 35. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like
  • 36. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like
  • 37. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs
  • 38. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs
  • 39. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once
  • 40. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once
  • 41. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once batch is a special case of streaming
  • 42. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 6. Ensure availability despite failures 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once batch is a special case of streaming
  • 43. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 6. Ensure availability despite failures 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once batch is a special case of streaming and exactly-once state updates
  • 44. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 6. Ensure availability despite failures 7. Support distribution and automatic elasticity 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once batch is a special case of streaming and exactly-once state updates
  • 45. 1. Process events online without storing them 2. Support a high-level language (SQL-like) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on replay) and correct results (on recovery) 5. Combine batch and stream processing 6. Ensure availability despite failures 7. Support distribution and automatic elasticity 18 SIGMOD Record ’05 persistently store events and state Java, Scala, Python, and SQL-like with tunable latency trade-offs and exactly-once batch is a special case of streaming and exactly-once state updates
  • 46. SIGMOD Record ’05 requirements, with the slide's annotations in brackets: 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity; 8. Offer low-latency
  • 47. SIGMOD Record ’05 requirements, with the slide's annotations in brackets: 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity; 8. Offer low-latency [high throughput and “acceptable” latency]
  • 49. Some of my recent and ongoing work
  • 50. Some of my recent and ongoing work: Re-configurable Stream Processing. Automatic scaling; adaptive scheduling; straggler mitigation; query optimization. [Diagram: instrumented stream processor, Profiler, Analyzer; arrow labels: job performance metrics, decision, invoke re-configure.]
  • 51. Automatic elasticity and reconfiguration: heuristic policies (if CPU > 80% => scale); stop-and-restart migration and reconfiguration
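The heuristic policy named on this slide can be sketched as a small simulation. This is an illustrative toy, not code from the talk: `ThresholdPolicy`, `simulate`, the lower threshold of 30%, and the load model (CPU demand divided evenly across instances) are all assumptions made for the example. It shows how a one-step-at-a-time rule needs several reconfigurations to settle, each of which would be a stop-and-restart in the setting the slide describes.

```python
from dataclasses import dataclass


@dataclass
class ThresholdPolicy:
    """Hypothetical CPU-threshold scaling rule (names and defaults are illustrative)."""
    scale_up_above: float = 0.80    # the "CPU > 80% => scale" rule from the slide
    scale_down_below: float = 0.30  # assumed lower bound; not stated on the slide

    def decide(self, cpu_utilization: float, parallelism: int) -> int:
        """Return the new parallelism for one scaling decision."""
        if cpu_utilization > self.scale_up_above:
            return parallelism + 1
        if cpu_utilization < self.scale_down_below and parallelism > 1:
            return parallelism - 1
        return parallelism


def simulate(policy, offered_load, parallelism, steps):
    """Apply the policy repeatedly.

    offered_load is the total CPU demand, measured in units of
    "one fully busy instance" (an assumed, simplified load model).
    """
    history = []
    for _ in range(steps):
        cpu = min(1.0, offered_load / parallelism)  # utilization saturates at 100%
        parallelism = policy.decide(cpu, parallelism)
        history.append(parallelism)
    return history


if __name__ == "__main__":
    # Three decisions (three restarts) are needed before the job stops scaling.
    print(simulate(ThresholdPolicy(), offered_load=3.0, parallelism=1, steps=5))
```

Because utilization saturates at 100%, the rule cannot tell how far behind the job is, so it can only step toward a workable configuration one decision at a time.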
  • 52. Automatic elasticity and reconfiguration. Accuracy: no over/under-provisioning. Stability: no oscillations. Performance: fast convergence. Safe migration: correct results. Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows (OSDI ’18), github.com/strymon-system/ds2. Megaphone: Latency-conscious state migration for distributed streaming dataflows (VLDB ’19), github.com/strymon-system/megaphone. [Diagram: dataflow src, o1, o2; labels: o1 cannot keep up, waiting for output, waiting for input.]
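One way to avoid stepwise heuristics is to estimate each operator's capacity from the time it spends doing useful work, excluding time spent waiting for input or output, and to compute the required parallelism directly. The sketch below is a simplified illustration in the spirit of that rate-based idea, restricted to a linear pipeline with averaged per-instance metrics. It is not the DS2 implementation; the function names, the metric dictionary layout, and the numbers are assumptions.

```python
import math


def true_rate(records, useful_seconds):
    """Records per second of useful time (time not spent waiting on input or output)."""
    return records / useful_seconds


def estimate_parallelism(source_rate, operators):
    """Estimate instance counts for a linear pipeline.

    operators: list in pipeline order; each entry is a dict with hypothetical
    per-instance metrics over one observation window:
      name, records_in, records_out, useful_seconds.
    """
    plan = {}
    target_in = source_rate  # rate the next operator must sustain
    for op in operators:
        per_instance = true_rate(op["records_in"], op["useful_seconds"])
        plan[op["name"]] = math.ceil(target_in / per_instance)
        # Output produced per input record determines the downstream target rate.
        selectivity = op["records_out"] / op["records_in"]
        target_in = target_in * selectivity
    return plan


if __name__ == "__main__":
    metrics = [
        {"name": "o1", "records_in": 500, "records_out": 250, "useful_seconds": 2.0},
        {"name": "o2", "records_in": 300, "records_out": 300, "useful_seconds": 1.0},
    ]
    print(estimate_parallelism(source_rate=1000, operators=metrics))
```

The estimate for every operator comes out of a single pass over the metrics, which is what makes a decision of this kind fast compared with repeated threshold-triggered steps.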
  • 53. Re-configurable Stream Processing. Automatic scaling; adaptive scheduling; straggler mitigation; query optimization. [Diagram: instrumented stream processor, Profiler, Analyzer; arrow labels: job performance metrics, decision, invoke re-configure.]
  • 54. Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements. Re-configurable Stream Processing. Automatic scaling; adaptive scheduling; straggler mitigation; query optimization. [Diagram: instrumented stream processor, Profiler, Analyzer; arrow labels: job performance metrics, decision, invoke re-configure.]
  • 55. Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements. Re-configurable Stream Processing. Automatic scaling; adaptive scheduling; straggler mitigation; query optimization. [Diagram: instrumented stream processor, Profiler, Analyzer; arrow labels: job performance metrics, decision, invoke re-configure.] Snailtrail: Generalizing critical paths for online analysis of distributed dataflows (NSDI ’18), github.com/li1/snailtrail
  • 56. SIGMOD Record ’05 requirements, with the slide's annotations in brackets: 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity [accurate, stable, latency-aware]; 8. Offer low-latency [high throughput and “acceptable” latency]
  • 57. In open-source software, reliability, production readiness, and community can be more important than raw performance
  • 58. In open-source software, reliability, production readiness, and community can be more important than raw performance. [Plot: Apache Flink, Nexmark Q4, CDF of latency (ms), in-memory state vs. RocksDB state; annotation: serde at every access.]
  • 59. Write-heavy, large state; RMW (read-modify-write) of a single value; globally configured store
  • 60. Write-heavy, large state; RMW (read-modify-write) of a single value; globally configured store. Type-aware, flexible state management provides up to an order of magnitude latency improvement. We need configurable streaming backends. New streaming state benchmarks.
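The cost behind "serde at every access" for a read-modify-write of a single value can be illustrated with two toy stores. This is a minimal sketch, not Flink's or RocksDB's API: `InMemoryStore`, `SerializedStore`, and `count_events` are hypothetical names, and `pickle` stands in for whatever serializer a real backend would use.

```python
import pickle


class InMemoryStore:
    """Toy store that keeps live Python objects; no serialization on access."""

    def __init__(self):
        self._data = {}
        self.serde_ops = 0  # stays at zero for this store

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class SerializedStore:
    """Toy store that keeps bytes, as a byte-oriented key-value backend would."""

    def __init__(self):
        self._data = {}  # bytes -> bytes
        self.serde_ops = 0

    def get(self, key, default=None):
        raw = self._data.get(key.encode())
        if raw is None:
            return default
        self.serde_ops += 1  # deserialize on every read of an existing value
        return pickle.loads(raw)

    def put(self, key, value):
        self.serde_ops += 1  # serialize on every write
        self._data[key.encode()] = pickle.dumps(value)


def count_events(store, events):
    """Per-key counter: one read-modify-write of a single value per event."""
    for key in events:
        store.put(key, store.get(key, 0) + 1)
    return store


if __name__ == "__main__":
    events = ["a", "b", "a", "a"]
    print(count_events(InMemoryStore(), events).serde_ops)
    print(count_events(SerializedStore(), events).serde_ops)
```

Both stores compute the same counts; the difference is only in how much serialization work each update triggers, which is the kind of overhead a per-state-type choice of store could avoid.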
  • 62. Model serving. 1. Model stored externally: Stream Processor calls a Model Server over RPC (input stream in, predictions out). 2. Model stored in managed state: an operator (op) inside the Stream Processor (input stream in, predictions out). Model management and versioning. Exactly-once guarantees? Latency trade-offs unclear. What kind of state store to use?
  • 63. Stateful serverless (FaaS). Automatic scaling; function orchestration; support for transactions. [Diagram: external requests, events and function triggers, functions (f), output.] Apache Flink Stateful Functions: https://statefun.io. Stateful Functions as a Service in Action (VLDB ’19).
  • 64. Graph streaming & online training. [Chart: data rate vs. analytics complexity, each axis low to high; labels: Machine Learning, Data Mining, Streaming CEP, Relational analytics, Graph processing, Complex streaming data analytics.] Graph state management; data-parallel graph synopses; languages & operator semantics; adaptive graph partitioning. Streaming Graph Partitioning: An Experimental Study (VLDB ’18). Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism (arxiv.org/abs/1912.12740).
  • 65. [Timeline; axis marks: 1992, 2000, 2002, 2004, 2013, 2015, 2020.] Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope. Distributed Dataflow Systems: MapReduce, S4, Storm, Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow. Topic labels: ML; operator semantics; event time & progress representations; synopses & sketches; load management; high availability; scheduling; data parallelism; exactly-once fault-tolerance; state management; general-purpose languages; iterations; UDFs; Graphs; FaaS; Edge; Modern hardware.
  • 66. From stream data management To distributed dataflows And beyond... Vasiliki (vasia) Kalavri (vkalavri@bu.edu)