These are the slides I used for a four-hour crash course on data streaming. They cover both theory / research aspects and examples based on Apache Flink (DataStream API)
2. Agenda
• Lecture 1
• Part 1 – Introduction and basics
• Part 2 – Distributed and parallel analysis
• Lecture 2
• Part 3 – Correctness guarantees
• Part 4 – One size DOES NOT fit all in performance
4. IoT enables increased awareness, security, power-efficiency, ...
large IoT systems are complex
traditional data analysis techniques alone are not adequate!
6. AMIs (Advanced Metering Infrastructures) / VNs (Vehicular Networks)
large IoT systems are complex
Characteristics:
1. edge location
2. location awareness
3. low latency
4. geographical distribution
5. large-scale
6. support for mobility
7. real-time interactions
8. predominance of wireless
9. heterogeneous
10. interoperability / federation
11. interaction with the cloud
7. traditional data analysis techniques alone are not adequate!
1. does the infrastructure allow for billions of readings per day to be transferred continuously?
2. does the latency incurred while transferring data undermine the utility of the analysis?
3. is it secure to concentrate all the data in a single place?
4. is it smart to give away fine-grained data?
8. A small example of what fine-grained data can reveal...
source: Andrés Molina-Markham, Prashant Shenoy, Kevin Fu, Emmanuel Cecchet, and David Irwin. 2010. Private memoirs of a
smart meter. In Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys
’10). Association for Computing Machinery, New York, NY, USA, 61–66. DOI:https://doi.org/10.1145/1878431.1878446
9. a better answer: leverage the entire infrastructure!
Traditional analysis techniques cannot address all the
challenges in these setups
That’s where stream processing can make the difference!
11. Motivation: DBMS vs. DSMS
• DBMS: the data resides on disk (1); a query is submitted (2); query results are returned (3); query processing happens in main memory, on demand.
• DSMS: a continuous query resides in main memory; data flows through the query processing as it arrives, continuously producing query results.
12. Before we start... about data streaming and Stream Processing Engines (SPEs)
An incomplete, unsorted list of SPEs over time: Borealis, The Aurora Project, STREAM (STanford stREam datA Manager), NiagaraCQ, COUGAR, StreamCloud, ...
Covering all of them, or discussing which use cases fit each one best, is out of scope...
13. All documentation images / code snippets in the following are taken from: https://flink.apache.org/
14. data stream: unbounded sequence of tuples sharing the same schema

Example: vehicles' speed reports

Field          Type
vehicle id     text
time (secs)    text
speed (Km/h)   double
X coordinate   double
Y coordinate   double

Tuples over time:
A 8:00 55.5 X1 Y1
A 8:03 70.3 X2 Y2
A 8:07 34.3 X3 Y3

Let's assume each source (e.g., vehicle) produces and delivers a timestamp-sorted stream
15. continuous query (or simply query): Directed Acyclic Graph (DAG) of streams and operators
• source op: 1+ output streams
• sink op: 1+ input streams
• (regular) op: 1+ input, 1+ output streams
16. data streaming operators
Two main types:
• Stateless operators
• do not maintain any state
• one-by-one processing
• if they maintain some state, such state does not evolve depending
on the tuples being processed
• Stateful operators
• maintain a state that evolves depending on the tuples being
processed
• produce output tuples that depend on multiple input tuples
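The distinction above can be illustrated with a minimal sketch (plain Python rather than Flink; all names and data are invented): a stateless filter decides per tuple, while a stateful per-key counter keeps state that evolves with the tuples it processes.

```python
# Conceptual sketch (not Flink code): stateless vs. stateful operators.
from collections import defaultdict

def stateless_filter(stream, threshold):
    """Stateless: the decision for each tuple depends on that tuple only."""
    for vehicle, speed in stream:
        if speed > threshold:
            yield (vehicle, speed)

def stateful_count(stream):
    """Stateful: the output depends on all tuples seen so far for a key."""
    counts = defaultdict(int)          # state that evolves with the input
    for vehicle, _speed in stream:
        counts[vehicle] += 1
        yield (vehicle, counts[vehicle])

reports = [("A", 55.5), ("B", 90.0), ("A", 70.3)]
print(list(stateless_filter(reports, 60.0)))   # [('B', 90.0), ('A', 70.3)]
print(list(stateful_count(reports)))           # [('A', 1), ('B', 1), ('A', 2)]
```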
20. Wait a moment!
if streams are unbounded, how can we aggregate or join?
21. windows and stateful analysis
Stateful operations are done over windows:
• Time-based (e.g., tuples in the last 10 minutes)
• Tuple-based (e.g., the last 50 tuples)

Example of a time-based window of size 1 hour and advance 20 minutes:
[8:00,9:00), [8:20,9:20), [8:40,9:40), ...

How many tuples are in a window?
Which time period does a window span?
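Which sliding windows a given tuple falls into follows directly from the window size and advance. A small sketch (plain Python, not Flink; it assumes window boundaries aligned to multiples of the advance and a stream starting at time 0):

```python
# Conceptual sketch: which sliding windows contain a tuple's timestamp, for
# a window of SIZE seconds advancing every ADVANCE seconds?
SIZE = 3600      # 1 hour
ADVANCE = 1200   # 20 minutes

def windows_for(ts):
    """Return the [start, end) intervals of all windows containing ts."""
    last_start = (ts // ADVANCE) * ADVANCE        # latest window starting <= ts
    starts = range(last_start, ts - SIZE, -ADVANCE)  # walk back while ts < start + SIZE
    return [(s, s + SIZE) for s in starts if s >= 0]  # assumes time starts at 0

# A tuple at 8:40 (2400 seconds after 8:00) belongs to the windows starting
# at 8:40, 8:20 and 8:00:
print(windows_for(2400))   # [(2400, 6000), (1200, 4800), (0, 3600)]
```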
23. time-based sliding window joining
(figure: streams R and S, each maintaining a sliding window — WR and WS — of size WS; incoming tuples t1..t4 are matched against the opposite window given a predicate P)
24. windows and stateful analysis
See the part about Distributed and parallel analysis to understand what this is
25. basic operators and user-defined operators
Besides a set of basic operators, SPEs usually allow the user to define ad-hoc operators (e.g., when the existing aggregations are not enough)
27. sample query
For each vehicle, raise an alert if the speed of the latest report is more than 2 times higher than its average speed in the last 30 days.

Tuples over time:
A 8:00 55.5 X1 Y1
A 8:03 70.3 X2 Y2
A 8:07 34.3 X3 Y3
28. sample query
• Map — remove unused fields:
  (vehicle id, time (secs), speed (Km/h), X coordinate, Y coordinate) → (vehicle id, time (secs), speed (Km/h))
• Aggregate — compute average speed for each vehicle during the last 30 days:
  → (vehicle id, time (secs), avg speed (Km/h))
• Join — join on vehicle id:
  → (vehicle id, time (secs), avg speed (Km/h), speed (Km/h))
• Filter — check condition
29. sample query: M → A → J → F
Notice:
• the same semantics can be defined in several ways (using different operators and composing them in different ways)
• using many basic building blocks can ease the task of distributing and parallelizing the analysis
30. The same M → A → J → F pipeline can run at the edge, in parallel at each vehicle
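The sample query can be sketched end-to-end in a few lines (plain Python rather than Flink; the tuple layout, the 30-day horizon, and the "more than 2x the average" condition come from the slides, while all names and data structures here are illustrative):

```python
# Conceptual sketch of the M -> A -> J -> F pipeline (not Flink code).
from collections import defaultdict

DAYS_30 = 30 * 24 * 3600  # aggregate window, in seconds

def run_query(reports):
    """reports: timestamp-sorted (vehicle id, ts, speed, x, y) tuples."""
    history = defaultdict(list)        # Aggregate state: (ts, speed) per vehicle
    alerts = []
    for vid, ts, speed, _x, _y in reports:       # Map: drop the coordinates
        window = [e for e in history[vid] if e[0] > ts - DAYS_30]
        window.append((ts, speed))               # Aggregate: avg over last 30 days
        history[vid] = window
        avg = sum(s for _, s in window) / len(window)
        joined = (vid, ts, avg, speed)           # Join on vehicle id
        if speed > 2 * avg:                      # Filter: raise the alert
            alerts.append(joined)
    return alerts

reports = [("A", 0, 10.0, 0, 0), ("A", 10, 10.0, 0, 0),
           ("A", 20, 10.0, 0, 0), ("A", 30, 100.0, 0, 0)]
alerts = run_query(reports)
print(alerts)   # [('A', 30, 32.5, 100.0)]
```

Only the last report triggers the alert: 100 Km/h is more than twice the 32.5 Km/h average over the window.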
33. DISCLAIMER / BEFORE WE START
I am using this symbol a lot in the following. It does not necessarily represent a physical computer, but a “processing unit”: a computer, a CPU, a core in a CPU, a thread...
35. Research topics
(initially studied for centralized executions)
• Approximation
• Due to limited resources, how to approximate
results to reduce space or time complexity?
• Load Shedding
• If close to saturation, which information to discard
in order to maximize the Quality of Service?
• Operator scheduling
• In which order to run operators in order to
minimize an overhead / maximize a certain metric?
37. Distributed execution
• Load Balancing: how to distribute / place operators to nodes in order
to maximize throughput?
• Offline load balancing
• Dynamic load balancing
45. Parallel-Distributed execution
• Challenges:
  • Semantic transparency: results produced by a parallel-distributed execution must equal those of a centralized or distributed execution
  • Throughput maximization: how to maximize throughput when distributing and parallelizing data streaming operators?
52. Parallel execution
• Stateful operators: semantic awareness
• Aggregate: count within last hour, group-by vehicle id
(figure: tuples from the previous subcluster are routed by M instances to the aggregate instances Agg1, Agg2, Agg3)
53. Parallel execution
• Stateful operators: semantic awareness
• Aggregate: count within last hour, group-by vehicle id
(figure: all of Vehicle A's tuples — A 8:00 55.5 X1 Y1, A 8:03 70.3 X2 Y2, A 8:07 34.3 X3 Y3 — are routed to the same aggregate instance)
54. Parallel execution
• Depending on the stateful operator semantics:
• Partition input stream into keys
• Each key is processed by 1 thread
• # keys >> # threads/nodes
55. Parallel execution
• Depending on the stateful operator semantics:
  • Partition input stream into keys
  • Each key is processed by 1 thread
  • # keys >> # threads/nodes
(figure: keys A–F from the key domain are partitioned across the instances Agg1, Agg2, Agg3)
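Key partitioning can be sketched in a few lines (plain Python, not Flink; the modulo-of-byte-sum hash and instance count are toy choices): every tuple of a given key is routed deterministically to the same instance, while the key domain stays much larger than the number of instances.

```python
# Conceptual sketch of key partitioning for a keyed stateful operator.
N_INSTANCES = 3  # e.g., Agg1, Agg2, Agg3

def instance_for(key):
    """Deterministically map a key (e.g., vehicle id) to one instance."""
    return sum(key.encode()) % N_INSTANCES  # toy hash; real systems use stronger ones

keys = ["A", "B", "C", "D", "E", "F"]
print({k: instance_for(k) for k in keys})
# every occurrence of the same key lands on the same instance:
assert instance_for("A") == instance_for("A")
```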
67. Parallel execution of streaming operators
• The general approach, referred to as shared-nothing, works for both parallel and distributed execution
• We use threads in the example, but they could be processes or even nodes...
(figure: the OP1 → OP2 pipeline replicated across Thread 1 ... Thread 30, or equivalently across Node 1 ... Node 30)
71. Elastic execution
• Key for Cloud environments
• Elasticity comes in various forms:
  • variable number of threads
  • variable number of nodes, scaling with queries
  • variable number of nodes, scaling with operators
72. Why is it complicated?
• State must be transferred between nodes
• A load balancing algorithm is needed
• The minimum parallelization unit is a key
• Keys must be transferred between instances
73. Elastic execution
• State transfer is challenging for stateful operators
(figure: a time axis with instances A and B, and tuples referring to car A)
74. Elastic execution
• Window Recreation Protocol
(figure: along the time axis, tuples are first sent to A; after the reconfiguration point they are sent to B, which rebuilds its windows from the incoming tuples)
+ no communication between nodes
− completion time proportional to window size
75. Elastic execution
• State Recreation Protocol
(figure: A copies its state to B, and subsequent tuples are sent to B)
+ minimizes completion time
− communication between nodes
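The +/− tradeoffs of the two protocols can be captured in a toy sketch (plain Python; all names are invented and "state" is reduced to the list of tuples in the current window):

```python
# Toy sketch (not from the slides): moving responsibility for a key from
# instance a to instance b under the two reconfiguration protocols.

def window_recreation(a_state):
    """No state is copied: b starts empty and must observe a full window of
    new tuples before its results are complete (completion time
    proportional to the window size)."""
    b_state = []                 # + no communication between nodes
    return b_state

def state_recreation(a_state):
    """a copies its window state to b: b is complete immediately
    (minimal completion time), at the cost of inter-node communication."""
    b_state = list(a_state)      # - state shipped over the network
    return b_state

current_window = [("A", 8.00), ("A", 8.03), ("A", 8.07)]
print(window_recreation(current_window))  # []
print(state_recreation(current_window))   # a full copy of the window
```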
78. Correct execution
• What does correct execution mean in the context of SPEs?
• Many definitions, some more formal than others

StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, Patrick Valduriez. IEEE Transactions on Parallel and Distributed Systems (TPDS).
Viper: A Module for Communication-Layer Determinism and Scaling in Low-Latency Stream Processing. Ivan Walulya, Dimitris Palyvos-Giannas, Yiannis Nikolakopoulos, Vincenzo Gulisano, Marina Papatriantafilou, and Philippas Tsigas. Elsevier Future Generation Computer Systems. 2018.
79. Correct execution
A query that simply discards all data, always, would most likely be in accordance with the previous definitions. Would you say its analysis is correct?
In general (but still in a vague form):
Correctness implies each tuple is processed according to the analysis' semantics (according to its stateless and stateful operators) and is not affected by execution- and implementation-related aspects
80. Why is it complicated?
• It depends on a mix of external / internal factors.
• We are going to look at some examples
81. Unsorted input streams
(figure: a count over the window [8:00,9:00); tuples with timestamps 8:05, 8:15, 8:22, 8:45 increment the counter to 1, 2, 3, 4; when the tuple with timestamp 9:05 arrives, Output: 4 is produced and the window slides to [8:20,9:20))
What if the tuple with timestamp 8:45 arrives after the one with timestamp 9:05 and we have already produced the result? (the counter would still be 3 when the result is emitted)
82. Unsorted input streams – possible solutions
• Sort at the source, but...
  • is it always possible?
  • how late can a single tuple be?
  • what about latency?
• Allow for late arrivals and withhold (temporary) results or produce correction tuples, but...
  • this in turn creates disordered output streams!
83. Sorted input streams + distributed/parallel execution
(figure: parallel M instances feed downstream A instances; each M emits a timestamp-sorted stream — A 8:00 55.5 X1 Y1, A 8:03 70.3 X2 Y2, A 8:07 34.3 X3 Y3 — but their interleaving at A need not be sorted)
What if the tuple with timestamp 8:00 arrives after the tuple with timestamp 8:07?
84. Sorted input streams + distributed/parallel execution – possible solutions
• merge-sort (e.g., based on the timestamp) input streams deterministically
  • What if a stream is extremely slow / missing tuples?
  • What is the sorting overhead?
85. A general solution for order-insensitive analysis: watermarks
• Notice:
  • If the analysis is order-insensitive, the result for a given window is correct as long as all tuples contributing to it are taken into account. The order does not really matter
  • Keep processing tuples even if they are not in order
  • Produce the result for a certain window only when sure future input tuples will no longer contribute to such window
• Key idea:
  • let tuples be disordered (from the source or because of parallel execution)
  • include special tuples (watermarks) that are sorted and distinguish timestamps before / after them
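The key idea can be sketched concretely (plain Python, not Flink's actual API; tumbling windows and a per-window count are simplifying assumptions): tuples may arrive out of order, but a window [start, end) is only emitted once a watermark ≥ end has been seen.

```python
# Conceptual sketch of watermark-driven window emission.
SIZE = 3600  # tumbling 1-hour windows, for simplicity

def process(events):
    """events: list of ("tuple", ts) and ("watermark", ts) entries."""
    counts, results = {}, []
    for kind, ts in events:
        if kind == "tuple":
            start = (ts // SIZE) * SIZE            # window the tuple belongs to
            counts[start] = counts.get(start, 0) + 1
        else:  # watermark: emit every window that can no longer change
            for start in sorted(w for w in counts if w + SIZE <= ts):
                results.append((start, counts.pop(start)))
    return results

# Tuples arrive out of order (8:45 after 9:05), yet the count for
# [8:00, 9:00) is emitted only at the watermark and is still correct:
events = [("tuple", 8*3600+300), ("tuple", 8*3600+900), ("tuple", 9*3600+300),
          ("tuple", 8*3600+2700), ("watermark", 9*3600+600)]
print(process(events))   # [(28800, 3)]
```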
89. Example
The notion of time is based on the timestamps carried by the tuples themselves. The (Flink) code snippet specifies timestamps and watermarks; watermarks are used to decide when the result for a certain window can be produced.
90. Part 4 – One size DOES NOT fit all in performance
93. Recap on stream joins
Data stream: unbounded sequence of tuples
(figure: streams R and S, each maintaining a sliding window — WR and WS — of size WS; incoming tuples t1..t4 are matched against the opposite window given a predicate P)
94. Why parallel stream joins?
• WS = 600 seconds
• R receives 500 tuples/second
• S receives 500 tuples/second
• WR will contain 300,000 tuples
• WS will contain 300,000 tuples
• Each new tuple from R gets compared with all the tuples in WS
• Each new tuple from S gets compared with all the tuples in WR
... 300,000,000 comparisons/second!
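The slide's numbers can be checked with a few lines of arithmetic:

```python
# Quick check of the slide's back-of-the-envelope numbers: with a
# 600-second window and 500 tuples/second per stream, each window holds
# 300,000 tuples, and every new tuple scans the opposite window.
ws_seconds = 600
rate_r = rate_s = 500                    # tuples/second per stream

window_r = ws_seconds * rate_r           # tuples held in WR
window_s = ws_seconds * rate_s           # tuples held in WS
comparisons = rate_r * window_s + rate_s * window_r
print(window_r, comparisons)             # 300000 300000000
```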
95. Which are the challenges of a parallel stream join?
• Scalability
• High throughput
• Low latency
• Disjoint parallelism
• Skew resilience
• Determinism
96. The 3-step procedure (sequential stream join)
For each incoming tuple t:
1. compare t with all tuples in the opposite window given predicate P
2. add t to its window
3. remove stale tuples from t's window
(figure: producers ProdR and ProdS add tuples to R and S; a single processing unit PU runs the 3 steps; a consumer Cons consumes the results)
We assume each producer delivers tuples in timestamp order
97. The 3-step procedure, is it enough?
(challenges recap: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism)
(figure: streams R and S with windows WR and WS; with tuples t3 and t4 in flight, the outcome depends on which one the PU happens to process first)
98. Enforcing determinism in sequential stream joins
• Next tuple to process = earliest(tS, tR)
• The earliest(tS, tR) tuple is referred to as the next ready tuple
• Process ready tuples in timestamp order → Determinism
99. Deterministic 3-step procedure
Pick the next ready tuple t:
1. compare t with all tuples in the opposite window given predicate P
2. add t to its window
3. remove stale tuples from t's window
(figure: ProdR and ProdS add tuples to R and S; the PU picks ready tuples; Cons consumes the results)
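The deterministic 3-step procedure can be sketched as follows (plain Python, not ScaleJoin's actual code; WS, the data, and the window-constraint check in step 1 are illustrative choices). Because the PU always picks the next ready tuple — the earliest head of the two timestamp-sorted inputs — the result does not depend on arrival interleavings.

```python
# Conceptual sketch of a deterministic sequential stream join.
WS = 10  # window size, in the tuples' time units

def stream_join(r, s, predicate):
    """r, s: timestamp-sorted lists of (ts, value); returns matched pairs."""
    win = {"R": [], "S": []}
    results, i, j = [], 0, 0
    while i < len(r) or j < len(s):
        # next ready tuple = earliest of the two queue heads
        take_r = j >= len(s) or (i < len(r) and r[i][0] <= s[j][0])
        own, opp = ("R", "S") if take_r else ("S", "R")
        t = r[i] if take_r else s[j]
        i, j = (i + 1, j) if take_r else (i, j + 1)
        # 1. compare t with all (non-stale) tuples in the opposite window
        results += [(t, u) for u in win[opp]
                    if t[0] - u[0] <= WS and predicate(t, u)]
        # 2. add t to its window
        win[own].append(t)
        # 3. remove stale tuples from t's window
        win[own] = [u for u in win[own] if u[0] > t[0] - WS]
    return results

r = [(1, "a"), (5, "b")]
s = [(3, "a"), (20, "a")]
out = stream_join(r, s, lambda t, u: t[1] == u[1])
print(out)   # [((3, 'a'), (1, 'a'))]: the late (20, 'a') is outside the window
```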
100. Shared-nothing parallel stream join (state-of-the-art)
(figure: ProdR and ProdS each choose a PU and add the tuple to PUi's R or S; PU1..PUN each run the 3 steps; a merge step takes the next ready output tuple before Cons consumes the results)
Each PU picks its next ready tuple t:
1. compare t with all tuples in the opposite window given P
2. add t to its window
3. remove stale tuples from t's window
(challenges recap: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism)
102. From coarse-grained to fine-grained synchronization
(figure: ProdR and ProdS feed PU1..PUN, which feed Cons; the coarse-grained queues between the threads are to be replaced with a finer-grained shared data structure)
103. ScaleGate
addTuple(tuple, sourceID): allows a tuple from sourceID to be merged by ScaleGate in the resulting timestamp-sorted stream of ready tuples.
getNextReadyTuple(readerID): provides to readerID the next earliest ready tuple that has not yet been consumed by the former.
https://github.com/dcs-chalmers/ScaleGate_Java
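The semantics of the two operations can be mimicked in a single-threaded toy (plain Python; the real ScaleGate at dcs-chalmers/ScaleGate_Java is a concurrent, lock-free Java data structure — only the ready-tuple behavior is sketched here):

```python
# Toy, single-threaded imitation of ScaleGate's interface.
import heapq

class ToyScaleGate:
    def __init__(self, source_ids):
        self.pending = {s: [] for s in source_ids}  # per-source FIFO of (ts, tuple)
        self.ready = []                             # merged, timestamp-sorted heap

    def add_tuple(self, ts, tup, source_id):
        self.pending[source_id].append((ts, tup))
        # A tuple becomes ready once it is the earliest pending head and every
        # source has delivered something (no earlier tuple can still arrive).
        while all(self.pending.values()):
            src = min(self.pending, key=lambda s: self.pending[s][0][0])
            heapq.heappush(self.ready, self.pending[src].pop(0))

    def get_next_ready_tuple(self):
        return heapq.heappop(self.ready) if self.ready else None

sg = ToyScaleGate(["R", "S"])
sg.add_tuple(5, "a", "R")
print(sg.get_next_ready_tuple())  # None: S might still deliver a tuple with ts < 5
sg.add_tuple(3, "b", "S")
print(sg.get_next_ready_tuple())  # (3, 'b') is now the earliest ready tuple
```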
104. ScaleGate Anatomy (1)
• Inspired by lock-free skip lists
  • randomized height of nodes
  • expected cost for search/insertion O(log N)
  • search by traversing from higher to lower levels
105. ScaleGate Anatomy (2)
• Reader-local view of "head" (also holding the minimum timestamp for that reader)
• Flagging mechanism:
  • if "head" is not flagged, it can be safely returned
  • the last written tuple of each source is flagged
• Nodes are free to be garbage-collected after every reader passes (almost...)
106. ScaleJoin
(figure: ProdR and ProdS add tuples to SGin; PU1..PUN each get ready tuples from SGin and add results to SGout; Cons gets the next ready output tuple from SGout)
Steps for each PU — get the next ready input tuple t from SGin:
1. compare t with all tuples in the opposite window given P
2. add t to its window in a round-robin fashion
3. remove stale tuples from t's window
108. ScaleJoin
(figure: the same architecture, now with multiple physical producers per stream — several ProdR and ProdS instances — all adding tuples to SGin; PU1..PUN each get the next ready input tuple from SGin, run the 3 steps — compare with the opposite window given P, add to its window in a round-robin fashion, remove stale tuples — and add results to SGout; Cons gets the next ready output tuple from SGout)
This design addresses the challenges: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism
113. Humans
Millions of years of evolution; millions of sensors
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions ("Should I (really) have an extra piece of cake?")
vs.
• "Hard-wired" routines
• Real-time decisions
• High-throughput / low-latency ("Danger!!! Run!!!")
114. Computers (cyber-physical / IoT systems)
Years / decades of evolution; millions of sensors
• Store information, iterate multiple times over data, think, do not rush through decisions → databases, data mining techniques... ("What traffic congestion patterns can I observe frequently?")
• Continuous analysis, real-time decisions, high-throughput / low-latency → data streaming, distributed and parallel analysis ("Don't take over, car in opposite lane!")
116. Bibliography (with some labels)
1. Zhou, Jiazhen, Rose Qingyang Hu, and Yi Qian. "Scalable distributed communication architectures to support advanced metering infrastructure in smart grid." IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1632-1642. [AMIs]
2. Gulisano, Vincenzo, et al. "BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures." Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security. ACM, 2016. [AMIs / Privacy]
3. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "Online and scalable data validation in advanced metering infrastructures." IEEE PES Innovative Smart Grid Technologies, Europe. IEEE, 2014. [AMIs / Data validation]
4. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "METIS: a two-tier intrusion detection system for advanced metering infrastructures." International Conference on Security and Privacy in Communication Systems. Springer International Publishing, 2014. [AMIs / Intrusion detection]
5. Yousefi, Saleh, Mahmoud Siadat Mousavi, and Mahmood Fathy. "Vehicular ad hoc networks (VANETs): challenges and perspectives." 2006 6th International Conference on ITS Telecommunications. IEEE, 2006. [VNs]
6. El Zarki, Magda, et al. "Security issues in a future vehicular network." European Wireless. Vol. 2. 2002. [VNs]
7. Georgiadis, Giorgos, and Marina Papatriantafilou. "Dealing with storage without forecasts in smart grids: Problem transformation and online scheduling algorithm." Proceedings of the 29th Annual ACM Symposium on Applied Computing. ACM, 2014. [Smart Grid]
8. Fu, Zhang, et al. "Online temporal-spatial analysis for detection of critical events in Cyber-Physical Systems." Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014. [Smart Grid / Anomaly detection]
117. Bibliography
9. Arasu, Arvind, et al. "Linear road: a stream data management benchmark." Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. VLDB Endowment, 2004. [VNs / Benchmark]
10. Lv, Yisheng, et al. "Traffic flow prediction with big data: a deep learning approach." IEEE Transactions on Intelligent Transportation Systems 16.2 (2015): 865-873. [VNs / ML-NN]
11. Grochocki, David, et al. "AMI threats, intrusion detection requirements and deployment recommendations." Smart Grid Communications (SmartGridComm), 2012 IEEE Third International Conference on. IEEE, 2012. [AMIs]
12. Molina-Markham, Andrés, et al. "Private memoirs of a smart meter." Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building. ACM, 2010. [AMIs]
13. Gulisano, Vincenzo, et al. "StreamCloud: A large scale data streaming system." Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE, 2010. [Parallel/Distr. SPE]
14. Stonebraker, Michael, Uğur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream processing." ACM SIGMOD Record 34.4 (2005): 42-47. [Streaming basics]
15. Bonomi, Flavio, et al. "Fog computing and its role in the internet of things." Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing. ACM, 2012. [Fog architectures]
16. Himmelsbach, Michael, et al. "LIDAR-based 3D object perception." Proceedings of the 1st International Workshop on Cognition for Technical Systems. Vol. 1. 2008. [Lidar sensor]
118. Bibliography
17. Geiger, Andreas, et al. "Vision meets robotics: The KITTI dataset." The International Journal of Robotics Research (2013): 0278364913491297. [Dataset / Benchmark for VNs]
18. Gulisano, Vincenzo Massimiliano. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica, 2012. [Parallel/Distr. SPE]
19. Cardellini, Valeria, et al. "Optimal operator placement for distributed stream processing applications." Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016. [Scheduling / Operator placement]
20. Costache, Stefania, et al. "Understanding the Data-Processing Challenges in Intelligent Vehicular Systems." Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV16). [Streaming in VNs]
21. Cormode, Graham. "The continuous distributed monitoring model." ACM SIGMOD Record 42.1 (2013): 5-14. [Streaming basics / Approximation]
22. Giatrakos, Nikos, Antonios Deligiannakis, and Minos Garofalakis. "Scalable Approximate Query Tracking over Highly Distributed Data Streams." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016. [Approximation]
23. Gulisano, Vincenzo, et al. "StreamCloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and Distributed Systems 23.12 (2012): 2351-2365. [Parallel/Distr. SPE]
24. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." Data Engineering, 2003. Proceedings. 19th International Conference on. IEEE, 2003. [Parallel/Distr. Streaming analysis]
119. Bibliography
25. Cederman, Daniel, et al. "Brief announcement: concurrent data structures for efficient streaming aggregation." Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2014. [Shared-memory parallel streaming]
26. Ji, Yuanzhen, et al. "Quality-driven processing of sliding window aggregates over out-of-order data streams." Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015. [Out-of-order streams]
27. Ji, Yuanzhen, et al. "Quality-driven disorder handling for concurrent windowed stream queries with shared operators." Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016. [Out-of-order streams]
28. Gulisano, Vincenzo, et al. "ScaleJoin: A deterministic, disjoint-parallel and skew-resilient stream join." Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015. [Shared-memory stream joins]
29. Ottenwälder, Beate, et al. "MigCEP: operator migration for mobility driven distributed complex event processing." Proceedings of the 7th ACM International Conference on Distributed Event-based Systems. ACM, 2013. [Load balancing]
30. De Matteis, Tiziano, and Gabriele Mencagli. "Keep calm and react with foresight: strategies for low-latency and energy-efficient elastic data stream processing." Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2016. [Elasticity]
31. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." ACM Transactions on Database Systems (TODS) 33.1 (2008): 3. [Fault tolerance]
32. Castro Fernandez, Raul, et al. "Integrating scale out and fault tolerance in stream processing using operator state management." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013. [Fault tolerance / Parallel streaming]
Please ask questions / interact the way you prefer
and for vehicles, a seat sensor might tell you whether the driver is pregnant...
this is not an exhaustive discussion...
I like to be hands-on...
say these are some examples...
We will come back to the ordering later
Make the computation at the blackboard
…
So, let’s see how stream joins are actually implemented
Example
But wait, what happens if we get tuples in another order?
Mmmmhhh
Ok, so deterministic 3-step procedure looks like this
Now let’s try to parallelize this, and let’s do it as it has been done before
First, we need to do more operations; this affects the latency for sure
But what's worse is that we introduce a new bottleneck: the output thread and its ready tuples
And this actually breaks disjoint parallelism too…
Finally, it is not really skew-resilient
So, what’s the problem? Are we doing it in the wrong way or are we forgetting something?
Look at the data structures, we parallelize by parallelizing the computation, but what about the communication?
The queues!
We parallelized the computation, but overlooked the communication
We are still using a queue with its methods enqueue and dequeue
Let's be creative: let's assume they actually share something more powerful, that lets them communicate and synchronize in a more efficient way
What do we want from such communication and synchronization ds?
Then we can do something like that…
Here we can basically discuss why this addresses the different challenges, one by one…
It gets even better, you can even have multiple physical producers for S and R!!!! And this is actually important because in the real world it will be like that!
Here we want to check the number of comparisons per second sustained by ScaleJoin
After measuring the rate obtained by a single thread, we computed the expected maximum and then the observed rates for 3 different window sizes
As you can see…
Up to 4 billion comparisons/second!
This is the processing latency we get (in milliseconds)
As you can see, even with 48 PUs (and notice that this means more threads than cores, since we also have injectors and receivers…) latency stays below 60 milliseconds – actually, when we do not spawn too many threads, we are talking about 30 milliseconds
It might seem counterintuitive that latency grows with the number of PUs, but that's because of determinism!
maybe say the DBMS part is a bit of an oversimplification