Data Streaming (in a Nutshell) ... and Spark’s window operations
Vincenzo Gulisano, Ph.D.
Chalmers University of Technology
Agenda
• Who am I?
• Introduction
– Motivation
– System Model
• Spark’s window operations
• References
https://vincenzogulisano.com/
Assistant Professor
Distributed Computing and Systems Research Group
Department of Computer Science and Engineering
Chalmers University of Technology
At our research team: research expertise & projects
• Cyber Security
• Efficient parallel & stream computing
• Distributed systems
• IoT & Sensor Networks
Motivation
• Since the year 2000, applications such as:
– Sensor networks
– Network Traffic Analysis
– Financial tickers
– Transaction Log Analysis
– Fraud Detection
• Require:
– Continuous processing of data streams
– In a real-time fashion
Motivation
• Relying 100% on store and process (i.e., DBs) is not feasible
– high-speed networks, nanoseconds to handle a packet
– ISP router: gigabytes of headers every hour,…
• Data Streaming:
– In memory
– Bounded resources
– Efficient one-pass analysis
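Motivation
• DBMS vs. DSMS
– DBMS: (1) data is stored on disk, (2) a query is issued, (3) query processing in main memory returns the query results
– DSMS: a continuous query resides in main memory; data flows through query processing, which outputs the query results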
What about
?
Stonebraker, Michael, Uǧur Çetintemel and Stan Zdonik. The 8 requirements of real-time stream processing. (2005)
1. Keep the data moving
2. Query interface, e.g., extended SQL
3. Handle imperfections
4. Generate predictable outcomes
5. Integrate stored and streaming data
6. Guarantee data safety and availability
7. Partition and scale applications automatically
8. Process and respond instantaneously
System Model
• Data Stream: unbounded sequence of tuples
– Example: Call Description Record (CDR)

Field        Type
Caller       text
Callee       text
Time (secs)  int
Price (€)    double

Tuples over time: (A, B, 8:00, 3), (C, D, 8:20, 7), (A, E, 8:35, 6)
System Model
• Operators:
– Stateless: 1 input tuple → 1 output tuple
– Stateful: 1+ input tuple(s) → 1 output tuple
System Model
Stateless Operators
• Map: transform the tuples' schema
– Example: convert price € → $
• Filter: discard / route tuples
– Example: route depending on price
• Union: merge multiple streams (sharing the same schema)
– Example: merge CDRs from different sources
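The three stateless operators above can be sketched as plain Python generators. This is an illustrative sketch, not the API of any streaming engine: the names `CDR`, `map_op`, `filter_op`, `union_op` and the rate `EUR_TO_USD` are hypothetical, the time field is treated as a timestamp, and Union assumes each input stream is already sorted by time.

```python
import heapq
from typing import Callable, Iterable, Iterator, NamedTuple


class CDR(NamedTuple):
    caller: str
    callee: str
    time: int      # treated here as a timestamp in seconds (assumption)
    price: float


EUR_TO_USD = 1.25  # hypothetical exchange rate, for illustration only


def map_op(stream: Iterable[CDR], fn: Callable[[CDR], CDR]) -> Iterator[CDR]:
    """Map: one output tuple per input tuple, no state kept between tuples."""
    for t in stream:
        yield fn(t)


def filter_op(stream: Iterable[CDR], keep: Callable[[CDR], bool]) -> Iterator[CDR]:
    """Filter: forward a tuple unchanged, or discard it."""
    for t in stream:
        if keep(t):
            yield t


def union_op(*streams: Iterable[CDR]) -> Iterator[CDR]:
    """Union: merge same-schema streams, each assumed sorted by time."""
    return heapq.merge(*streams, key=lambda t: t.time)


source_a = [CDR("A", "B", 0, 3.0), CDR("A", "E", 2100, 6.0)]
source_b = [CDR("C", "D", 1200, 7.0)]

merged = union_op(source_a, source_b)
in_usd = map_op(merged, lambda t: t._replace(price=t.price * EUR_TO_USD))
expensive = list(filter_op(in_usd, lambda t: t.price > 5.0))
```

Because each operator looks at one tuple at a time, none of them needs a window or any other bounded-memory mechanism.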
System Model
Stateful Operators
• Aggregate: compute aggregate functions (group-by)
– Example: compute avg. call duration
• Join: match tuples from 2 streams (equality predicate)
– Example: match CDRs with prices in the same range
System Model
• Continuous Query: graph of operators/streams
– Map: convert € → $
– Filter: only > 10$
– Agg: count calls made by each Caller number
• Schemas along the query:
– Input: Caller, Callee, Time (secs), Price (€)
– After Map: Caller, Callee, Time (secs), Price ($)
– After Filter: Caller, Callee, Time (secs), Price ($)
– After Agg: Caller, Calls, Time (secs)
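The Map, Filter, Agg query above can be sketched by chaining generators. The sketch makes several assumptions not stated in the slides: tuples are plain `(caller, callee, time, price)` Python tuples, the exchange rate `EUR_TO_USD` is made up, the aggregate uses non-overlapping one-hour windows, and the input is ordered by time.

```python
from collections import Counter

EUR_TO_USD = 1.25   # hypothetical exchange rate
WINDOW = 3600       # assumed window size, in seconds


def to_usd(stream):
    """Map: convert the price field from euros to dollars."""
    for caller, callee, time, price in stream:
        yield caller, callee, time, price * EUR_TO_USD


def only_above(stream, threshold):
    """Filter: keep only the tuples whose price exceeds the threshold."""
    for t in stream:
        if t[3] > threshold:
            yield t


def count_calls(stream, window=WINDOW):
    """Agg: emit (caller, calls, window_start) per non-overlapping window."""
    current_start = None
    counts = Counter()
    for caller, _callee, time, _price in stream:
        start = time - time % window
        if current_start is not None and start != current_start:
            # A tuple from a later window arrived: emit and reset the state.
            for c, n in sorted(counts.items()):
                yield c, n, current_start
            counts.clear()
        current_start = start
        counts[caller] += 1
    # The input here is finite, so the last open window is flushed as well.
    for c, n in sorted(counts.items()):
        yield c, n, current_start


calls = [("A", "B", 0, 12.0), ("C", "D", 1200, 4.0),
         ("A", "E", 2100, 9.0), ("C", "B", 4000, 20.0)]
result = list(count_calls(only_above(to_usd(calls), 10.0)))
```

On a truly unbounded stream the final flush never runs; a window's result is only emitted once a tuple from a later window shows up.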
System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: 1 hour windows
– [8:00,9:00), [8:20,9:20), [8:40,9:40), ...
• Example: count tuples - 1 hour windows
– Tuples at 8:05, 8:15, 8:22, 8:45, 9:05
– Window [8:00,9:00): Output: 4
– Next window: [8:20,9:20)
• What about out-of-order tuples?
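The counting example above can be reproduced with a few lines of Python. This is a minimal sketch assuming 1-hour windows that advance every 20 minutes and start at multiples of the advance; `hm` and `window_starts` are hypothetical helper names, and times are expressed as minutes since midnight.

```python
from collections import Counter


def hm(hours, minutes):
    """Clock time as minutes since midnight."""
    return hours * 60 + minutes


def window_starts(ts, size, advance):
    """Starts of all windows [start, start + size) that contain ts.

    Window starts are assumed to be multiples of `advance`.
    """
    starts = []
    start = ts - ts % advance      # latest window starting at or before ts
    while start > ts - size:       # this window still covers ts
        starts.append(start)
        start -= advance
    return sorted(starts)


tuples = [hm(8, 5), hm(8, 15), hm(8, 22), hm(8, 45), hm(9, 5)]
counts = Counter()
for ts in tuples:
    for start in window_starts(ts, size=60, advance=20):
        counts[start] += 1
```

Here `counts[hm(8, 0)]` is 4, matching the output on the slide, and `counts[hm(8, 20)]` is 3 for the tuples seen so far. Note that each tuple falls into size / advance = 3 windows.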
Spark’s window operations
(source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)
Java:

// Reduce function adding two integers, defined separately for clarity
Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() {
  @Override
  public Integer call(Integer i1, Integer i2) {
    return i1 + i2;
  }
};

// Reduce last 30 seconds of data, every 10 seconds
JavaPairDStream<String, Integer> windowedWordCounts =
    pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10));

Python:

# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
countByWindow(windowLength, slideInterval)
    Return a sliding window count of elements in the stream.

reduceByWindow(func, windowLength, slideInterval)
    Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
    When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...]

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
    A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]
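The incremental variant with an inverse function can be illustrated without Spark. The sketch below is a plain-Python simulation of the idea, not Spark's implementation: `windowed_reduce` is a hypothetical name, each batch is assumed to be a dict already reduced per key, and the window length is counted in batches with a slide of one batch.

```python
from collections import deque


def windowed_reduce(batches, func, inv_func, window_len):
    """Yield one {key: value} dict per batch, covering the last window_len batches."""
    in_window = deque()   # batches currently inside the window
    state = {}            # running reduce value per key
    for batch in batches:
        # "Reduce" the new data that enters the window.
        for key, value in batch.items():
            state[key] = func(state[key], value) if key in state else value
        in_window.append(batch)
        # "Inverse reduce" the old data that leaves the window.
        if len(in_window) > window_len:
            for key, value in in_window.popleft().items():
                state[key] = inv_func(state[key], value)
        yield dict(state)


batches = [{"a": 1, "b": 2}, {"a": 3}, {"b": 1}, {"a": 2}]
results = list(windowed_reduce(batches,
                               lambda x, y: x + y,   # reduce: add counts
                               lambda x, y: x - y,   # inverse: subtract counts
                               window_len=3))
```

Each slide touches only the batch entering and the batch leaving, instead of re-reducing every batch in the window. In this sketch a key whose value drops to zero stays in the state; dropping such keys would need an extra filtering step.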
Maintaining tuples or windows?
Maintain tuples
• Windows [8:00,9:00), [8:20,9:20); tuples at 8:05, 8:15, 8:22, 8:45, 9:05
• When the window shifts:
1. Remove contribution of stale tuples
2. Go on adding new incoming tuples
• Need to maintain a single window instance
• Need to maintain all the tuples (how many?)
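A minimal sketch of the "maintain tuples" strategy for a count aggregate follows. It assumes timestamps arrive in non-decreasing order and, as a simplification, starts from the window aligned at or before the first tuple; the function name and the use of minutes since midnight are illustrative choices.

```python
from collections import deque


def count_maintaining_tuples(timestamps, size, advance):
    """Yield (window_start, count) each time the single window instance shifts."""
    buffered = deque()    # every tuple still inside the current window
    start = None
    for ts in timestamps:
        if start is None:
            start = ts - ts % advance      # align the first window
        while ts >= start + size:          # tuple lies past the window's end
            yield start, len(buffered)     # emit the current window's result
            start += advance               # shift the window
            while buffered and buffered[0] < start:
                buffered.popleft()         # remove stale tuples
        buffered.append(ts)                # go on adding incoming tuples


# 8:05, 8:15, 8:22, 8:45, 9:05 and 10:05 as minutes since midnight
stream = [485, 495, 502, 525, 545, 605]
closed = list(count_maintaining_tuples(stream, size=60, advance=20))
```

The memory cost is the buffer: it holds every tuple of the current window, so its size grows with the input rate and the window size.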
Maintaining tuples or windows?
Maintain windows
• [8:00,9:00) – 3 (so far...), [8:20,9:20) – 1 (so far...); tuples at 8:05, 8:15, 8:22, 8:45, 9:05
• When a tuple arrives:
1. Add its contribution to all the windows it falls in
• No need to maintain tuples
• Need to maintain all windows to which each tuple contributes
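The "maintain windows" strategy can be sketched the same way for a count aggregate. Assumptions: timestamps are non-decreasing, window starts are multiples of the advance, and a window is closed as soon as a tuple at or past its end arrives; the function name is illustrative.

```python
def count_maintaining_windows(timestamps, size, advance):
    """Yield (window_start, count) as windows close; no tuple is stored."""
    open_windows = {}     # window start -> running count
    for ts in timestamps:
        # Close every window that ends at or before this tuple's timestamp.
        for start in sorted(s for s in open_windows if s + size <= ts):
            yield start, open_windows.pop(start)
        # Add the tuple's contribution to all the windows it falls in.
        start = ts - ts % advance
        while start > ts - size:
            open_windows[start] = open_windows.get(start, 0) + 1
            start -= advance


# 8:05, 8:15, 8:22, 8:45, 9:05 and 10:05 as minutes since midnight
stream = [485, 495, 502, 525, 545, 605]
closed = list(count_maintaining_windows(stream, size=60, advance=20))
```

Unlike the slide, which draws only the windows from 8:00 onward, this sketch also opens the earlier windows starting at 7:20 and 7:40 that cover the first tuples. The state is one counter per open window, at most size / advance counters per timestamp, regardless of how many tuples arrive.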
References (non exhaustive list)
Bed time reading about Data Streaming
1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream
Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012.
Shared-nothing parallelism / Elasticity
1. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano,
Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th
International Conference on Distributed Computing Systems (ICDCS) 2010
2. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo
Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente,
Patrick Valduriez. IEEE Transactions on Parallel and Distributed Systems
(TPDS)
Shared-memory parallelism / fine-grained synchronization
1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis
Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data
(IEEE Big Data 2015)
2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate
Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas
Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015)
3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman,
Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014
Streaming + Security / Privacy / Cyber-physical systems
1. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo
Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16)
2. BES – Differentially Private and Distributed Event Aggregation in Advanced Metering
Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd
ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16],
2016.
3. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano,
Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in
Communication Networks (SecureComm) 2014
• Motivation / System Model
1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues
in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM.
2. Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream
processing. SIGMOD Rec., 34(4), December 2005.
3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM,
MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers,
EDBT ’02, London, UK, 2002. Springer-Verlag.
• Centralized Stream Processing Engines
1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh
Srivastava, and Jennifer Widom. Stream: The Stanford data stream management system. Springer, 2004.
2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic
foundations and query execution. The VLDB Journal, 15(2), June 2006.
3. Daniel J. Abadi, Don Carney, Uǧur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee,
Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data
stream management. The VLDB Journal, 12(2), August 2003.
4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data
streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06.
VLDB Endowment, 2006.
• Distributed Stream Processing Engines
1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uǧur Çetintemel, Mitch Cherniack, Jeong-Hyon
Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and
Stanley B. Zdonik. The design of the borealis stream processing engine. In CIDR, pages 277–289, 2005.
2. Magdalena Balazinska, Hari Balakrishnan, Samuel R Madden, and Michael Stonebraker. Fault-tolerance
in the borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008.
ACM ID: 1331907.
3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In
Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London,
UK, 2001. Springer-Verlag.
4. Jeong-hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and
Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS,
2003.
5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker,
and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering,
International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
• Parallel Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. Streamcloud:
A large scale data streaming system. In ICDCS 2010: International Conference on Distributed
Computing Systems, pages 126–137, June 2010.
2. Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive
partitioning operator for continuous query systems. In ICDE, 2002.
• Elastic Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, and Patrick
Valduriez. Streamcloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel
and Distributed Systems, 99(PrePrints), 2012.
2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral
Symposium, MDS ’11, New York, NY, USA, 2011. ACM.
3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly
available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops,
EDBT-ICDT ’12, New York, NY, USA, 2012. ACM.
4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of
data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.
Streaming analytics state of the art
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
From Simulation to Online Gaming: the need for adaptive solutions
From Simulation to Online Gaming: the need for adaptive solutions From Simulation to Online Gaming: the need for adaptive solutions
From Simulation to Online Gaming: the need for adaptive solutions
 
DTN
DTNDTN
DTN
 
MongoDB for Time Series Data
MongoDB for Time Series DataMongoDB for Time Series Data
MongoDB for Time Series Data
 

Dernier

LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 

Dernier (20)

LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by Petrovic
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 

Data Streaming (in a Nutshell) ... and Spark's window operations

  • 1. Data Streaming (in a Nutshell)... and Spark’s window operations. Vincenzo Gulisano, Ph.D., Chalmers University of Technology
  • 2. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References
  • 4. https://vincenzogulisano.com/ Assistant Professor, Distributed Computing and Systems Research Group, Department of Computer Science and Engineering, Chalmers University of Technology
  • 5. Our research team's expertise and projects: cyber security; efficient parallel and stream computing; distributed systems; IoT and sensor networks
  • 7. Motivation • Since the year 2000, applications such as: – Sensor networks – Network traffic analysis – Financial tickers – Transaction log analysis – Fraud detection • Require: – Continuous processing of data streams – In a real-time fashion
  • 8. Motivation • Relying 100% on store-then-process (i.e., DBs) is not feasible: – High-speed networks: nanoseconds to handle a packet – ISP router: gigabytes of headers every hour, … • Data streaming: – In memory – Bounded resources – Efficient one-pass analysis
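  • 9. Motivation • DBMS vs. DSMS – DBMS: (1) data is stored on disk, (2) a query is issued, (3) query processing in main memory returns the query results – DSMS: data flows through a continuous query that is processed in main memory and keeps producing query results • What about ?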
  • 10. Stonebraker, Michael, Uğur Çetintemel and Stan Zdonik. The 8 requirements of real-time stream processing. (2005) 1. Keep the data moving 2. Query interface, e.g., extended SQL 3. Handle imperfections 4. Generate predictable outcomes 5. Integrate stored and streaming data 6. Guarantee data safety and availability 7. Partition and scale applications automatically 8. Process and respond instantaneously
  • 11. System Model • Data Stream: unbounded sequence of tuples – Example: Call Description Record (CDR) – Schema (field: type): Caller: text; Callee: text; Time (secs): int; Price (€): double – Sample tuples, ordered by time: (A, B, 8:00, 3), (C, D, 8:20, 7), (A, E, 8:35, 6)
  • 12. System Model • Operators: – Stateless: 1 input tuple → 1 output tuple – Stateful: 1+ input tuple(s) → 1 output tuple
  • 13. System Model: Stateless Operators • Map: transform the tuples' schema. Example: convert price € → $ • Filter: discard / route tuples. Example: route depending on price • Union: merge multiple streams (sharing the same schema). Example: merge CDRs from different sources
  • 14. System Model: Stateful Operators • Aggregate: compute aggregate functions (group-by). Example: compute avg. call duration • Join: match tuples from 2 streams (equality predicate). Example: match CDRs with prices in the same range
  • 15. System Model • Continuous Query: graph of operators and streams • Example: Map (convert € → $) → Filter (only > 10 $) → Aggregate (count calls made by each Caller number) • Schemas along the query: input ⟨Caller, Callee, Time (secs), Price (€)⟩; after Map ⟨Caller, Callee, Time (secs), Price ($)⟩; after Filter ⟨Caller, Callee, Time (secs), Price ($)⟩; after Aggregate ⟨Caller, Calls, Time (secs)⟩
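The Map → Filter → Aggregate query of slide 15 can be sketched with plain Python generators. This is only an illustration and not how a stream processing engine is implemented: the `CDR` type, the function names, the exchange rate and the sample prices are all assumptions made for the example, and the aggregate runs over a finite sample rather than over windows.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, NamedTuple


class CDR(NamedTuple):
    caller: str
    callee: str
    time_secs: int
    price: float  # EUR on input, USD after the Map step


EUR_TO_USD = 1.1  # hypothetical exchange rate, for illustration only


def convert_to_usd(stream: Iterable[CDR]) -> Iterator[CDR]:
    """Map: one output tuple per input tuple, with the price converted."""
    for cdr in stream:
        yield cdr._replace(price=cdr.price * EUR_TO_USD)


def only_above(stream: Iterable[CDR], threshold: float) -> Iterator[CDR]:
    """Filter: forward only the tuples whose price exceeds the threshold."""
    for cdr in stream:
        if cdr.price > threshold:
            yield cdr


def count_calls(stream: Iterable[CDR]) -> Dict[str, int]:
    """Aggregate: count calls per caller (here over a finite sample)."""
    return dict(Counter(cdr.caller for cdr in stream))


# Hypothetical sample; times are seconds since midnight (8:00, 8:20, 8:35).
sample = [
    CDR("A", "B", 28800, 10.0),
    CDR("C", "D", 30000, 7.0),
    CDR("A", "E", 30900, 12.0),
]
calls_per_caller = count_calls(only_above(convert_to_usd(sample), 10.0))
print(calls_per_caller)  # {'A': 2}
```

Map and Filter keep no state between tuples, while the aggregate has to remember a counter per caller, which is what makes it stateful.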
  • 16. System Model • Infinite sequence of tuples / bounded memory → windows • Example: 1-hour windows, each starting 20 minutes after the previous one: [8:00,9:00), [8:20,9:20), [8:40,9:40)
  • 17. System Model • Infinite sequence of tuples / bounded memory → windows • Example: count tuples over 1-hour windows. Tuples arrive at 8:05, 8:15, 8:22, 8:45 and 9:05. Window [8:00,9:00) contains the first four, so its output is 4; the next window is [8:20,9:20) • What about out-of-order tuples?
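The count in slide 17 can be checked with a few lines of Python. This is a minimal sketch: the helper names are made up for the example, timestamps are expressed as minutes since midnight, and windows are assumed to be half-open intervals whose start times are multiples of the 20-minute advance.

```python
from typing import List


def to_minutes(hhmm: str) -> int:
    """Convert 'H:MM' into minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def count_in_window(timestamps: List[int], start: int, size: int) -> int:
    """Count the tuples whose timestamp falls in [start, start + size)."""
    return sum(1 for ts in timestamps if start <= ts < start + size)


def window_starts(ts: int, size: int, advance: int) -> List[int]:
    """Start times of every window [s, s + size) that contains ts."""
    latest = ts - ts % advance
    return list(range(latest, ts - size, -advance))


timestamps = [to_minutes(t) for t in ["8:05", "8:15", "8:22", "8:45", "9:05"]]

print(count_in_window(timestamps, to_minutes("8:00"), 60))  # 4
print(count_in_window(timestamps, to_minutes("8:20"), 60))  # 3
print(window_starts(to_minutes("8:22"), 60, 20))  # [500, 480, 460]
```

The last call shows that, with a size of 60 and an advance of 20, every tuple belongs to three overlapping windows; the tuple at 8:22 falls in the windows starting at 8:20, 8:00 and 7:40.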
  • 19. Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)
  • 20. Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)

    Java:

        // Reduce function adding two integers, defined separately for clarity
        Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        };

        // Reduce last 30 seconds of data, every 10 seconds
        JavaPairDStream<String, Integer> windowedWordCounts =
            pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10));

    Python:

        # Reduce last 30 seconds of data, every 10 seconds
        windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
  • 21. Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html) • countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream. • reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...] • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]
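The idea behind the `invFunc` variant can be illustrated without Spark. The sketch below is plain Python and not Spark's implementation: `windowed_reduce` is a hypothetical helper, each batch is assumed to be a dict of already reduced per-key values, and the window slides by one batch at a time.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Iterator

Batch = Dict[str, int]


def windowed_reduce(
    batches: Iterable[Batch],
    window_batches: int,
    func: Callable[[int, int], int],
    inv_func: Callable[[int, int], int],
) -> Iterator[Batch]:
    """After each batch, yield the per-key reduction over the last
    `window_batches` batches, updated incrementally."""
    window: Deque[Batch] = deque()
    current: Batch = {}
    for batch in batches:
        # Reduce the new data that enters the window.
        for key, value in batch.items():
            current[key] = func(current[key], value) if key in current else value
        window.append(batch)
        # "Inverse reduce" the old data that leaves the window.
        if len(window) > window_batches:
            for key, value in window.popleft().items():
                current[key] = inv_func(current[key], value)
        yield dict(current)


batches = [{"a": 1, "b": 2}, {"a": 3}, {"b": 1}, {"a": 2}]
results = list(
    windowed_reduce(batches, 3, lambda x, y: x + y, lambda x, y: x - y)
)
print(results[-1])  # {'a': 5, 'b': 1}
```

Each step touches only the entering and the leaving batch instead of re-reducing the whole window, which is why the reduce function must be invertible. In this sketch, keys whose value drops to zero simply stay in the result.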
  • 22. Maintaining tuples or windows? Timeline: tuples at 8:05, 8:15, 8:22, 8:45, 9:05; windows [8:00,9:00) and [8:20,9:20) • Maintain tuples. When the window shifts: 1. Remove the contribution of stale tuples 2. Go on adding new incoming tuples • Need to maintain a single window instance • Need to maintain all the tuples (how many?)
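A minimal sketch of the "maintain tuples" strategy for a windowed count follows. The class name and its fields are made up for the example, timestamps are minutes since midnight, and tuples are assumed to arrive in timestamp order and never before the first window starts.

```python
from collections import deque
from typing import Deque, List, Tuple


class TupleMaintainingCount:
    """Keeps the tuples and a single window instance."""

    def __init__(self, size: int, advance: int, first_start: int) -> None:
        self.size = size
        self.advance = advance
        self.start = first_start           # start of the current window
        self.tuples: Deque[int] = deque()  # timestamps still needed
        self.count = 0                     # count for the current window
        self.outputs: List[Tuple[int, int]] = []  # (window start, count)

    def on_tuple(self, ts: int) -> None:
        # Shift the window until it covers the incoming tuple.
        while ts >= self.start + self.size:
            self.outputs.append((self.start, self.count))
            self.start += self.advance
            # 1. Remove the contribution of stale tuples.
            while self.tuples and self.tuples[0] < self.start:
                self.tuples.popleft()
                self.count -= 1
        # 2. Go on adding new incoming tuples.
        self.tuples.append(ts)
        self.count += 1


agg = TupleMaintainingCount(size=60, advance=20, first_start=480)  # 8:00
for ts in (485, 495, 502, 525, 545):  # 8:05, 8:15, 8:22, 8:45, 9:05
    agg.on_tuple(ts)

print(agg.outputs)  # [(480, 4)]
print(agg.count)    # 3
```

The tuple at 9:05 closes [8:00,9:00) with a count of 4, evicts 8:05 and 8:15, and leaves three tuples in [8:20,9:20). Memory grows with the number of tuples that fit in one window, which is the "how many?" question raised on the slide.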
  • 23. Maintaining tuples or windows? Timeline: tuples at 8:05, 8:15, 8:22, 8:45, 9:05; window [8:00,9:00): 3 (so far...), window [8:20,9:20): 1 (so far...) • Maintain windows. When a tuple arrives: 1. Add its contribution to all the windows it falls in • No need to maintain tuples • Need to maintain all the windows to which each tuple contributes
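The "maintain windows" strategy can be sketched in the same style. Again the class and field names are hypothetical, timestamps are minutes since midnight, tuples are assumed to arrive in timestamp order, and window starts are assumed to be multiples of the advance.

```python
from typing import Dict, List, Tuple


class WindowMaintainingCount:
    """Keeps one partial count per open window and no tuples."""

    def __init__(self, size: int, advance: int) -> None:
        self.size = size
        self.advance = advance
        self.open: Dict[int, int] = {}            # window start -> count so far
        self.outputs: List[Tuple[int, int]] = []  # (window start, final count)

    def on_tuple(self, ts: int) -> None:
        # With in-order tuples, windows ending at or before ts are complete.
        for start in sorted(s for s in self.open if s + self.size <= ts):
            self.outputs.append((start, self.open.pop(start)))
        # Add the tuple's contribution to all the windows it falls in.
        start = ts - ts % self.advance
        while start > ts - self.size:
            self.open[start] = self.open.get(start, 0) + 1
            start -= self.advance


agg = WindowMaintainingCount(size=60, advance=20)
for ts in (485, 495, 502):  # 8:05, 8:15, 8:22
    agg.on_tuple(ts)
so_far = {start: agg.open[start] for start in (480, 500)}
print(so_far)  # {480: 3, 500: 1}

for ts in (525, 545):  # 8:45, 9:05
    agg.on_tuple(ts)
print(agg.outputs[-1])  # (480, 4)
```

After the first three tuples the partial counts match the slide: 3 for [8:00,9:00) and 1 for [8:20,9:20). No tuple is stored, but the number of open windows per tuple is the window size divided by the advance (three in this example).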
  • 25. References (non-exhaustive list) Bedtime reading about Data Streaming 1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012. Shared-nothing parallelism / Elasticity 1. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th International Conference on Distributed Computing Systems (ICDCS) 2010 2. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, Patrick Valduriez. IEEE Transactions on Parallel and Distributed Systems (TPDS)
  • 26. References (non-exhaustive list) Shared-memory parallelism / fine-grained synchronization 1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data (IEEE Big Data 2015) 2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015) 3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman, Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014 Streaming + Security / Privacy / Cyber-physical systems 1. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16) 2. BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16], 2016. 3. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano, Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in Communication Networks (SecureComm) 2014
  • 27. References (non-exhaustive list) • Motivation / System Model 1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM. 2. Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4), December 2005. 3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, EDBT ’02, London, UK, 2002. Springer-Verlag.
  • 28. References (non-exhaustive list) • Centralized Stream Processing Engines 1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom. STREAM: The Stanford data stream management system. Springer, 2004. 2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2), June 2006. 3. Daniel J. Abadi, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2), August 2003. 4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06. VLDB Endowment, 2006.
  • 29. References (non-exhaustive list) • Distributed Stream Processing Engines 1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uğur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277–289, 2005. 2. Magdalena Balazinska, Hari Balakrishnan, Samuel R. Madden, and Michael Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008. ACM ID: 1331907. 3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London, UK, 2001. Springer-Verlag. 4. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uğur Çetintemel, Michael Stonebraker, and Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS, 2003. 5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uğur Çetintemel, Michael Stonebraker, and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering, International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
  • 30. References (non-exhaustive list) • Parallel Stream Processing Engines 1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. StreamCloud: A large scale data streaming system. In ICDCS 2010: International Conference on Distributed Computing Systems, pages 126–137, June 2010. 2. Mehul Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, 2002.
  • 31. References (non-exhaustive list) • Elastic Stream Processing Engines 1. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, and Patrick Valduriez. StreamCloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2012. 2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral Symposium, MDS ’11, New York, NY, USA, 2011. ACM. 3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT-ICDT ’12, New York, NY, USA, 2012. ACM. 4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.

Editor's notes

  1. These are the original definitions; they have evolved and been modified over time
  2. Interesting: one or two functions?
  3. Orthogonal issue: when to compute the final results
  4. Orthogonal issue: when to compute the final results