SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
Building streaming
architectures
David Martinez Rego
BigD@ta Coruña 25-26 April 2016
Index
• Why?
• When?
• What?
• How?
Why?
The world does not wait
• Big data applications are build with the sole
purpose of managing a business case of gathering
an understanding about the word that would give
an advantage.
• The necessity of building streaming applications
arises from the fact that in many applications, the
value of the information gathered drops
dramatically with time.
Batch/streaming duality
• Streaming applications can bring value by giving
an approximate answer just on time. If timing is not
an issue (daily), batch pipelines can provide a
good solution.
time
value
streaming
batch
When?
Start big, grow small
• Despite the advertisement of vendors, jump in to a
streaming application is not always advisable
• It is harder to get it right and you encounter limitations:
probabilistic data structures, guarantees, …
• The value of the data you are about to gather is not
clear in a discovery phase.
• Some new libraries provide the same set of primitives
both for batch and streaming. It is possible to develop
the core of the idea and just translate that to a streaming
pipeline later.
Not always practical
• As a developer, you can
face any of the following
situations
• It is mandatory
• It is doubtful
• It will never be necessary
What?
Gathering Brokering SinkProcessing/Analysis
Execution engine
Coordination
Persistent
Persistent
~Persistent
External systems
https://www.mapr.com/developercentral/lambda-architecture
Lambda architecture
• Batch layer (ex. Spark, HDFS): process the master
dataset (append only) to precompute batch views
(a view the front end will query
• Speed layer (streaming): calculate ephemeral
views only based on recent data!
• Motto: take into account reprocessing and
recovery
Lambda architecture
• Problems:!
• Maintaing two code bases in sync (often different
because speed layer cannot reproduce the
same)
• Synchronisation of the two layers in the query
layer is an additional problem
Gathering Brokering
Reservoir of Master Data
Production
Production
Catching up…
Production pipeline
New pipeline
Gathering Brokering
Reservoir of Master Data
Production
Done!
Old pipeline
Production pipeline
Kappa approach
• Only maintain one code base and reduce
accidental complexity by using too many
technologies.
• Can reverse back if something goes wrong
• Not a silver bullet and not prescription of
technologies, just a framework.
Gathering Brokering SinkProcessing/Analysis
Execution engine
Coordination
Persistent
Persistent
~Persistent
External systems
How?
Concepts are basic
• There are multiple frameworks
available nowadays who
change terminology trying to
differentiate.
• It makes starting on streaming
a bit confusing…
Concepts are basic
• It makes starting on streaming
a bit confusing…
• Actually there are many
concepts which are shared
between them and they are
quite logical.
Step 1: data structure
• The basic data structure is made of 4 elements
• Sink: where is this thing going?
• Partition key: to which shard?
• Sequence id: when was this produced?
• Data: anything that can be serialised (JSON, Avro, photo, …)
Partition key Sequence idSink Data
( , , ),
Step 2: hashing
• The holy grail trick of big data to split the work, and
also major block of streaming
• We use hashing in the reverse of classical, force
the clashing of the things that are if my interest
Partition key Sequence idSink Data
( , , ),
h(k) mod N
Step 3: fault tolerance
“Distributed computing is parallel computing when
you cannot trust anything or anyone”
Step 3: fault tolerance
• At any point any node producing the data in the
source can stop working
• Non persistent: data is lost
• Persistent: data is replicated so it can always be
recovered from other node
Step 3: fault tolerance
• At any point any node computing our pipeline can
go down
• at most once: we let data be lost, once delivered
do not reprocess.
• at least once, we ensure delivery, can be
reprocessed.
• exactly once, we ensure delivery and no
reprocessing
Step 3: fault tolerance
• At any point any node computing our pipeline can go
down
• checkpointing: If we have been running the pipeline for
hours and something goes wrong, do I have to start
from the beginning?
• Streaming systems put in place mechanisms to
checkpoint progress so the new worker knows
previous state and where to start from.
• Usually involves other systems to save checkpoints
and synchronise.
Step 4: delivery
• One at a time: we process each message
individually. Increases response time per message.
• Micro-batch: we always process data in batches
gathered at specified time intervals or size. Makes
it impossible to reduce message processing below
a limit.
Gathering Brokering SinkProcessing/Analysis
Execution engine
Coordination
Persistent
Persistent
~Persistent
External systems
Gathering
Partition keyTopic Data
, , )(
…
Partition keyTopic Data
, , )(
Partition keyTopic Data
, , )(
…
Partition keyTopic Data
, , )(
Partition keyTopic Data
, , )(
…
Partition keyTopic Data
, , )(
Partition keyTopic Data
, , )(
…
Partition keyTopic Data
, , )(
h(k) mod N
h(k) mod N
h(k) mod N
h(k) mod N
…
Consumer 1
Consumer 1
Zookeeper
…
…
…
Consumer 2
Consumer 1
Zookeeper
…
…
Consumer 2 …
Consumer 1 …
Produce to
Kafka, consume
from Kafka
Gathering Brokering SinkProcessing/Analysis
Execution engine
Coordination
Persistent
Persistent
~Persistent
External systems
one!
at-a-time
mini!
batch
exactly!
once
Deploy Windowing Functional Catch
Yes Yes * Yes *
Custom
YARN
Yes * ~ DRPC
No Yes Yes
YARN
Mesos
Yes Yes
MLlib,
ecosystem
Yes Yes Yes YARN Yes Yes
Flexible
windowing
Yes ~ No YARN ~ No
DB update
log plugin
Yes Yes Yes Google Yes ~
Google
ecosystem
Yes you No AWS you No
AWS
ecosystem
* with Trident
Flink basic concepts
• Stream: source of data that feeds computations (a
batch dataset is a bounded stream)
• Transformations: operation that takes one or more
streams as input and computes an output stream.
They can be stateless of stateful (exactly once).
• Sink: endpoint that received the output stream of a
transformation
• Dataflow: DAG of streams, transformations and sinks.
Flink basic concepts
Flink basic concepts
Samza basic concepts
• Streams: persistent set of immutable messages of similar type
and category with transactional nature.
• Jobs: code that performs logical transformations on a set of
input streams to append to a set of output streams.
• Partitions: Each stream breaks into partitions, set of totally
ordered sequence of examples.
• Tasks: Each task consumes data from one partition.
• Dataflow: composition of jobs that connects a set of streams.
• Containers: physical unit of parallelism.
Samza basic concepts
Storm basic concepts
• Spout: source of data from any external system.
• Bolts: transformations of one or more streams into another
set of output streams.
• Stream grouping: shuffling of streaming data between bolts.
• Topology: set of spouts and bolts that process a stream of
data.
• Tasks and Workers: unit of work deployable into one
container. Workers can process one or more tasks. Task
deploy to one worker.
Storm basic concepts
Trident Topology
Compile
Spark basic concepts
• DStream: continuous stream of data represented by a
series of RDDs. Each RDD contains data for a specific
time interval.
• Input DStream and Receiver: source of data that feeds a
DStream.
• Transformations: operations that transform one DStream
in another DStream (stateless and stateful with exactly
once semantics).
• Output operations: operations that periodically push data
of a DStream to a specific output system.
Spark basic concepts
Conclusions…
• Think on streaming when there is a hard constraint on time-to-information
• Use a queue system as your place of orchestration
• Select the processing system that best suits to your use case
• Samza: early stage, more to come in the close future.
• Spark: good option if mini batch will always work for you.
• Storm: good option if you can setup the infrastructure. DRPC provides an interesting pattern
for some use cases.
• Flink: reduced ecosystem because it has a shorter history. Its design learnt from all past
frameworks and is the most flexible.
• Datastream: original inspiration for Flink. Good and flexible model if you want to go the
managed route and make use of Google toolbox (Bigtable, etc)
• Kinesis: Only if you have some legacy. Probably better off using Spark connector in AWS
EMR.
Where to go…
• All code examples are available in Github
• Kafka https://github.com/torito1984/kafka-
playground.git, https://github.com/torito1984/kafka-
doyle-generator.git
• Spark https://github.com/torito1984/spark-doyle.git!
• Storm https://github.com/torito1984/trident-doyle.git!
• Flink https://github.com/torito1984/flink-sherlock.git!
• Samza https://github.com/torito1984/samza-locations.git
Building streaming
architectures
David Martinez Rego
BigD@ta Coruña 25-26 April 2016

Contenu connexe

Tendances

A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...confluent
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIOJozo Kovac
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinFlink Forward
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...Grier Johnson
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Monitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time PipelineMonitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time PipelineApache Apex
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to SurviveHadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Surviveconfluent
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology confluent
 

Tendances (20)

A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li...
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Monitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time PipelineMonitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time Pipeline
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to SurviveHadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
 

En vedette

Real Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day MunichReal Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day MunichGuido Schmutz
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitGyula Fóra
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialNeera Agarwal
 
RBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingRBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingGyula Fóra
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
 
Real-time analytics as a service at King
Real-time analytics as a service at King Real-time analytics as a service at King
Real-time analytics as a service at King Gyula Fóra
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsVincenzo Gulisano
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the EnterpriseJesus Rodriguez
 
Reliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTGuido Schmutz
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?MapR Technologies
 
The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...Audrey Neveu
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingGuido Schmutz
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Guido Schmutz
 
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Web Services
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 

En vedette (20)

Real Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day MunichReal Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day Munich
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics Tutorial
 
RBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingRBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at King
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
Real-time analytics as a service at King
Real-time analytics as a service at King Real-time analytics as a service at King
Real-time analytics as a service at King
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operations
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the Enterprise
 
Reliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoT
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016
 
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 

Similaire à Building Big Data Streaming Architectures

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...confluent
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 
Petabytes and Nanoseconds
Petabytes and NanosecondsPetabytes and Nanoseconds
Petabytes and NanosecondsRobert Greiner
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropeFlip Kromer
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopAyon Sinha
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
 

Similaire à Building Big Data Streaming Architectures (20)

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
Petabytes and Nanoseconds
Petabytes and NanosecondsPetabytes and Nanoseconds
Petabytes and Nanoseconds
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOL
 
Introduction
IntroductionIntroduction
Introduction
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 

Dernier

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 

Dernier (20)

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 

Building Big Data Streaming Architectures

  • 1. Building streaming architectures David Martinez Rego BigD@ta Coruña 25-26 April 2016
  • 4. The world does not wait • Big data applications are build with the sole purpose of managing a business case of gathering an understanding about the word that would give an advantage. • The necessity of building streaming applications arises from the fact that in many applications, the value of the information gathered drops dramatically with time.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9. Batch/streaming duality • Streaming applications can bring value by giving an approximate answer just on time. If timing is not an issue (daily), batch pipelines can provide a good solution. time value streaming batch
  • 10. When?
  • 11. Start big, grow small • Despite the advertisement of vendors, jump in to a streaming application is not always advisable • It is harder to get it right and you encounter limitations: probabilistic data structures, guarantees, … • The value of the data you are about to gather is not clear in a discovery phase. • Some new libraries provide the same set of primitives both for batch and streaming. It is possible to develop the core of the idea and just translate that to a streaming pipeline later.
  • 12. Not always practical • As a developer, you can face any of the following situations • It is mandatory • It is doubtful • It will never be necessary
  • 13.
  • 14.
  • 15.
  • 16.
  • 17. What?
  • 18. Gathering Brokering SinkProcessing/Analysis Execution engine Coordination Persistent Persistent ~Persistent External systems
  • 20. Lambda architecture • Batch layer (ex. Spark, HDFS): process the master dataset (append only) to precompute batch views (a view the front end will query • Speed layer (streaming): calculate ephemeral views only based on recent data! • Motto: take into account reprocessing and recovery
  • 21. Lambda architecture • Problems:! • Maintaing two code bases in sync (often different because speed layer cannot reproduce the same) • Synchronisation of the two layers in the query layer is an additional problem
  • 22. Gathering Brokering Reservoir of Master Data Production Production Catching up… Production pipeline New pipeline
  • 23. Gathering Brokering Reservoir of Master Data Production Done! Old pipeline Production pipeline
  • 24. Kappa approach • Only maintain one code base and reduce accidental complexity by using too many technologies. • Can reverse back if something goes wrong • Not a silver bullet and not prescription of technologies, just a framework.
  • 25. Gathering Brokering SinkProcessing/Analysis Execution engine Coordination Persistent Persistent ~Persistent External systems
  • 26. How?
  • 27. Concepts are basic • There are multiple frameworks available nowadays who change terminology trying to differentiate. • It makes starting on streaming a bit confusing…
  • 28. Concepts are basic • It makes starting on streaming a bit confusing… • Actually there are many concepts which are shared between them and they are quite logical.
  • 29. Step 1: data structure • The basic data structure is made of 4 elements • Sink: where is this thing going? • Partition key: to which shard? • Sequence id: when was this produced? • Data: anything that can be serialised (JSON, Avro, photo, …) Partition key Sequence idSink Data ( , , ),
  • 30. Step 2: hashing • The holy grail trick of big data to split the work, and also major block of streaming • We use hashing in the reverse of classical, force the clashing of the things that are if my interest Partition key Sequence idSink Data ( , , ), h(k) mod N
  • 31. Step 3: fault tolerance “Distributed computing is parallel computing when you cannot trust anything or anyone”
  • 32. Step 3: fault tolerance • At any point any node producing the data in the source can stop working • Non persistent: data is lost • Persistent: data is replicated so it can always be recovered from other node
  • 33. Step 3: fault tolerance • At any point any node computing our pipeline can go down • at most once: we let data be lost, once delivered do not reprocess. • at least once, we ensure delivery, can be reprocessed. • exactly once, we ensure delivery and no reprocessing
  • 34. Step 3: fault tolerance • At any point any node computing our pipeline can go down • checkpointing: If we have been running the pipeline for hours and something goes wrong, do I have to start from the beginning? • Streaming systems put in place mechanisms to checkpoint progress so the new worker knows previous state and where to start from. • Usually involves other systems to save checkpoints and synchronise.
  • 35. Step 4: delivery • One at a time: we process each message individually. Increases response time per message. • Micro-batch: we always process data in batches gathered at specified time intervals or size. Makes it impossible to reduce message processing below a limit.
  • 36. Gathering Brokering SinkProcessing/Analysis Execution engine Coordination Persistent Persistent ~Persistent External systems
  • 38. Partition keyTopic Data , , )( … Partition keyTopic Data , , )( Partition keyTopic Data , , )( … Partition keyTopic Data , , )(
  • 39. Partition keyTopic Data , , )( … Partition keyTopic Data , , )( Partition keyTopic Data , , )( … Partition keyTopic Data , , )( h(k) mod N h(k) mod N h(k) mod N h(k) mod N
  • 43. Gathering Brokering SinkProcessing/Analysis Execution engine Coordination Persistent Persistent ~Persistent External systems
  • 44. one! at-a-time mini! batch exactly! once Deploy Windowing Functional Catch Yes Yes * Yes * Custom YARN Yes * ~ DRPC No Yes Yes YARN Mesos Yes Yes MLlib, ecosystem Yes Yes Yes YARN Yes Yes Flexible windowing Yes ~ No YARN ~ No DB update log plugin Yes Yes Yes Google Yes ~ Google ecosystem Yes you No AWS you No AWS ecosystem * with Trident
  • 45. Flink basic concepts • Stream: source of data that feeds computations (a batch dataset is a bounded stream) • Transformations: operation that takes one or more streams as input and computes an output stream. They can be stateless of stateful (exactly once). • Sink: endpoint that received the output stream of a transformation • Dataflow: DAG of streams, transformations and sinks.
  • 48.
  • 49. Samza basic concepts • Streams: persistent set of immutable messages of similar type and category with transactional nature. • Jobs: code that performs logical transformations on a set of input streams to append to a set of output streams. • Partitions: Each stream breaks into partitions, set of totally ordered sequence of examples. • Tasks: Each task consumes data from one partition. • Dataflow: composition of jobs that connects a set of streams. • Containers: physical unit of parallelism.
  • 51.
  • 52. Storm basic concepts • Spout: source of data from any external system. • Bolts: transformations of one or more streams into another set of output streams. • Stream grouping: shuffling of streaming data between bolts. • Topology: set of spouts and bolts that process a stream of data. • Tasks and Workers: unit of work deployable into one container. Workers can process one or more tasks. Task deploy to one worker.
  • 53. Storm basic concepts Trident Topology Compile
  • 54.
  • 55. Spark basic concepts • DStream: continuous stream of data represented by a series of RDDs. Each RDD contains data for a specific time interval. • Input DStream and Receiver: source of data that feeds a DStream. • Transformations: operations that transform one DStream in another DStream (stateless and stateful with exactly once semantics). • Output operations: operations that periodically push data of a DStream to a specific output system.
  • 57.
  • 58. Conclusions… • Think on streaming when there is a hard constraint on time-to-information • Use a queue system as your place of orchestration • Select the processing system that best suits to your use case • Samza: early stage, more to come in the close future. • Spark: good option if mini batch will always work for you. • Storm: good option if you can setup the infrastructure. DRPC provides an interesting pattern for some use cases. • Flink: reduced ecosystem because it has a shorter history. Its design learnt from all past frameworks and is the most flexible. • Datastream: original inspiration for Flink. Good and flexible model if you want to go the managed route and make use of Google toolbox (Bigtable, etc) • Kinesis: Only if you have some legacy. Probably better off using Spark connector in AWS EMR.
  • 59. Where to go… • All code examples are available in Github • Kafka https://github.com/torito1984/kafka- playground.git, https://github.com/torito1984/kafka- doyle-generator.git • Spark https://github.com/torito1984/spark-doyle.git! • Storm https://github.com/torito1984/trident-doyle.git! • Flink https://github.com/torito1984/flink-sherlock.git! • Samza https://github.com/torito1984/samza-locations.git
  • 60. Building streaming architectures David Martinez Rego BigD@ta Coruña 25-26 April 2016