4. The world does not wait
• Big data applications are built with the sole
purpose of serving a business case: gathering
an understanding of the world that gives
an advantage.
• The necessity of building streaming applications
arises from the fact that, in many cases, the
value of the information gathered drops
dramatically with time.
9. Batch/streaming duality
• Streaming applications can bring value by giving
an approximate answer just in time. If timing is not
an issue (e.g. daily results), batch pipelines can
provide a good solution.
[Chart: value of information vs. time, comparing streaming and batch]
11. Start big, grow small
• Despite vendor advertising, jumping straight into a
streaming application is not always advisable
• It is harder to get right, and you run into limitations:
probabilistic data structures, weaker guarantees, …
• The value of the data you are about to gather is not
clear in a discovery phase.
• Some newer libraries provide the same set of primitives
for both batch and streaming. It is possible to develop
the core of the idea as a batch job and translate it into a
streaming pipeline later (see the sketch below).
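Apache Beam is one library built around this batch/streaming duality. A minimal word-count sketch, with placeholder file names; swapping the bounded text source for an unbounded one (e.g. beam.io.ReadFromPubSub) turns the same core logic into a streaming pipeline:

```python
import apache_beam as beam

# Batch word count: the core logic stays identical if the bounded
# ReadFromText source is later replaced by a streaming source.
with beam.Pipeline() as p:
    (p
     | 'Read'   >> beam.io.ReadFromText('input.txt')   # placeholder input
     | 'Split'  >> beam.FlatMap(lambda line: line.split())
     | 'Pair'   >> beam.Map(lambda word: (word, 1))
     | 'Count'  >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: f'{kv[0]}: {kv[1]}')
     | 'Write'  >> beam.io.WriteToText('counts'))      # placeholder output
```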
12. Not always practical
• As a developer, you can face any of the following situations:
• It is mandatory
• It is doubtful
• It will never be necessary
20. Lambda architecture
• Batch layer (e.g. Spark, HDFS): processes the master
dataset (append only) to precompute batch views
(the views the front end will query)
• Speed layer (streaming): calculates ephemeral
views based only on recent data
• Motto: take reprocessing and recovery into
account
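A minimal sketch of how the two layers meet at query time, assuming the views are simple key/value counts (the dictionaries are hypothetical stand-ins for real view stores):

```python
def query(key, batch_view, speed_view):
    # The batch view covers history up to the last batch run; the
    # speed view covers only events that arrived since then.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {'page_a': 1040, 'page_b': 370}   # precomputed by the batch layer
speed_view = {'page_a': 12}                    # ephemeral, recent data only
print(query('page_a', batch_view, speed_view)) # 1052
```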
21. Lambda architecture
• Problems:
• Maintaining the two code bases in sync (they are
often different because the speed layer cannot
reproduce the same computation)
• Synchronising the two layers in the query
layer is an additional problem
24. Kappa approach
• Maintain only one code base and reduce the
accidental complexity that comes from using too
many technologies.
• You can go back and reprocess from the retained log
if something goes wrong (see the sketch below)
• Not a silver bullet and not a prescription of
technologies, just a framework.
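A minimal sketch of Kappa-style reprocessing, assuming a replayable log (as Kafka's retention provides): when the logic changes, replay the whole log through a second job into a fresh output, then switch reads over.

```python
def run_job(log, process, output):
    # Replay the retained log from offset 0 through the job logic.
    for offset, event in enumerate(log):
        output[offset] = process(event)
    return output

log = ['e1', 'E2', 'e3']                 # the retained, replayable log
view_v1 = run_job(log, str.upper, {})    # original job output
view_v2 = run_job(log, str.lower, {})    # reprocessed with fixed logic
# Once view_v2 has caught up, point queries at it and drop view_v1.
```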
27. Concepts are basic
• There are multiple frameworks
available nowadays that
change terminology trying to
differentiate themselves.
• It makes starting on streaming
a bit confusing…
28. Concepts are basic
• Actually, there are many
concepts which are shared
between them, and they are
quite logical.
29. Step 1: data structure
• The basic data structure is made of 4 elements
• Sink: where is this thing going?
• Partition key: to which shard?
• Sequence id: when was this produced?
• Data: anything that can be serialised (JSON, Avro, photo, …)
(Sink, Partition key, Sequence id, Data)
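A minimal sketch of this record structure (field names are illustrative, not any particular framework's API):

```python
from dataclasses import dataclass

@dataclass
class Message:
    sink: str           # where is this thing going? (topic / stream name)
    partition_key: str  # decides which shard receives the message
    sequence_id: int    # when was this produced? (ordering within a shard)
    data: bytes         # anything serialisable: JSON, Avro, a photo, ...

msg = Message(sink='locations', partition_key='user-42',
              sequence_id=1718031600,
              data=b'{"lat": 40.4, "lon": -3.7}')
```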
30. Step 2: hashing
• The holy grail trick of big data for splitting work, and
also a major building block of streaming
• We use hashing in reverse of the classical way: we
force the clashing of the items of interest, so related
data lands on the same partition
(Sink, Partition key, Sequence id, Data) → h(k) mod N
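A minimal sketch of key-based partitioning, using a stable checksum as a stand-in for whatever hash function a real broker uses:

```python
import zlib

N_PARTITIONS = 4

def partition_for(key: str, n: int = N_PARTITIONS) -> int:
    # h(k) mod N: every message with the same key lands on the same
    # partition, so related events "clash" on purpose.
    return zlib.crc32(key.encode()) % n

for key in ['user-42', 'user-42', 'user-7']:
    print(key, '->', partition_for(key))
# 'user-42' always maps to the same partition number.
```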
31. Step 3: fault tolerance
“Distributed computing is parallel computing when
you cannot trust anything or anyone”
32. Step 3: fault tolerance
• At any point, any node producing data at the
source can stop working:
• Non-persistent: the data is lost
• Persistent: the data is replicated, so it can always
be recovered from another node
33. Step 3: fault tolerance
• At any point, any node computing our pipeline can
go down:
• at most once: we let data be lost; once delivered,
a message is never reprocessed.
• at least once: we ensure delivery, but a message
may be reprocessed.
• exactly once: we ensure delivery and no
reprocessing (see the sketch below).
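A minimal sketch of how acknowledgement timing produces the first two semantics (a plain list stands in for a real broker):

```python
def consume_at_most_once(queue, process):
    while queue:
        msg = queue.pop(0)   # acknowledge (remove) BEFORE processing:
        process(msg)         # a crash during process() loses the message

def consume_at_least_once(queue, process):
    while queue:
        msg = queue[0]
        process(msg)         # a crash here means the message gets
        queue.pop(0)         # processed again on restart: ack AFTER
```

Exactly once additionally requires that processing and the acknowledgement commit atomically, for example through idempotent writes or a transaction.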
34. Step 3: fault tolerance
• At any point, any node computing our pipeline can go
down:
• checkpointing: if we have been running the pipeline for
hours and something goes wrong, do we have to start
from the beginning?
• Streaming systems put mechanisms in place to
checkpoint progress, so the new worker knows the
previous state and where to start from (see the
sketch below).
• This usually involves other systems to save checkpoints
and synchronise.
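A minimal sketch of checkpointing, with a dictionary standing in for the external, durable checkpoint store (e.g. ZooKeeper or a database) and a list of numbers standing in for the event log:

```python
checkpoint_store = {}  # stand-in for an external, durable store

def run(log, job_id):
    # Resume from the last checkpoint: both position and state.
    offset, total = checkpoint_store.get(job_id, (0, 0))
    while offset < len(log):
        total += log[offset]       # toy stateful computation
        offset += 1
        if offset % 100 == 0:      # checkpoint progress periodically
            checkpoint_store[job_id] = (offset, total)
    return total

# A restarted worker calls run() again and carries on from the last
# checkpoint instead of reprocessing hours of data from the beginning.
```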
35. Step 4: delivery
• One at a time: we process each message
individually, which increases the processing cost
per message.
• Micro-batch: we always process data in batches
gathered at specified time intervals or sizes. This
makes it impossible to reduce per-message latency
below the batch interval (see the sketch below).
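A minimal sketch of time-based micro-batching (a toy generator; real systems also cap the batch size):

```python
import time

def micro_batches(source, interval=5.0):
    # Group incoming messages into batches emitted every `interval`
    # seconds; no message can be answered faster than the time it
    # spends waiting for its batch to close.
    batch, deadline = [], time.monotonic() + interval
    for msg in source:
        batch.append(msg)
        if time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + interval
    if batch:
        yield batch  # flush the final partial batch
```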
38. [Diagram: messages of the form (Topic, Partition key, Data) flowing onto topics]
39. [Diagram: the same messages routed to partitions by h(k) mod N]
44. Framework comparison
| Framework | One-at-a-time | Mini-batch | Exactly-once | Deploy       | Windowing | Functional | Catch                |
| Storm     | Yes           | Yes *      | Yes *        | Custom, YARN | Yes *     | ~          | DRPC                 |
| Spark     | No            | Yes        | Yes          | YARN, Mesos  | Yes       | Yes        | MLlib, ecosystem     |
| Flink     | Yes           | Yes        | Yes          | YARN         | Yes       | Yes        | Flexible windowing   |
| Samza     | Yes           | ~          | No           | YARN         | ~         | No         | DB update log plugin |
| Dataflow  | Yes           | Yes        | Yes          | Google       | Yes       | ~          | Google ecosystem     |
| Kinesis   | Yes           | you        | No           | AWS          | you       | No         | AWS ecosystem        |
* with Trident
45. Flink basic concepts
• Stream: source of data that feeds computations (a
batch dataset is a bounded stream)
• Transformations: operations that take one or more
streams as input and compute an output stream.
They can be stateless or stateful (exactly once).
• Sink: endpoint that receives the output stream of a
transformation
• Dataflow: DAG of streams, transformations and sinks
(see the sketch below).
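A minimal sketch of those four concepts using PyFlink's DataStream API (a later addition to Flink, shown here only to make the vocabulary concrete; the in-memory collection source is a placeholder):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stream: a bounded collection is just a bounded stream.
ds = env.from_collection(['to be', 'or not to be'])

# Transformations: stateless (flat_map, map) and stateful (key_by + reduce).
counts = (ds.flat_map(lambda line: line.split())
            .map(lambda word: (word, 1))
            .key_by(lambda kv: kv[0])
            .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()            # Sink: print the output stream
env.execute('wordcount')  # run the whole dataflow DAG
```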
49. Samza basic concepts
• Streams: persistent sets of immutable messages of a similar type
and category, transactional in nature.
• Jobs: code that performs logical transformations on a set of
input streams to append to a set of output streams.
• Partitions: each stream breaks into partitions, each a totally
ordered sequence of messages.
• Tasks: each task consumes data from one partition.
• Dataflow: composition of jobs that connects a set of streams.
• Containers: the physical unit of parallelism.
52. Storm basic concepts
• Spout: source of data from any external system.
• Bolts: transformations of one or more streams into another
set of output streams.
• Stream grouping: how streaming data is shuffled between bolts.
• Topology: set of spouts and bolts that processes a stream of
data.
• Tasks and Workers: the unit of work deployable into one
container. A worker can process one or more tasks; each task
deploys to one worker.
55. Spark basic concepts
• DStream: continuous stream of data represented by a
series of RDDs. Each RDD contains the data for a specific
time interval.
• Input DStream and Receiver: source of data that feeds a
DStream.
• Transformations: operations that transform one DStream
into another DStream (stateless, and stateful with exactly-once
semantics).
• Output operations: operations that periodically push the data
of a DStream to a specific output system (see the sketch below).
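A minimal sketch of these concepts with the classic socket word count (host and port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext('local[2]', 'NetworkWordCount')
ssc = StreamingContext(sc, 5)                    # one RDD every 5 seconds

lines = ssc.socketTextStream('localhost', 9999)  # input DStream + receiver
counts = (lines.flatMap(lambda l: l.split())     # transformations
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # output operation

ssc.start()
ssc.awaitTermination()
```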
58. Conclusions…
• Think of streaming when there is a hard constraint on time-to-information
• Use a queue system as your place of orchestration
• Select the processing system that best suits your use case
• Samza: early stage, more to come in the near future.
• Spark: a good option if mini-batch will always work for you.
• Storm: a good option if you can set up the infrastructure. DRPC provides an interesting pattern
for some use cases.
• Flink: reduced ecosystem because it has a shorter history. Its design learnt from all past
frameworks and is the most flexible.
• Datastream: the original inspiration for Flink. A good and flexible model if you want to go the
managed route and make use of the Google toolbox (Bigtable, etc.)
• Kinesis: only if you have some legacy. You are probably better off using the Spark connector in
AWS EMR.
59. Where to go…
• All code examples are available on GitHub
• Kafka: https://github.com/torito1984/kafka-playground.git,
https://github.com/torito1984/kafka-doyle-generator.git
• Spark: https://github.com/torito1984/spark-doyle.git
• Storm: https://github.com/torito1984/trident-doyle.git
• Flink: https://github.com/torito1984/flink-sherlock.git
• Samza: https://github.com/torito1984/samza-locations.git