1. © 2017 Hazelcast Inc.
Stream processing
and real-time data pipelines
Vladimir Schreiner | vladimir@hazelcast.com | CS HUG Prague | 29th June 2018
2. © 2017 Hazelcast Inc.
What is Hazelcast?
HAZELCAST IMDG is a Distributed
In-Memory Store
Cache, KV-Store, Messaging
Implements the standard collections API
in a distributed way
Clients in Java, .NET, Node.js, Go,
Python, C++
Since 2008
HAZELCAST JET is a general-purpose
distributed data processing engine
built on Hazelcast. It uses directed
acyclic graphs (DAGs) to model data
flow for low-latency batch and stream
processing.
Since 2015
4. © 2017 Hazelcast Inc.
What is stream processing?
Data Processing: Massage the data when moving
from place to place.
On-Line systems – request/response, small volumes, low-latency
Batch Processing – data in / data out, big volumes, huge latency
Stream processing – data in / data out, big volumes, low-latency
7. © 2017 Hazelcast Inc.
When to Use Stream Processing
• Real-time analytics
• Monitoring, Fraud, Anomalies, Pattern detection,
Prediction
• Event-Driven Architectures
• Real-Time ETL
• Moving batch tasks to near real-time
• Continuous data
• Consistent Resource Consumption (1GB/sec -> 86TB/day)
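The daily volume quoted in the last bullet follows from simple arithmetic. A quick check, assuming decimal units (1 TB = 1000 GB) and a perfectly steady rate; the function name is only illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_sec: float) -> float:
    """Convert a sustained ingest rate in GB/s into TB/day (decimal units)."""
    return rate_gb_per_sec * SECONDS_PER_DAY / 1000


print(daily_volume_tb(1.0))  # 86.4
```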
8. © 2017 Hazelcast Inc.
Streaming is nothing brand new
Complex Event Processing – early 2000s
Materialised Views – 1998 (Oracle 8i)
UNIX Pipelines?
Scale is what's new!
9. © 2017 Hazelcast Inc.
What can a modern stream processing engine (SPE) do for you?
• Offers high-level API to implement the processing pipeline
• map, filter, groupBy, aggregate, join …
• Offers connectors to read and write the data
• Kafka, HDFS, JDBC, JMS …
• You implement the data pipeline and submit it to the SPE
• Executes the pipeline in a parallel and distributed environment
• Moves the data through the pipeline (partitioning, shuffling, backpressure)
• Is fault-tolerant (survives failures) and elastic
• Monitoring, diagnostics etc.
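To make the map / filter / groupBy / aggregate vocabulary concrete, here is a minimal single-process sketch in plain Python. It is not the Hazelcast Jet API or any other engine's API; the input format (`"user,latency_ms"` lines) and all names are assumptions made for the illustration. A real SPE would run the same stages in parallel across a cluster.

```python
from collections import defaultdict

# Hypothetical raw input: "user,latency_ms" lines; -1 marks a failed request.
RAW_EVENTS = ["alice,120", "bob,45", "alice,300", "bob,-1", "carol,80"]


def parse(line):
    """map: turn a raw line into a (user, latency_ms) tuple."""
    user, latency = line.split(",")
    return user, int(latency)


def run_pipeline(lines):
    parsed = (parse(line) for line in lines)        # map
    valid = (e for e in parsed if e[1] >= 0)        # filter
    slowest = defaultdict(int)
    for user, latency in valid:                     # groupBy user
        slowest[user] = max(slowest[user], latency) # aggregate: max per group
    return dict(slowest)


print(run_pipeline(RAW_EVENTS))  # {'alice': 300, 'bob': 45, 'carol': 80}
```

The generators keep the stages lazy, so each record flows through map and filter one at a time instead of being materialised as intermediate lists.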
10. © 2017 Hazelcast Inc.
England – Belgium
• "England amazing! #ENGBEL" -> ENG win +1, BEL lost +1
• "Belgium hopeless, it's so bad #ENGBEL" -> ENG win +2, BEL lost +2
• "My guess is draw for England - Belgium #ENGBEL" -> Both draw +1
12. © 2017 Hazelcast Inc.
1st Gen: MapReduce
• Google File System Paper, 2003
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
• MapReduce paper, 2004
https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
• Apache Hadoop founded by Doug Cutting and Mike Cafarella at Yahoo, 2006
• Commercial Open Source Distros: Cloudera, Hortonworks, MapR
• Lots of additions to the ecosystem
13. © 2017 Hazelcast Inc.
2nd Gen: Spark
• Apache Spark started in 2010 as a research project at UC Berkeley in
the AMPLab, which focuses on big data analytics.
• The goal was to design a programming model that supports a much wider
class of applications than MapReduce. It introduced the DAG.
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
14. © 2017 Hazelcast Inc.
2nd Gen: Spark Streaming
• Spark's fault tolerance was designed for batch
• Spark Streaming paper, 2012: a stream as a sequence of micro-batches
• Spark has since moved on to DataFrames, Tungsten and Spark Streaming,
and its architecture continues to evolve, so it is also 3rd Gen (Continuous).
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
15. © 2017 Hazelcast Inc.
3rd Gen: Continuous streaming
• DAG based
• Streaming based. Not micro-batch
• Batch is simply streaming with bounds
• Learns from previous systems
• Informed by academic papers from the last decade, such as
Google's [1][2] and many more
• Plethora of Choice: Apache Storm, Twitter Heron, Apache Flink,
Kafka Streams, Google DataFlow, Hazelcast Jet, Spark Continuous
Streaming
[1] MillWheel: Fault-Tolerant Stream Processing at Internet Scale
[2] FlumeJava: Easy, Efficient Data-Parallel Pipelines
18. © 2017 Hazelcast Inc.
Hadoop – Obsolete?
https://www.gartner.com/newsroom/id/3809163
22. © 2017 Hazelcast Inc.
1 - Infinite input
• Some operations need bounded data (aggregate, join)
• Slowest response within the last 10 seconds vs. slowest response
since the app was started
• Windows - a bounded view on top of an infinite stream.
The scope of a computation is defined in code, not in the data
(batch) or by operational settings (micro-batch for
streaming)!
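The "last 10 seconds vs. since the app was started" contrast can be sketched as follows. This is an illustrative plain-Python class, not an engine API; it assumes events arrive in timestamp order, and the class and field names are made up for the example.

```python
from collections import deque


class SlowestResponse:
    """Tracks the slowest response in a trailing window and since start."""

    def __init__(self, window_secs=10):
        self.window_secs = window_secs
        self.recent = deque()      # (timestamp, latency) pairs inside the window
        self.all_time_max = None   # unbounded scope: grows stale, never resets

    def add(self, ts, latency_ms):
        if self.all_time_max is None or latency_ms > self.all_time_max:
            self.all_time_max = latency_ms
        self.recent.append((ts, latency_ms))
        # Evict everything that fell out of the trailing window (ts - size, ts].
        while self.recent[0][0] <= ts - self.window_secs:
            self.recent.popleft()
        window_max = max(latency for _, latency in self.recent)
        return window_max, self.all_time_max


tracker = SlowestResponse(window_secs=10)
for ts, latency in [(0, 900), (3, 100), (12, 200), (14, 150)]:
    print(ts, tracker.add(ts, latency))
# The last line printed is: 14 (200, 900)
```

The windowed answer forgets the 900 ms outlier once it is older than 10 seconds, while the since-start answer carries it forever.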
23. © 2017 Hazelcast Inc.
Tumbling windows
• Continuous stream divided into discrete parts.
• Defined by a time duration or record count.
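A minimal sketch of time-based tumbling windows, assuming integer timestamps and windows of the form [start, start + size). The function name is illustrative and the code is not tied to any engine.

```python
from collections import defaultdict


def tumbling_counts(timestamps, size):
    """Count events per tumbling window; keys are window start times."""
    counts = defaultdict(int)
    for ts in timestamps:
        window_start = (ts // size) * size  # each event maps to exactly one window
        counts[window_start] += 1
    return dict(counts)


print(tumbling_counts([1, 4, 9, 10, 15, 27], size=10))
# {0: 3, 10: 2, 20: 1}
```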
24. © 2017 Hazelcast Inc.
Sliding windows
• Fixed-size view that slides as time advances.
• Windows have a fixed size, but they can overlap.
• Defined by a size (time duration, record count) and a shift
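With a sliding window, one event can belong to several overlapping windows. The sketch below assumes integer timestamps, windows [start, start + size) and window starts aligned to multiples of the shift; names are illustrative.

```python
from collections import defaultdict


def window_starts(ts, size, shift):
    """Return the start times of every window [start, start + size) covering ts."""
    starts = []
    start = (ts // shift) * shift  # latest window starting at or before ts
    while start > ts - size:       # this window still covers ts
        starts.append(start)
        start -= shift
    return sorted(starts)


def sliding_counts(timestamps, size, shift):
    """Count events per sliding window; keys are window start times."""
    counts = defaultdict(int)
    for ts in timestamps:
        for start in window_starts(ts, size, shift):
            counts[start] += 1
    return dict(sorted(counts.items()))


print(sliding_counts([12, 17, 22], size=10, shift=5))
# {5: 1, 10: 2, 15: 2, 20: 1}
```

With size equal to shift the windows stop overlapping and the result is the same as for tumbling windows.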
25. © 2017 Hazelcast Inc.
Session windows
• Burst of activity followed by period of inactivity (timeout).
• Collects the activity belonging to the same session.
• It’s dynamic - no fixed start or duration.
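A session window closes when the gap since the previous event exceeds the timeout. A minimal sketch for a single key, assuming integer timestamps and treating a gap exactly equal to the timeout as still belonging to the session (that boundary choice is an assumption):

```python
def sessionize(timestamps, timeout):
    """Group timestamps into sessions separated by gaps longer than `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # still inside the current burst of activity
        else:
            sessions.append([ts])     # inactivity exceeded the timeout: new session
    return sessions


print(sessionize([1, 2, 4, 20, 21, 50], timeout=5))
# [[1, 2, 4], [20, 21], [50]]
```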
26. © 2017 Hazelcast Inc.
2 – Late Events
• Wall-clock time vs. event time
• Event time – more natural, but the data arrives unordered
• Wait long or short? Latency and memory vs. correctness.
• A heuristic helps you assess window completeness
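One common form of such a heuristic is a watermark: assume no event arrives more than a fixed lag behind the newest event time seen so far, and close a window once the watermark passes its end. The sketch below is an illustration under that assumption, with tumbling windows and made-up names; it does not reproduce any particular engine's policy.

```python
from collections import defaultdict


def windowed_counts(event_times, size, allowed_lag):
    """Count events per tumbling event-time window, dropping late events.

    Returns (emitted, dropped, still_open), where emitted is a list of
    (window_start, count) in the order the windows were closed.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")
    for t in event_times:                      # arrival order, not event order
        start = (t // size) * size
        if start + size <= watermark:
            dropped.append(t)                  # its window was already closed
            continue
        open_windows[start] += 1
        watermark = max(watermark, t - allowed_lag)
        for s in sorted(w for w in open_windows if w + size <= watermark):
            emitted.append((s, open_windows.pop(s)))
    return emitted, dropped, dict(open_windows)


arrivals = [1, 3, 12, 8, 14, 25, 9]            # event times, out of order
print(windowed_counts(arrivals, size=10, allowed_lag=5))
# ([(0, 3), (10, 2)], [9], {20: 1})
print(windowed_counts(arrivals, size=10, allowed_lag=0))
# ([(0, 2), (10, 2)], [8, 9], {20: 1})
```

Waiting longer (lag 5) keeps the event at time 8 but holds window 0 open until the event at time 25 arrives; not waiting (lag 0) emits window 0 as soon as time 12 is seen and loses that event.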
27. © 2017 Hazelcast Inc.
3 - Fault Tolerance
• Streaming jobs can be running for a long time! What if
something crashes?
• System back-ups!
• It’s not that simple: you still need to coordinate the states of
different steps in the computation: Distributed snapshots
• Chandy-Lamport distributed snapshotting algorithm[1]
[1] - https://dl.acm.org/citation.cfm?id=214456
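The coordination idea can be sketched on a toy linear pipeline: the source injects a barrier marker into the stream, and when the operator sees it, the source offset and the operator state are saved together. This is a simplified, single-process illustration, not the full Chandy-Lamport algorithm (which also handles multiple input channels and in-flight messages) and not Jet's implementation. All names and the simulated crash are assumptions.

```python
BARRIER = object()  # marker separating "before snapshot" from "after snapshot"


def emit(data, offset, snapshot_every):
    """Replayable source: yields items and a (BARRIER, offset) marker periodically."""
    while offset < len(data):
        yield data[offset]
        offset += 1
        if offset % snapshot_every == 0:
            yield (BARRIER, offset)


def run(data, snapshot_every, snapshot=None, crash_after=None):
    """Sum the stream. Returns (total or None if crashed, last snapshot)."""
    offset, total = snapshot if snapshot else (0, 0)
    last_snapshot = snapshot
    processed = 0
    for item in emit(data, offset, snapshot_every):
        if isinstance(item, tuple) and item[0] is BARRIER:
            # Source position and operator state are captured at the same
            # logical point in the stream, so they are mutually consistent.
            last_snapshot = (item[1], total)
            continue
        if crash_after is not None and processed == crash_after:
            return None, last_snapshot  # simulated crash: in-memory state is lost
        total += item
        processed += 1
    return total, last_snapshot


data = list(range(1, 11))
_, snap = run(data, snapshot_every=3, crash_after=7)
print(snap)                                          # (6, 21)
print(run(data, snapshot_every=3, snapshot=snap))    # (55, (9, 45))
```

Restarting from the snapshot replays only items 7 to 10 and reaches the same total as an uninterrupted run, which is why a replayable source matters.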
28. © 2017 Hazelcast Inc.
Distributed snapshots
[Diagram: Reader and Writer stages; final frame: "Snapshot Done!"]
29. © 2017 Hazelcast Inc.
4 - Complexity
There are a lot of moving parts:
• Replayable data source (2x Kafka + 3x Zookeeper)
• Stream Processing Engine (2x)
• Cluster manager (YARN, Mesos, Kubernetes)
• Highly-Available Snapshot Storage (HDFS) (3x)
• Data sink for results
You Are Not Google, are you?
30. © 2017 Hazelcast Inc.
Jet Favours Speed
* Spark had all performance options turned on, including Tungsten. https://hazelcast.com/resources/jet-0-4-streaming-benchmark/
32. © 2017 Hazelcast Inc.
Jet Application Deployment Options
Embedded
• No separate process to manage
• Great for microservices
• Great for OEM
• Simplest for Ops – nothing extra
[Diagram: each Application embeds a Jet node and uses it through the Java API]
Client-Server
• Separate Jet Cluster
• Scale Jet independent of applications
• Isolate Jet from application server lifecycle
• Managed by Ops
[Diagram: each Application connects through a Java Client to a separate cluster of Jet nodes]
34. © 2017 Hazelcast Inc.
Flight Telemetry
https://github.com/hazelcast/hazelcast-jet-demos
35. © 2017 Hazelcast Inc.
Demo Applications
• Real-time Image Recognition
Recognizes images present in the webcam video input with a model trained
on the CIFAR-10 dataset.
• Twitter Cryptocurrency Sentiment Analysis
Twitter content is analyzed in real time to calculate a cryptocurrency
trend list with a popularity index.
• Real-Time Road Traffic Analysis And Prediction
Continuously computes linear regression models from current traffic. Uses
the trend from a week ago to predict current traffic.
• Real-time Sports Betting Engine
This is a simple example of a sports book and is a good introduction to
the Pipeline API. It also uses Hazelcast IMDG as an in-memory data store.
• Flight Telemetry
Reads a stream of ADS-B telemetry data from all commercial aircraft flying
anywhere in the world. There are typically 5,000 to 6,000 aircraft at any
point in time. The data is then filtered and aggregated, and certain
features are enriched and displayed in Grafana.
• Market Data Ingest
Uploads a stream of stock market data (prices) from a Kafka topic into an
IMDG map. Data is analyzed as part of the upload process, calculating
moving averages to detect buy/sell indicators. The input data here is
manufactured to ensure such indicators exist, but it is easy to reconnect
to real input.
• Markov Chain Generator
Generates a Markov Chain with probabilities based on supplied classical
books.
• Real-Time Trade Processing
Processes immutable events from an event bus (Kafka) to update storage
optimized for querying and reading (IMDG).
36. © 2017 Hazelcast Inc.
Hazelcast Jet. Try it!
• High Performance | Industry Leading Performance
• Works great with Hazelcast IMDG | Source, Sink, Enrichment
• Very simple to program | Leverages existing standards
• Very simple to deploy | embed 10MB jar or Client Server
• Works in every Cloud | Same as Hazelcast IMDG
• For Developers by Developers | Code it
37. © 2017 Hazelcast Inc.
Questions?
Version 0.6.1 is the current release, with 0.7 coming in September
and 1.0 planned for this year
http://jet.hazelcast.org
https://github.com/hazelcast/hazelcast-jet-demos/