Alongside the arrival of BigData, a parallel, less well-known but significant change has occurred in the way we process data: data is getting faster. Business models are changing radically based on the ability to be first to an insight and to act on it in time to keep the customer, prevent the breakdown or save the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk discusses the need for streaming, the business models it can change, the new applications it allows, and why Apache Flink enables these applications. Apache Flink is a top-level Apache project for real-time stream processing at scale. It is a high-throughput, low-latency, fault-tolerant, distributed, stateful stream processing engine. Flink has polyglot APIs (Scala, Python, Java) for manipulating streams, a Complex Event Processing (CEP) library for monitoring and alerting on the streams, and integration points with other big data ecosystem tooling.
7. The calm before the storm
Batch Processing (High Latency, inability to reason about time)
Coupled systems prevented fast delivery of single change requirements
Processing large distributed data
Messaging incorporated business logic (Service Bus)
Customers demanded immediate insight/action
Event Ordering/Timing, Consistency, Data Lineage
Lack of Fault Tolerant Systems
11. Ref: Michael Hammer - Harvard Business Review 1990
“We cannot achieve breakthroughs in
performance by cutting fat or automating
existing processes. Rather, we must
challenge old assumptions and shed the
old rules that made the business
underperform in the first place.”
12. Ref: Michael Hammer - Harvard Business Review 1990
“These rules of work design are based on
assumptions about technology, people,
and organisational goals that no longer
hold”
19. flowing from a to a
Any event that happens internal or external to your
company is fair game for inclusion in a stream!
WHAT ARE STREAMS?
Unbounded Events Producer Consumer
21. When did you last drop a DVD back to your video store?
Convenience of streaming films won out
22. Anyone using Dublin Bus still carry a timetable?
Realtime with Context is needed...
23. SOME OTHER COMMON STREAM EXAMPLES
Log files
User website clicks
Finance stocks
Social media streams
24. Ideal Stream Characteristics
Low Latency (Time required to produce some result)
High Throughput (Number of results produced in time)
Persisted for reuse
Fault Tolerant
Scalable Event Production (i.e. Partitioning)
Scalable Event Consumption (i.e. Consumer Groups)
Consumer manages state (offsets)
Handle Back Pressure
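The partitioning and consumer-managed offsets listed above can be sketched in a few lines of plain Python. This is an illustrative toy, not the API of Kafka or any other tool: `PartitionedLog`, `Consumer`, `poll` and `seek` are hypothetical names chosen for the example, and everything lives in memory.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Append-only, in-memory log split into a fixed number of partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash keeps every event for one key in the same partition,
        # which preserves per-key ordering while spreading load across keys.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def read(self, partition, offset, max_events):
        # Reading never removes data, so the log stays available for reuse.
        return self.partitions[partition][offset:offset + max_events]


class Consumer:
    """Tracks its own offset per partition; the log knows nothing about readers."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition, max_events=10):
        start = self.offsets[partition]
        batch = self.log.read(partition, start, max_events)
        self.offsets[partition] = start + len(batch)
        return batch

    def seek(self, partition, offset):
        # Moving the offset back replays earlier events.
        self.offsets[partition] = offset
```

Because the consumer owns its offsets, several independent consumers can read the same partition at different speeds, and any of them can call `seek` to reprocess history without affecting the others.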
25. Benefits of streams
Ability to augment and enrich data streams
Duality of Streams and Tables (Only Streams Work)
Replay from a defined offset
Stream outputs can become stream inputs (unix pipes!)
Data first - Processing Later (Fast feature creation)
Stream your monitoring (Logs, Ops Metrics, Business KPI etc.)
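The "stream outputs can become stream inputs" idea, together with enrichment, can be illustrated with Python generators, where each stage lazily consumes the previous one much like a unix pipe. The event shape, the `profiles` lookup table and the function names are assumptions made up for this sketch.

```python
def parse(lines):
    """Turn raw 'user,page' lines into click events."""
    for line in lines:
        user, page = line.strip().split(",")
        yield {"user": user, "page": page}


def enrich(events, profiles):
    """Augment each event with reference data looked up by user."""
    for event in events:
        yield {**event, "country": profiles.get(event["user"], "unknown")}


def only_country(events, country):
    """Filter stage: keep events for a single country."""
    for event in events:
        if event["country"] == country:
            yield event


raw = ["alice,/home", "bob,/pricing", "alice,/checkout"]
profiles = {"alice": "IE", "bob": "DE"}

# Each stage consumes the previous stage's output: parse | enrich | filter.
pipeline = only_country(enrich(parse(raw), profiles), "IE")
result = list(pipeline)
# result holds the two enriched events for alice
```

Nothing is computed until `list(pipeline)` pulls events through, so the same stages would work unchanged on an unbounded source.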
26. Benefits of streams contd.
Location in Time Testing (Bugs In Code)
Replication for Scale
Cross/Join prior unrelated sources (i.e. Time, Context - Analytics)
Point of Record Stream (produce suitable Materialized Views)
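A minimal sketch of how a point-of-record stream can produce a materialized view, and how replaying only part of it supports location-in-time testing. The account events and the `materialize` helper are hypothetical, and the stream is a plain Python list here.

```python
def materialize(events, upto=None):
    """Fold a stream of (account, delta) events into a balance table.

    `upto` limits the replay to the first `upto` events, which rebuilds the
    table exactly as it stood at that point in the stream.
    """
    view = {}
    for account, delta in events[:upto]:
        view[account] = view.get(account, 0) + delta
    return view


events = [("acc-1", 100), ("acc-2", 50), ("acc-1", -30), ("acc-2", 25)]

current = materialize(events)          # {"acc-1": 70, "acc-2": 75}
as_of_2 = materialize(events, upto=2)  # {"acc-1": 100, "acc-2": 50}
```

Since the view is derived entirely from the stream, a bug found in production can be reproduced by replaying to the offset where the state first went wrong.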
27. MOST POPULAR STREAMING TOOLS
Apache Kafka
Amazon Kinesis - Based on Kafka Ideas
MapR Streams - Uses Kafka API (adds resilience features)
33. 8 Requirements of a Real-Time Stream Processing Engine
(Michael Stonebraker)
1. Keep the data moving
2. Query using SQL on Stream
3. Handle Stream Imperfections (Delayed, Missing, Out-of-Order Data)
4. Generate Predictable Outcomes
5. Integrate Stored and Streaming Data
6. Guarantee Data Safety and Availability
7. Partition and Scale Applications Automatically
8. Process and Respond Instantaneously
35. Stream Processing Engine - Use Cases
Lineage, Auditing, History (Immutable)
Internet of Things (Sensor data)
Realtime Monitoring (Failure Prevention)
Autonomous Cars
Fraud/Anomaly Detection
Health devices (Fitbit, cardiac pacemakers, etc.)
For System of record (Infinite persistence)
Digital Marketing
Network monitoring
Realtime pricing / analytics
36. Stream Processing Engine - Use Cases Contd...
Intelligence and Surveillance
Risk management (Realtime Asset Coverage)
E-commerce (Realtime customer retention)
Fraud detection (Card, Insurance)
Smart order routing
Transaction cost analysis
Pricing and analytics
Market data management
Algorithmic trading
Data warehouse augmentation
37. Streaming does not mandate BigData
Streaming does not mandate RealTime processing
...but many application types may mandate either or both
44. WAIT! Let's clear a few things up...
Pipelining & Backpressure
Time Semantics (Event, Ingestion, Processing etc.)
Windows (count, rolling, session, custom)
Watermarks, Triggers (Inserted into stream)
Checkpoints (Async Recovery - Choice of state store backend)
"Exactly Once" semantics (no need to question if fail on send, process, return?)
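To make event time, windows and watermarks concrete, here is a simplified sketch in plain Python. It is not the Flink API: `tumbling_windows` is a hypothetical helper, and the watermark is assumed to be the largest event time seen so far minus a fixed allowed lateness, which is only one of several possible watermark strategies.

```python
from collections import defaultdict


def tumbling_windows(events, size, max_lateness):
    """Count events per event-time window of length `size`.

    events: iterable of (event_time, payload) in arrival order.
    A window is emitted once the watermark passes its end; events that
    arrive for an already closed window are reported as dropped.
    """
    counts = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, payload in events:
        start = (event_time // size) * size
        if start + size <= watermark:
            # The window this event belongs to has already been closed.
            dropped.append((event_time, payload))
        else:
            counts[start] += 1
        # Watermark: "no event older than this is expected any more".
        watermark = max(watermark, event_time - max_lateness)
        for w in sorted(w for w in counts if w + size <= watermark):
            emitted.append((w, counts.pop(w)))
    # End of input: flush the windows that are still open.
    for w in sorted(counts):
        emitted.append((w, counts[w]))
    return emitted, dropped


arrivals = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (15, "e"), (27, "f"), (8, "g")]
emitted, dropped = tumbling_windows(arrivals, size=10, max_lateness=5)
# emitted == [(0, 3), (10, 2), (20, 1)]
# dropped == [(8, "g")]
```

Event `(3, "d")` arrives out of order but is still counted in window 0 because the watermark has not yet reached 10, whereas `(8, "g")` arrives after that window has closed and is dropped. Results depend on when events happened, not on when they were processed.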
45. Apache Flink - Features out of the box!
Support for Event Time and Out-of-Order Events
Exactly-once Semantics for Stateful Computations
Highly flexible Streaming Windows & CEP
Continuous Streaming Model with Backpressure (Buffers)
Fault-tolerance via Lightweight Distributed Snapshots
One Runtime for Streaming and Batch Processing
Memory Management & Custom Serialization
Iterations and Delta Iterations
Program Optimizer
SQL (Batch and Streams) due soon in 1.1
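The link between snapshots and exactly-once stateful computation can be sketched for a single operator: if the state and the input position are captured together, recovery restores both and replays from that position, so every event affects the state once. This single-process toy only illustrates the idea; Flink's distributed snapshot mechanism is considerably more involved, and `CountingOperator` is a hypothetical name.

```python
import copy


class CountingOperator:
    """Counts occurrences of each item read from a replayable source."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # position of the next input item to read

    def process(self, source, n):
        # Consume up to n items, updating state and position in step.
        for item in source[self.offset:self.offset + n]:
            self.counts[item] = self.counts.get(item, 0) + 1
            self.offset += 1

    def snapshot(self):
        # State and input position are captured as one consistent unit.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snap):
        self.counts = dict(snap["counts"])
        self.offset = snap["offset"]


source = ["a", "b", "a", "c", "a"]

op = CountingOperator()
op.process(source, 2)
checkpoint = op.snapshot()
op.process(source, 2)  # work done after the checkpoint is lost in the "crash"

recovered = CountingOperator()
recovered.restore(checkpoint)
recovered.process(source, len(source))
# recovered.counts == {"a": 3, "b": 1, "c": 1}
```

The design choice that matters is that the source can be replayed from a given offset; without that, restoring state alone would either lose or double-count the events processed after the checkpoint.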
46. But I'm only here for the Machine Learning and Graph Processing!!...
47. Machine Learning in Flink with FlinkML
* Apache Samoa Project - Streaming Machine Learning that works on top of Flink
** Apache Mahout - Batch based Machine Learning that works on top of Flink
55. Supported Distributions / Deployment Options
Hortonworks - Ambari Service (Confirmed full support on the way)
Cloudera - Not Supported to my knowledge (Discussion forums ref BigTop)
MapR - Not part of their MapR converged data platform
Amazon EMR (YARN - Single Instance, Session)
Google Compute Engine (YARN Support & Hosted Competitor -> Cloud Dataflow)
Via Apache Myriad on Mesos (Native support coming in 1.2)
64. What's Next For Flink?
Queryable State (Database inversion! Kafka log, RocksDB)
Release of 1.1+
Dynamic Scaling, Resource Elasticity (i.e. for catchup)
Production Hardening (1,000-node cluster at Alibaba)
Stream SQL (Apache Calcite)
CEP Enhancements (large sized async state snapshotting)
Mesos Support
More Connectors
API enhancements (joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
65. Email: john.gorman@amberhand.ie
LinkedIn: johnpgorman
THANK YOU
ACKNOWLEDGEMENTS
Bank Of Ireland - Event and Venue
Hadoop User Group Ireland - Community Building
Data Artisans - Images, Code and Community Support
Anne Ebeling - Dublin Artwork
66. RESOURCES
APACHE FLINK
APACHE FLINK TRAINING
CEP MONITORING IN FLINK
RUNNING FLINK ON YARN
STREAMING 101 BY TYLER AKIDAU
STREAMING 102 BY TYLER AKIDAU
MAPR FREE EBOOK ON STREAMING ARCHITECTURE
TAXI STREAM EXAMPLE
BACK PRESSURE
CEP SAMPLE