2. CONTENTS
• The Apache Flink Meetup Community
• What is Apache Flink?
• The Dataflow Programming Model
• Who is using Apache Flink?
• Last Year’s Talks (2017-2018)
• What’s new with Flink v.1.6.0 & What’s in store?
• Upcoming Meetups & more
10/9/18 | Dr. Christos Hadjinikolis | Senior ML Engineer | Data Reply UK
4. THE APACHE FLINK MEETUP COMMUNITY
• … around since 2016
• … a group of enthusiasts, excited about Flink’s potential
• … since then we have successfully run 17 meetups
• … sponsors: Data Reply UK
• … size of the community?
5. THE APACHE FLINK MEETUP COMMUNITY
• 500+ members!
• ~steady growth rate
• volatile active participation
7. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
8. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … provides a standardised way to build and deploy applications.
9. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … a computer system (cluster) that uses more than one computer to run an application.
10. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … this won’t be a single sentence!
12. STATEFUL VS STATELESS COMPUTATIONS
State in stream processing can be thought of as memory in operators:
• remembers information about past input;
• can be used to influence the processing of future input;
• … quite like a Markov Chain
13. STATEFUL VS STATELESS COMPUTATIONS
Stateless Example:
• Consider a source stream that emits events with schema:
e = {event_id:int, event_value:int}
• Our goal is, for each event, to extract and output the event_value.
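This stateless pipeline can be sketched in plain Python (a conceptual illustration only; the function and field names are ours, not Flink's DataStream API):

```python
# Conceptual sketch of a stateless source -> map -> sink pipeline
# (illustration only; not Flink's actual API).

def stateless_map(events):
    """Extract event_value from each event; no memory of past input."""
    return [e["event_value"] for e in events]

events = [
    {"event_id": 1, "event_value": 10},
    {"event_id": 2, "event_value": 7},
    {"event_id": 3, "event_value": 12},
]
print(stateless_map(events))  # [10, 7, 12]
```

Each event is processed on its own, with no context from previous events.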
14. STATEFUL VS STATELESS COMPUTATIONS
Stateless Example:
• Consider a source stream that emits events with schema:
e = {event_id:int, event_value:int}
• Our goal is to output the event_value only if it is larger than the value from the previous event.
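The stateful variant can be sketched the same way; the operator now carries one piece of state, the previous event_value (plain Python, not Flink's state API; emitting the first event, which has no predecessor, is our design choice):

```python
# Conceptual sketch of the stateful variant: the operator remembers the
# previous event_value to decide whether to emit the current one.

def stateful_filter(events):
    out = []
    prev = None  # the operator's state
    for e in events:
        v = e["event_value"]
        if prev is None or v > prev:  # first event has no predecessor: emit it
            out.append(v)
        prev = v  # update state for the next event
    return out

events = [
    {"event_id": 1, "event_value": 10},
    {"event_id": 2, "event_value": 7},
    {"event_id": 3, "event_value": 12},
]
print(stateful_filter(events))  # [10, 12]
```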
15. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … memory in operators
16. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
17. WHAT IS APACHE FLINK?
Flink core is a streaming data flow
engine that provides:
• data distribution,
• communication, and
• fault tolerance
for distributed computations over
data streams
19. LEVELS OF ABSTRACTION
• Flink offers different levels of abstraction to develop streaming/batch
applications.
20. PROGRAMS &
DATAFLOWS
The basic building blocks of Flink
programs are:
• streams, and
• transformations.
21. PARALLEL
DATAFLOWS
• Programs in Flink are inherently
parallel and distributed.
• During execution, a stream has one
or more stream partitions, and
each operator has one or
more operator subtasks.
22. WINDOWS
• Aggregating events (e.g., counts, sums) works differently on streams than in batch
processing.
• Stream data is unbounded, so aggregates must be scoped by windows.
• Windows can be time driven (example: every 30 seconds) or data driven (example:
every 100 elements).
Types of windows:
• tumbling windows (no overlap),
• sliding windows (with overlap), and
• session windows (punctuated by a gap of inactivity).
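The data-driven variants can be sketched in plain Python (a conceptual illustration; Flink provides these through its window API, not like this):

```python
# Conceptual sketch of count-based (data-driven) windows.

def tumbling(elements, size):
    """Non-overlapping windows of `size` elements."""
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def sliding(elements, size, slide):
    """Overlapping windows: a new window starts every `slide` elements."""
    return [elements[i:i + size]
            for i in range(0, len(elements) - size + 1, slide)]

data = [1, 2, 3, 4, 5, 6]
print(tumbling(data, 2))   # [[1, 2], [3, 4], [5, 6]]
print(sliding(data, 3, 1)) # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Time-driven windows work analogously, grouping by timestamp ranges instead of element counts.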
23. TIME
Different notions of
time:
• Event Time
• Ingestion Time
• Processing Time
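The three notions can be illustrated for a single record (plain Python; the field names are ours):

```python
# Conceptual sketch of the three notions of time for one record.
import time

# Event time: stamped by the producer when the event was created.
record = {"value": 42, "event_time": 1538900000.0}

ingestion_time = time.time()   # when the record enters the dataflow at the source
# ... record travels through downstream operators ...
processing_time = time.time()  # local clock of the operator doing time-based work

# Event-time windows give reproducible results even when records arrive
# late or out of order; processing-time windows depend on arrival order.
```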
24. STATEFUL OPERATIONS
• Some operations in a dataflow simply look at
one individual event at a time.
• Other operations remember information
across multiple events (for example window
operators). These operations are
called stateful.
• The state of stateful operations is maintained
in what can be thought of as an embedded
key/value store.
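The embedded key/value store idea can be sketched in plain Python as a per-key running count, where each state access touches only the current event's key (an illustration, not Flink's keyed-state API):

```python
# Conceptual sketch of keyed state: per-key counts held in what amounts
# to an embedded key/value store. In Flink this state is partitioned and
# distributed together with the keyed stream.
from collections import defaultdict

def keyed_count(events):
    state = defaultdict(int)  # key -> count
    out = []
    for key, value in events:
        state[key] += 1  # state access is local to the current event's key
        out.append((key, state[key]))
    return out

stream = [("a", 1), ("b", 5), ("a", 2)]
print(keyed_count(stream))  # [('a', 1), ('b', 1), ('a', 2)]
```

Keeping all updates local to one key is what lets Flink guarantee consistency without transaction overhead.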
26. WHO IS USING APACHE FLINK?
• Alibaba, the world's largest retailer,
uses a fork of Flink called Blink to
optimize search rankings in real time.
• eBay's monitoring platform is
powered by Flink and evaluates
thousands of customizable alert rules
on metrics and log streams.
• Huawei is a leading global provider
of ICT infrastructure and smart
devices. Huawei Cloud provides
Cloud Service based on Flink.
• Uber built their internal SQL-based,
open-source streaming analytics
platform AthenaX on Apache Flink.
27. WHO IS USING APACHE FLINK?
Apache Flink® user survey by dataArtisans
• Enterprises are investing heavily in stream
processing technology
• 87% planning to deploy more applications
powered by Apache Flink software in 2018
• 64% Machine Learning
o 34% Model Scoring
o 30% Model Training
• 27% Anomaly Detection/System Monitoring
• 25% Business Intelligence
“… the ability to react to data in the moment is
becoming a top priority among enterprises of all
sizes”
29. LAST YEAR’S TALKS (2017-18)
• Aris Koliopoulos & Alex Garella – “Panta Rhei: designing distributed
applications with streams.”
• Patrick Lucas, giving a lightning talk on “Best practices around Flink state types
(List/Map/ValueState etc).”
• Stavros Kontopoulos with “Let’s talk ML on Flink”
• Stephan Ewen (CTO & Co-Founder of Data Artisans), presenting “Stream SQL
and Realtime Applications with Apache Flink”
30. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
• DriveTribe: a digital automotive community platform
founded by, and featuring content from The Grand Tour
presenters
• Users consume feeds and interact with a variety of
content: videos, images, articles
• Problem: they wanted a scalable way to produce
personalised rankings of articles for users.
31. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
What they tried:
1. Stored data in a DB and computed the aggregates on the fly
o Was very slow (high read time) and didn’t scale.
2. Tried computing aggregations at write time with the intention of reducing read time:
one read can fetch all views at once
o Not fault tolerant; if one read fails, they all fail.
o What about state mutations on the read data?
32. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
Solution: Treat event streams as source of truth for applications—a powerful alternative
to using RPCs, Enterprise Messaging or a Shared Database to communicate and share
data across different applications or microservices
1. Clients send events to the API
(John liked Jeremy’s post)
2. Events are immutable; they
capture a certain action at some
point in time
3. Every application state instance
can be modelled as a projection
of those events
33. BEST PRACTICES AROUND FLINK STATE TYPES
(LIST/MAP/VALUESTATE)
PATRICK LUCAS
Different types of Managed
States:
• ValueState<T>
• ListState<T>
• ReducingState<T>
• AggregatingState<IN, OUT>
• FoldingState<T, ACC>
• MapState<UK, UV>
• “The cost of very frequent updates
(serialisation/deserialisation)” … illustrated how we
can make use of transient variables to reduce it.
• “When to use ReducingState vs AggregatingState
vs FoldingState?”
Also, discussed the beta version of Queryable State.
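The semantic difference between ReducingState and AggregatingState can be sketched in plain Python (the class names and the average accumulator are illustrative, not Flink's actual interfaces):

```python
# Conceptual sketch of ReducingState vs AggregatingState semantics.

class ReducingStateSketch:
    """Keeps a single value of the same type as the inputs."""
    def __init__(self, reduce_fn):
        self.reduce_fn, self.value = reduce_fn, None
    def add(self, v):
        self.value = v if self.value is None else self.reduce_fn(self.value, v)

class AggregatingStateSketch:
    """Accumulator (and output) type may differ from the input type."""
    def __init__(self):
        self.total, self.count = 0, 0   # accumulator: (sum, count)
    def add(self, v):
        self.total += v
        self.count += 1
    def get(self):
        return self.total / self.count  # output: an average, a different type

r = ReducingStateSketch(lambda a, b: a + b)
for v in (1, 2, 3):
    r.add(v)
a = AggregatingStateSketch()
for v in (1, 2, 3):
    a.add(v)
print(r.value, a.get())  # 6 2.0
```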
34. LET’S TALK ML ON FLINK
STAVROS KONTOPOULOS
• How about running model serving
“natively”, inside the Flink server?
• How? Use dynamically controlled
stream approach—models are
delivered to running
implementation via model’s
stream and dynamically
instantiated for usage.
Proposition: build a streaming system that allows models to be updated without interrupting
execution
35. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
SQL was not designed for streams:
• Relations are bounded (multi-)sets while streams are infinite
sequences
• DBMS can access all data while streaming data arrives over time
• SQL queries return a result and end while streaming queries
continuously emit results and never end
36. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
DBMS run queries on streams all the time!
• Materialised Views (MVs) are used to speed up analytical queries
• They need to update when tables change
• MV maintenance is very similar to stream processing:
• Table updates are a stream of statements
• MV definitions (queries) are evaluated (continuously) on that stream
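The analogy can be sketched in plain Python: a stream of table-update statements incrementally maintains a clicks-per-user view (the view definition and names are illustrative):

```python
# Conceptual sketch: maintaining a materialised view (clicks per user)
# from a stream of table-update statements -- the analogy between
# MV maintenance and stream processing.

def maintain_view(view, update):
    """Apply one table-update statement to the materialised view."""
    op, user = update
    if op == "INSERT":
        view[user] = view.get(user, 0) + 1
    elif op == "DELETE":
        view[user] = view.get(user, 0) - 1
    return view

view = {}
for upd in [("INSERT", "alice"), ("INSERT", "bob"),
            ("INSERT", "alice"), ("DELETE", "bob")]:
    maintain_view(view, upd)
print(view)  # {'alice': 2, 'bob': 0}
```

The MV definition (the query) is evaluated continuously on the update stream, exactly as a streaming query would be.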
37. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
What about windows?
39. WHAT’S NEW WITH FLINK V.1.6.0
• Simplifying Apache Flink’s state with the addition of
native support for state TTL.
• Further improvements to the Streaming SQL CLI, including
simplifying the execution of streaming and batch
queries against different data sources
• Improved Flink connectors allowing better integration with
external systems.
40. WHAT’S IN STORE?
• Integration of SQL and CEP
• Unified checkpoints and savepoints
• An improved Flink deployment and process model
• Fine-grained recovery from task failures
• An SQL Client to execute SQL queries against batch and streaming tables.
• Serving of machine learning models.
42. CLICKSTREAM PROCESSING AT THE FINANCIAL
TIMES
The Financial Times (FT) processes millions of customer
events per day. The ability to monitor such events in real-
time is crucial for attracting new customers, monitoring
the popularity of articles and personalising experiences.
In this talk, the FT team will show us:
• how they use Flink to process their clickstream;
• how they operate the pipeline using Docker Swarm in
AWS;
• how they keep secrets safe using Vault, and
• how they monitor it with Prometheus and Grafana.
43. LIGHTNING TALKS
• Give back to the community!
• Have an idea you want to discuss?
• Have done work you want to talk about?
• Found out about a new concept and want to present it?
Come, do a lightning talk!
15 mins of pure excitement & passion!
Editor's Notes
We started off in 2016
We are excited about its potential, and we want to find other people who are interested. Apache Flink is a 'streaming first' data processing engine
… active participation is something that we want to change in the future (we will discuss this further around the end of this presentation)
This includes parallel processing in which a single computer uses more than one CPU to execute programs.
At a high level, we can consider state in stream processing as memory in operators that remembers information about past input and can be used to influence the processing of future input.
… like a Markov Chain: A Markov chain is "a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event"
In contrast, operators in stateless stream processing only consider their current inputs, without further context and knowledge about the past.
A simple example to illustrate this difference: let us consider a source stream that emits events with schema e = {event_id:int, event_value:int}.
Our goal is, for each event, to extract and output the event_value.
We can easily achieve this with a simple source-map-sink pipeline, where the map function extracts the event_value from the event and emits it downstream to an outputting sink.
This is an instance of stateless stream processing.
But what if we want to modify our job to output the event_value only if it is larger than the value from the previous event?
In this case, our map function obviously needs some way to remember the event_value from a past event — and so this is an instance of stateful stream processing.
This example should demonstrate that state is a fundamental, enabling concept in stream processing that is required for a majority of interesting use cases.
There are of course, more complex states such as keeping a state-machine for detecting patterns for fraudulent financial transactions or holding a model for some machine learning application
Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream.
Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested.
Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations.
Flink is a layered system. The different layers of the stack build on top of each other and raise the abstraction level of the program representations they accept:
The runtime layer receives a program in the form of a JobGraph.
Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while the DataStream API uses a stream builder.
The JobGraph is executed according to a variety of deployment options available in Flink (e.g., local, remote, YARN (resource management and job scheduling), etc.)
Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
The lowest level abstraction simply offers stateful streaming. It is embedded into the DataStream API via the Process Function. It allows users to freely process events from one or more streams, and use consistent fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to realize sophisticated computations.
In practice, most applications would not need the lowest level abstraction, but would instead program against the Core APIs like the DataStream API (bounded/unbounded streams) and the DataSet API (bounded data sets).
These APIs offer the common building blocks for data processing, like various forms of user-specified transformations, joins, aggregations, windows, state, etc.
The Table API is a declarative Domain Specific Language centered around tables
One can seamlessly convert between tables and DataStream/DataSet, allowing programs to mix the Table API with the DataStream and DataSet APIs.
The highest level of abstraction offered by Flink is SQL. This abstraction is similar to the Table API both in semantics and expressiveness, but represents programs as SQL query expressions.
Conceptually a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.
When executed, Flink programs are mapped to streaming dataflows, consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends in one or more sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs).
The operator subtasks are independent of one another, and execute in different threads and possibly on different machines or containers.
Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because streams are in general infinite (unbounded). Instead, aggregates on streams (counts, sums, etc), are scoped by windows, such as “count over the last 5 minutes”, or “sum of the last 100 elements”
When referring to time in a streaming program (for example to define windows), one can refer to different notions of time:
Event Time is the time when an event was created. It is usually described by a timestamp in the events, for example attached by the producing sensor, or the producing service. Flink accesses event timestamps via timestamp assigners.
Ingestion time is the time when an event enters the Flink dataflow at the source operator.
Processing Time is the local time at each operator that performs a time-based operation.
While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful.
The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on keyed streams, after a keyBy() function, and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
https://flink.apache.org/poweredby.html
Enterprises are investing heavily in stream processing technology, according to the second annual Apache Flink® user survey data Artisans announced: the vast majority (87 percent) of organizations surveyed are planning to deploy more applications powered by Apache Flink software in 2018. Of dozens of new application types developers are building or planning to build, machine learning (64 percent) both for model scoring (34 percent) and model training (30 percent), anomaly detection/system monitoring (27 percent) and business intelligence/reporting (25 percent) are the most popular, followed by recommendation/decisioning engines (22 percent) and security/fraud detection (19 percent), to round out the top five.
Most respondents (70 percent) say their team or department is growing and hiring in 2018. Nearly as many (59 percent) expect their team or departmental budget to increase.
Drawing on these insights it seems like the ability to react to data in the moment is becoming a top priority among enterprises of all sizes
A pattern where replayable logs, like Apache Kafka, are used for both communication as well as event storage, incorporating the retentive properties of a database in a system designed to share data across many teams, clouds and geographies.
ValueState<T>: This keeps a value that can be updated and retrieved
ListState<T>: This keeps a list of elements. You can append elements and retrieve an Iterable
ReducingState<T>: This keeps a single value that represents the aggregation of all values added to the state.
AggregatingState<IN, OUT>: Contrary to ReducingState, the aggregate type may be different from the type of elements that are added to the state.
FoldingState<T, ACC>: Same as AggregatingState but here values are folded into an aggregate using a specified FoldFunction.
MapState<UK, UV>: This keeps a list of mappings.
Machine Learning/Deep Learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams). This allows for better latency and independence of external services.
However, direct deployment of models is not always a feasible approach. Sometimes it makes sense or is needed to deploy a model in another serving infrastructure like TensorFlow Serving for TensorFlow models.
Model inference is then done via Remote Procedure Calls/Request-Response communication.
Organizational or technical reasons might force this approach.
Stavros suggested running model serving natively, in this case inside the Flink server.
Use dynamically controlled stream approach—models are delivered to running implementation via model’s stream and dynamically instantiated for usage.
… as new events come through the live event stream, we’re able to evaluate them against the newly-added models (or rules).
DBMS run queries on streams all the time!
Materialised Views (MVs) are used to speed up analytical queries
They need to update when tables change
MV maintenance is very similar to stream processing:
Table updates are a stream of statements
MV definitions (queries) are evaluated (continuously) on that stream
What is a materialised view?
Whenever a query or an update addresses an ordinary view's virtual table, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete ("materialized") table (rather than a view as such) that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of extra storage and of some data being potentially out-of-date.
Core concept is a dynamic table which change over time
Queries on dynamic tables produce new dynamic tables which are updated based on input and do not terminate
In the figure you can see the process of dynamic table conversion
Number of clicks in the last hour
Simplifying Apache Flink’s state with the addition of native support for state TTL (Time to Live). This feature allows state to be cleaned up after it has expired. With Flink 1.6.0 timer state can now go out of core by storing the relevant state in RocksDB. Moreover, the team improved the deletion of timers significantly.
Support for resource elasticity and different deployment scenarios (such as better container integration). Flink 1.6.0 comes with HTTP/REST based external communications and job submissions as well as a container entrypoint for simplified bootstrapping of containerized job clusters.
Further improvements to the Streaming SQL CLI, including simplifying the execution of streaming and batch queries against different data sources, adding full Avro support for easily reading any kind of Avro data, and hardening Flink’s CEP library to handle significantly larger state sizes compared to past versions.
Improved Flink connectors allowing better integration with external systems. The additions to Flink 1.6.0 include a new StreamingFileSink that replaces the BucketingSink as the standard file sink from previous versions, support for ElasticSearch 6.x, and different AvroDeserializationSchemas to seamlessly ingest Avro data.
Integration of SQL and CEP, as described in FLIP-20 to allow developers to create complex event processing (CEP) patterns using SQL statements.
Unified checkpoints and savepoints, as described in FLIP-10, to allow savepoints to be triggered automatically–important for program updates for the sake of error handling because savepoints allow the user to modify both the job and Flink version whereas checkpoints can only be recovered with the same job.
An improved Flink deployment and process model, as described in FLIP-6, to allow for better integration with Flink and cluster managers and deployment technologies such as Mesos, Docker, and Kubernetes.
Fine-grained recovery from task failures, as described in FLIP-1 to improve recovery efficiency and only re-execute failed tasks, reducing the amount of state that Flink needs to transfer on recovery.
An SQL Client, as described in FLIP-24 to add a service and a client to execute SQL queries against batch and streaming tables.
Serving of machine learning models, as described in FLIP-23 to add a library that allows users to apply offline-trained machine learning models to data streams.