2. CONTENTS
• The Apache Flink Meetup Community
• What is Apache Flink?
• The Dataflow Programming Model
• Who is using Apache Flink?
• Last Year’s Talks (2017-2018)
• What’s new with Flink v.1.6.0 & What’s in store?
• Upcoming Meetups & more
10/9/18 | Dr. Christos Hadjinikolis | Senior ML Engineer | Data Reply UK
4. THE APACHE FLINK MEETUP COMMUNITY
• … around since 2016
• … a group of enthusiasts, excited about Flink’s potential
• … since then we have successfully run 17 meetups
• … sponsors: Data Reply UK
• … size of the community?
5. THE APACHE FLINK MEETUP COMMUNITY
• 500+ members!
• ~steady growth rate
• volatile active participation
7. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
8. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … provides a standardised way to build and deploy applications.
9. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … a computer system (cluster) that uses more than one computer to run an application.
10. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … this won’t be a single sentence!
12. STATEFUL VS STATELESS COMPUTATIONS
State in stream processing can be thought of as memory in operators:
• remembers information about past input;
• can be used to influence the processing of future input;
• … quite like a Markov Chain
13. STATEFUL VS STATELESS COMPUTATIONS
Stateless Example:
• Consider a source stream that emits events with schema:
e = {event_id:int, event_value:int}
• Our goal is, for each event, to extract and output the event_value.
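This stateless pipeline can be sketched in plain Python (a conceptual illustration only; the function and field names are ours, not Flink's DataStream API):

```python
# Conceptual sketch of a stateless source -> map -> sink pipeline
# (illustration only; not Flink's actual API).

def stateless_map(events):
    """Extract event_value from each event; no memory of past input."""
    return [e["event_value"] for e in events]

events = [
    {"event_id": 1, "event_value": 10},
    {"event_id": 2, "event_value": 7},
    {"event_id": 3, "event_value": 12},
]
print(stateless_map(events))  # [10, 7, 12]
```

Each event is processed on its own, with no context from previous events.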
14. STATEFUL VS STATELESS COMPUTATIONS
Stateless Example:
• Consider a source stream that emits events with schema:
e = {event_id:int, event_value:int}
• Our goal is to output the event_value only if it is larger than the value from the previous event.
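The stateful variant can be sketched the same way; the operator now carries one piece of state, the previous event_value (plain Python, not Flink's state API; emitting the first event, which has no predecessor, is our design choice):

```python
# Conceptual sketch of the stateful variant: the operator remembers the
# previous event_value to decide whether to emit the current one.

def stateful_filter(events):
    out = []
    prev = None  # the operator's state
    for e in events:
        v = e["event_value"]
        if prev is None or v > prev:  # first event has no predecessor: emit it
            out.append(v)
        prev = v  # update state for the next event
    return out

events = [
    {"event_id": 1, "event_value": 10},
    {"event_id": 2, "event_value": 7},
    {"event_id": 3, "event_value": 12},
]
print(stateful_filter(events))  # [10, 12]
```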
15. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
• … memory in operators
16. WHAT IS APACHE FLINK?
Apache Flink is a framework and distributed processing engine for
stateful computations over unbounded and bounded data streams
17. WHAT IS APACHE FLINK?
Flink core is a streaming data flow
engine that provides:
• data distribution,
• communication, and
• fault tolerance
for distributed computations over
data streams
19. LEVELS OF ABSTRACTION
• Flink offers different levels of abstraction to develop streaming/batch
applications.
20. PROGRAMS &
DATAFLOWS
The basic building blocks of Flink
programs are:
• streams, and
• transformations.
21. PARALLEL
DATAFLOWS
• Programs in Flink are inherently
parallel and distributed.
• During execution, a stream has one
or more stream partitions, and
each operator has one or
more operator subtasks.
22. WINDOWS
• Aggregating events (e.g., counts, sums) works differently on streams than in batch
processing.
• Stream data is unbounded, so aggregates must be scoped by windows.
• Windows can be time driven (example: every 30 seconds) or data driven (example:
every 100 elements).
Types of windows:
• tumbling windows (no overlap),
• sliding windows (with overlap), and
• session windows (punctuated by a gap of inactivity).
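The data-driven variants can be sketched in plain Python (a conceptual illustration; Flink provides these through its window API, not like this):

```python
# Conceptual sketch of count-based (data-driven) windows.

def tumbling(elements, size):
    """Non-overlapping windows of `size` elements."""
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def sliding(elements, size, slide):
    """Overlapping windows: a new window starts every `slide` elements."""
    return [elements[i:i + size]
            for i in range(0, len(elements) - size + 1, slide)]

data = [1, 2, 3, 4, 5, 6]
print(tumbling(data, 2))   # [[1, 2], [3, 4], [5, 6]]
print(sliding(data, 3, 1)) # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Time-driven windows work analogously, grouping by timestamp ranges instead of element counts.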
23. TIME
Different notions of
time:
• Event Time
• Ingestion Time
• Processing Time
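The three notions can be illustrated for a single record (plain Python; the field names are ours):

```python
# Conceptual sketch of the three notions of time for one record.
import time

# Event time: stamped by the producer when the event was created.
record = {"value": 42, "event_time": 1538900000.0}

ingestion_time = time.time()   # when the record enters the dataflow at the source
# ... record travels through downstream operators ...
processing_time = time.time()  # local clock of the operator doing time-based work

# Event-time windows give reproducible results even when records arrive
# late or out of order; processing-time windows depend on arrival order.
```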
24. STATEFUL OPERATIONS
• Some operations in a dataflow simply look at
one individual event at a time.
• Other operations remember information
across multiple events (for example window
operators). These operations are
called stateful.
• The state of stateful operations is maintained
in what can be thought of as an embedded
key/value store.
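The embedded key/value store idea can be sketched in plain Python as a per-key running count, where each state access touches only the current event's key (an illustration, not Flink's keyed-state API):

```python
# Conceptual sketch of keyed state: per-key counts held in what amounts
# to an embedded key/value store. In Flink this state is partitioned and
# distributed together with the keyed stream.
from collections import defaultdict

def keyed_count(events):
    state = defaultdict(int)  # key -> count
    out = []
    for key, value in events:
        state[key] += 1  # state access is local to the current event's key
        out.append((key, state[key]))
    return out

stream = [("a", 1), ("b", 5), ("a", 2)]
print(keyed_count(stream))  # [('a', 1), ('b', 1), ('a', 2)]
```

Keeping all updates local to one key is what lets Flink guarantee consistency without transaction overhead.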
26. WHO IS USING APACHE FLINK?
• Alibaba, the world's largest retailer,
uses a fork of Flink called Blink to
optimize search rankings in real time.
• eBay's monitoring platform is
powered by Flink and evaluates
thousands of customizable alert rules
on metrics and log streams.
• Huawei is a leading global provider
of ICT infrastructure and smart
devices. Huawei Cloud provides
Cloud Service based on Flink.
• Uber built their internal SQL-based,
open-source streaming analytics
platform AthenaX on Apache Flink.
27. WHO IS USING APACHE FLINK?
Apache Flink® user survey by dataArtisans
• Enterprises are investing heavily in stream
processing technology
• 87% planning to deploy more applications
powered by Apache Flink software in 2018
• 64% Machine Learning
o 34% Model Scoring
o 30% Model Training
• 27% Anomaly Detection/System Monitoring
• 25% Business Intelligence
“… the ability to react to data in the moment is
becoming a top priority among enterprises of all
sizes”
29. LAST YEAR’S TALKS (2017-18)
• Aris Koliopoulos & Alex Garella – “Panta Rhei: designing distributed
applications with streams.”
• Patrick Lucas, giving a lightning talk on “Best practices around Flink state types
(List/Map/ValueState etc).”
• Stavros Kontopoulos with “Let’s talk ML on Flink”
• Stephan Ewen (CTO & Co-Founder of Data Artisans), presenting “Stream SQL
and Realtime Applications with Apache Flink”
30. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
• DriveTribe: a digital automotive community platform
founded by, and featuring content from The Grand Tour
presenters
• Users consume feeds and interact with a variety of
content: videos, images, articles
• Problem: they wanted a scalable way to produce
personalised rankings of articles for users.
31. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
What they tried:
1. Stored data in a DB and computed the aggregates on the fly
o Was very slow (high read time) and didn’t scale.
2. Tried computing aggregations at write time with the intention of reducing read time:
one read can fetch all views at once
o Not fault tolerant; if one read fails, they all fail.
o What about state mutations on the read data?
32. PANTA RHEI: DESIGNING DISTRIBUTED APPLICATIONS
WITH STREAMS
ARIS KOLIOPOULOS & ALEX GARELLA
Solution: Treat event streams as source of truth for applications—a powerful alternative
to using RPCs, Enterprise Messaging or a Shared Database to communicate and share
data across different applications or microservices
1. Clients send events to the API
(John liked Jeremy’s post)
2. Events are immutable; they
capture a certain action at some
point in time
3. Every application state instance
can be modelled as a projection
of those events
33. BEST PRACTICES AROUND FLINK STATE TYPES
(LIST/MAP/VALUESTATE)
PATRICK LUCAS
Different types of Managed
States:
• ValueState<T>
• ListState<T>
• ReducingState<T>
• AggregatingState<IN, OUT>
• FoldingState<T, ACC>
• MapState<UK, UV>
• “The cost of very frequent updates
(serialisation/deserialisation)” … illustrated how we
can make use of transient variables to reduce it.
• “When to use ReducingState vs AggregatingState
vs FoldingState?”
Also, discussed the beta version of Queryable State.
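The semantic difference between ReducingState and AggregatingState can be sketched in plain Python (the class names and the average accumulator are illustrative, not Flink's actual interfaces):

```python
# Conceptual sketch of ReducingState vs AggregatingState semantics.

class ReducingStateSketch:
    """Keeps a single value of the same type as the inputs."""
    def __init__(self, reduce_fn):
        self.reduce_fn, self.value = reduce_fn, None
    def add(self, v):
        self.value = v if self.value is None else self.reduce_fn(self.value, v)

class AggregatingStateSketch:
    """Accumulator (and output) type may differ from the input type."""
    def __init__(self):
        self.total, self.count = 0, 0   # accumulator: (sum, count)
    def add(self, v):
        self.total += v
        self.count += 1
    def get(self):
        return self.total / self.count  # output: an average, a different type

r = ReducingStateSketch(lambda a, b: a + b)
for v in (1, 2, 3):
    r.add(v)
a = AggregatingStateSketch()
for v in (1, 2, 3):
    a.add(v)
print(r.value, a.get())  # 6 2.0
```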
34. LET’S TALK ML ON FLINK
STAVROS KONTOPOULOS
• How about running model serving
“natively”, inside the Flink server?
• How? Use dynamically controlled
stream approach—models are
delivered to running
implementation via model’s
stream and dynamically
instantiated for usage.
Proposition: build a streaming system that allows models to be updated without interrupting
execution
35. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
SQL was not designed for streams:
• Relations are bounded (multi-)sets while streams are infinite
sequences
• DBMS can access all data while streaming data arrives over time
• SQL queries return a result and end while streaming queries
continuously emit results and never end
36. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
DBMS run queries on streams all the time!
• Materialised Views (MVs) are used to speed up analytical queries
• They need to update when tables change
• MV maintenance is very similar to stream processing:
• Table updates are a stream of statements
• MV definitions (queries) are evaluated (continuously) on that stream
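The analogy can be sketched in plain Python: a stream of table-update statements incrementally maintains a clicks-per-user view (the view definition and names are illustrative):

```python
# Conceptual sketch: maintaining a materialised view (clicks per user)
# from a stream of table-update statements -- the analogy between
# MV maintenance and stream processing.

def maintain_view(view, update):
    """Apply one table-update statement to the materialised view."""
    op, user = update
    if op == "INSERT":
        view[user] = view.get(user, 0) + 1
    elif op == "DELETE":
        view[user] = view.get(user, 0) - 1
    return view

view = {}
for upd in [("INSERT", "alice"), ("INSERT", "bob"),
            ("INSERT", "alice"), ("DELETE", "bob")]:
    maintain_view(view, upd)
print(view)  # {'alice': 2, 'bob': 0}
```

The MV definition (the query) is evaluated continuously on the update stream, exactly as a streaming query would be.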
37. STREAM SQL AND REALTIME APPLICATIONS WITH
APACHE FLINK
STEPHAN EWEN (CTO & CO-FOUNDER OF DATA ARTISANS)
What about windows?
39. WHAT’S NEW WITH FLINK V.1.6.0
• Simplifying Apache Flink’s state with the addition of
native support for state TTL.
• Further improvements to the Streaming SQL CLI, including
simplifying the execution of streaming and batch
queries against different data sources
• Improved Flink connectors allowing better integration with
external systems.
40. WHAT’S IN STORE?
• Integration of SQL and CEP
• Unified checkpoints and savepoints
• An improved Flink deployment and process model
• Fine-grained recovery from task failures
• An SQL Client to execute SQL queries against batch and streaming tables.
• Serving of machine learning models.
42. CLICKSTREAM PROCESSING AT THE FINANCIAL
TIMES
The Financial Times (FT) processes millions of customer
events per day. The ability to monitor such events in real-
time is crucial for attracting new customers, monitoring
the popularity of articles and personalising experiences.
In this talk, the FT team will show us:
• how they use Flink to process their clickstream;
• how they operate the pipeline using Docker Swarm in
AWS;
• how they keep secrets safe using Vault, and
• how they monitor it with Prometheus and Grafana.
43. LIGHTNING TALKS
• Give back to the community!
• Have an idea you want to discuss?
• Have done work you want to talk about?
• Found out about a new concept and want to present it?
Come, do a lightning talk!
15 mins of pure excitement & passion!
Editor's Notes
We started off in 2016
We are excited about its potential, and we want to find other people who are interested. Apache Flink is a 'streaming first' data processing engine
… active participation is something that we want to change in the future (we will discuss this further around the end of this presentation)
This includes parallel processing in which a single computer uses more than one CPU to execute programs.
At a high level, we can consider state in stream processing as memory in operators that remembers information about past input and can be used to influence the processing of future input.
… like a Markov Chain: A Markov chain is "a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event"
In contrast, operators in stateless stream processing only consider their current inputs, without further context and knowledge about the past.
A simple example to illustrate this difference: let us consider a source stream that emits events with schema e = {event_id:int, event_value:int}.
Our goal is, for each event, to extract and output the event_value.
We can easily achieve this with a simple source-map-sink pipeline, where the map function extracts the event_value from the event and emits it downstream to an outputting sink.
This is an instance of stateless stream processing.
But what if we want to modify our job to output the event_value only if it is larger than the value from the previous event?
In this case, our map function obviously needs some way to remember the event_value from a past event — and so this is an instance of stateful stream processing.
This example should demonstrate that state is a fundamental, enabling concept in stream processing that is required for a majority of interesting use cases.
There are of course, more complex states such as keeping a state-machine for detecting patterns for fraudulent financial transactions or holding a model for some machine learning application
Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream.
Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested.
Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations.
Flink is a layered system. The different layers of the stack build on top of each other and raise the abstraction level of the program representations they accept:
The runtime layer receives a program in the form of a JobGraph.
Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while the DataStream API uses a stream builder.
The JobGraph is executed according to a variety of deployment options available in Flink (e.g., local, remote, YARN (resource management and job scheduling), etc.)
Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
The lowest level abstraction simply offers stateful streaming. It is embedded into the DataStream API via the Process Function. It allows users to freely process events from one or more streams, and use consistent fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to realize sophisticated computations.
In practice, most applications would not need the lowest level abstraction, but would instead program against the Core APIs like the DataStream API (bounded/unbounded streams) and the DataSet API (bounded data sets).
These APIs offer the common building blocks for data processing, like various forms of user-specified transformations, joins, aggregations, windows, state, etc.
The Table API is a declarative Domain Specific Language centered around tables
One can seamlessly convert between tables and DataStream/DataSet, allowing programs to mix the Table API with the DataStream and DataSet APIs.
The highest level of abstraction offered by Flink is SQL. This abstraction is similar to the Table API both in semantics and expressiveness, but represents programs as SQL query expressions.
Conceptually a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.
When executed, Flink programs are mapped to streaming dataflows, consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends in one or more sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs).
The operator subtasks are independent of one another, and execute in different threads and possibly on different machines or containers.
Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because streams are in general infinite (unbounded). Instead, aggregates on streams (counts, sums, etc), are scoped by windows, such as “count over the last 5 minutes”, or “sum of the last 100 elements”
When referring to time in a streaming program (for example to define windows), one can refer to different notions of time:
Event Time is the time when an event was created. It is usually described by a timestamp in the events, for example attached by the producing sensor, or the producing service. Flink accesses event timestamps via timestamp assigners.
Ingestion time is the time when an event enters the Flink dataflow at the source operator.
Processing Time is the local time at each operator that performs a time-based operation.
While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful.
The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on keyed streams, after a keyBy() function, and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
https://flink.apache.org/poweredby.html
Enterprises are investing heavily in stream processing technology, according to the second annual Apache Flink® user survey data Artisans announced: the vast majority (87 percent) of organizations surveyed are planning to deploy more applications powered by Apache Flink software in 2018. Of dozens of new application types developers are building or planning to build, machine learning (64 percent) both for model scoring (34 percent) and model training (30 percent), anomaly detection/system monitoring (27 percent) and business intelligence/reporting (25 percent) are the most popular, followed by recommendation/decisioning engines (22 percent) and security/fraud detection (19 percent), to round out the top five.
Most respondents (70 percent) say their team or department is growing and hiring in 2018. Nearly as many (59 percent) expect their team or departmental budget to increase.
Drawing on these insights it seems like the ability to react to data in the moment is becoming a top priority among enterprises of all sizes
A pattern where replayable logs, like Apache Kafka, are used for both communication as well as event storage, incorporating the retentive properties of a database in a system designed to share data across many teams, clouds and geographies.
ValueState<T>: This keeps a value that can be updated and retrieved
ListState<T>: This keeps a list of elements. You can append elements and retrieve an Iterable
ReducingState<T>: This keeps a single value that represents the aggregation of all values added to the state.
AggregatingState<IN, OUT>: Contrary to ReducingState, the aggregate type may be different from the type of elements that are added to the state.
FoldingState<T, ACC>: Same as AggregatingState but here values are folded into an aggregate using a specified FoldFunction.
MapState<UK, UV>: This keeps a list of mappings.
Machine Learning/Deep Learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams). This allows for better latency and independence of external services.
However, direct deployment of models is not always a feasible approach. Sometimes it makes sense or is needed to deploy a model in another serving infrastructure like TensorFlow Serving for TensorFlow models.
Model inference is then done via Remote Procedure Calls/Request-Response communication.
Organizational or technical reasons might force this approach.
Stavros suggested running model serving natively, in this case inside the Flink server.
Use dynamically controlled stream approach—models are delivered to running implementation via model’s stream and dynamically instantiated for usage.
… as new events come through the live event stream, we’re able to evaluate them against the newly-added models (or rules).
DBMS run queries on streams all the time!
Materialised Views (MVs) are used to speed up analytical queries
They need to update when tables change
MV maintenance is very similar to stream processing:
Table updates are a stream of statements
MV definitions (queries) are evaluated (continuously) on that stream
What is a materialised view?
Whenever a query or an update addresses an ordinary view's virtual table, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete ("materialized") table (rather than a view as such) that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of extra storage and of some data being potentially out-of-date.
Core concept is a dynamic table which change over time
Queries on dynamic tables produce new dynamic tables which are updated based on input and do not terminate
In the figure you can see the process of dynamic table conversion
Number of clicks in the last hour
Simplifying Apache Flink’s state with the addition of native support for state TTL (Time to Live). This feature allows state to be cleaned up after it has expired. With Flink 1.6.0 timer state can now go out of core by storing the relevant state in RocksDB. Moreover, the team improved the deletion of timers significantly.
Support for resource elasticity and different deployment scenarios (such as better container integration). Flink 1.6.0 comes with HTTP/REST based external communications and job submissions as well as a container entrypoint for simplified bootstrapping of containerized job clusters.
Further improvements to the Streaming SQL CLI, including simplifying the execution of streaming and batch queries against different data sources, adding full Avro support for easily reading any kind of Avro data, and hardening Flink’s CEP library to handle significantly larger state sizes compared to past versions.
Improved Flink connectors allowing better integration with external systems. The additions to Flink 1.6.0 include a new StreamingFileSink that replaces the BucketingSink as the standard file sink from previous versions, support for ElasticSearch 6.x, and different AvroDeserializationSchemas to seamlessly ingest Avro data.
Integration of SQL and CEP, as described in FLIP-20 to allow developers to create complex event processing (CEP) patterns using SQL statements.
Unified checkpoints and savepoints, as described in FLIP-10, to allow savepoints to be triggered automatically–important for program updates for the sake of error handling because savepoints allow the user to modify both the job and Flink version whereas checkpoints can only be recovered with the same job.
An improved Flink deployment and process model, as described in FLIP-6, to allow for better integration with Flink and cluster managers and deployment technologies such as Mesos, Docker, and Kubernetes.
Fine-grained recovery from task failures, as described in FLIP-1 to improve recovery efficiency and only re-execute failed tasks, reducing the amount of state that Flink needs to transfer on recovery.
An SQL Client, as described in FLIP-24 to add a service and a client to execute SQL queries against batch and streaming tables.
Serving of machine learning models, as described in FLIP-23 to add a library that allows users to apply offline-trained machine learning models to data streams.