@arafkarsh arafkarsh
ARAF KARSH HAMID
Co-Founder / CTO
MetaMagic Global Inc., NJ, USA
@arafkarsh
arafkarsh
Microservice
Architecture Series
Building Cloud Native Apps
Kinesis Data Streams
Kinesis Firehose
Kinesis Data Analytics
Apache Flink
Part 3 of 11
@arafkarsh arafkarsh 2
Slides are color-coded by topic.
AWS Kinesis
Video Streams
Data Streams
1
AWS Kinesis
Data Firehose
Data Analytics
2
Apache Flink
Streams
Table / SQL
3
Kinesis
Case Studies 4
@arafkarsh arafkarsh
Agile
Scrum (4-6 Weeks)
Developer Journey
Monolithic
Domain Driven Design
Event Sourcing and CQRS
Waterfall
Optional
Design
Patterns
Continuous Integration (CI)
6/12 Months
Enterprise Service Bus
Relational Database [SQL] / NoSQL
Development QA / QC Ops
3
Microservices
Domain Driven Design
Event Sourcing and CQRS
Scrum / Kanban (1-5 Days)
Mandatory
Design
Patterns
Infrastructure Design Patterns
CI
DevOps
Event Streaming / Replicated Logs
SQL NoSQL
CD
Container Orchestrator Service Mesh
@arafkarsh arafkarsh
Application Modernization – 3 Transformations
4
Monolithic SOA Microservice
Physical
Server
Virtual
Machine
Cloud
Waterfall Agile DevOps
Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY
Architecture
Infrastructure
Delivery
@arafkarsh arafkarsh
Application Modernization – 3 Transformations
Monolithic SOA Microservice
Physical
Server
Virtual
Machine
Cloud
Waterfall Agile DevOps
Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY
Architecture
Infrastructure
Delivery
Modernization
1
2
3
5
@arafkarsh arafkarsh
Microservices Principles….
6
Components
via
Services
Organized around
Business
Capabilities
Products
NOT
Projects
Smart
Endpoints
& Dumb Pipes
Decentralized
Governance &
Data Management
Infrastructure
Automation
Design for
Failure
Evolutionary
Design
@arafkarsh arafkarsh
AWS Kinesis
• Data Streams
• Video Streams
7
1
Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
AWS Kinesis - Purpose
Collect, Process, and Analyze
Realtime Streaming Data
Ingest Realtime Data
1. Video
2. Audio
3. Application Logs
4. Website Click Streams
5. IoT Telemetry Data
1. Analytics
2. Machine Learning
8
@arafkarsh arafkarsh
AWS Kinesis
Kinesis Video Streams
helps you to securely stream
video from systems to AWS
for processing such as
Analytics, Machine Learning
and others.
Kinesis Data Streams
is a highly Scalable, Durable,
& Realtime data streaming
service that can capture
Gigabytes of data per second
from different data sources.
Kinesis Data Firehose
is used to Extract, Transform,
Load (ETL) data streams
into AWS stores like S3,
Redshift, OpenSearch etc.
for near Realtime data
analytics.
Kinesis Data Analytics
is used to process the
real-time streams in
SQL or Java or Python.
9
@arafkarsh arafkarsh
Streaming Data
• Continuously generated Data that is processed sequentially or incrementally
• Data is sent record by record, by the thousands, or over sliding time windows
Data Sources
Use Cases
Gaming Stock Market
Real Estate Transport Applications
10
@arafkarsh arafkarsh
Kinesis Video Streams
Devices
Processing
• AWS Rekognition
• AWS Sage Maker
• Tensor Flow
• HLS Playback
• Custom Video
Processing
• Automatically scales the infrastructure needed for streaming video data from devices
• Stream video from connected devices to AWS for Analytics, Machine Learning, Playback etc.
• Stores, Encrypts and Indexes video data, and provides access to the data using APIs
HLS – HTTP Live Streaming
INPUT
Kinesis Video
Stream
11
@arafkarsh arafkarsh
Kinesis Data Streams
Applications
Processing
• Kinesis Data
Analytics
• Spark
• AWS EC2
• AWS Lambda
• Kinesis Data Streams is a Highly Scalable and Durable Real-time streaming service
• Stream Data from connected devices to AWS for Analytics, Machine Learning, etc.
INPUT
Kinesis Data
Stream
12
@arafkarsh arafkarsh
Kinesis Data Streams: Example
Applications
• Raw Events are coming from Cart Checkout
• Using a Lambda, the Raw Event is Enriched and sent to another Stream for further processing
Event Producer
Kinesis Data
Stream
Raw Events
13
Kinesis Data
Stream
Enriched Events
Enrich the
Checkout Event
IN OUT
Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
Kinesis Data Firehose
Store Data
• AWS S3
• AWS Redshift
• AWS Elastic
Search
• Splunk
• Kinesis Data Firehose is used to store the streaming data into Data Stores, Lakes etc.
• Firehose is used to Capture, Transform and Load Data into S3, Redshift etc.
Kinesis Data
Stream
Kinesis Data
Firehose
Data
Transformation
using Lambda
14
@arafkarsh arafkarsh
Kinesis Data Analytics
• Kinesis Data Analytics is used to analyze the streaming Data
• Reduces the complexity in building and deploying Analytics Applications
• Provides built-in Functions to Filter, Aggregate and Transform Streaming Data
• Serverless Architecture
• Under the hood it's Apache Flink (v1.13)
INPUT
Kinesis Data Stream
Kinesis Data
Analytics
OUTPUT
Kinesis Data Stream
15
@arafkarsh arafkarsh
AWS Kinesis – Summary
16
Kinesis Video Streams
helps you to securely stream
video from systems to AWS
for processing such as
Analytics, Machine Learning
and others.
Kinesis Data Streams
is a highly Scalable, Durable,
& Realtime data streaming
service that can capture
Gigabytes of data per second
from different data sources.
Kinesis Data Firehose
is used to Extract, Transform,
Load (ETL) data streams
into AWS stores like S3,
Redshift, OpenSearch etc.
for near Realtime data
analytics.
Kinesis Data Analytics
is used to process the
real-time streams in
SQL or Java or Python.
@arafkarsh arafkarsh
Kinesis Data Streams
Producers
Consumers
17
Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
How it works
Source: https://aws.amazon.com/kinesis/data-streams/
18
@arafkarsh arafkarsh
Architecture
19
Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
@arafkarsh arafkarsh
Kinesis Data Streams
Data Record
The atomic unit of data in a Data Stream stored in Kinesis
Data Stream
Collection of Data Records streamed and stored in multiple
shards.
Data Record
Data Record
Data Record
Data Record
Data Stream
Data Record Data Record Data Record Shard 1
Data Record Data Record Data Record Shard 2
Data Record Data Record Data Record Shard n
Data Stream
20
Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
Producers put the Data
Records into the Shards
and
Consumers retrieve the
data from the Shards.
@arafkarsh arafkarsh
Kinesis Data Streams: Shards
21
• A shard is a uniquely identified sequence of data records in a stream.
• A stream is composed of one or more shards, each of which provides a fixed unit of
capacity.
• Each shard can support up to 5 transactions per second for reads, up to a maximum total
data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a
maximum total data write rate of 1 MB per second (including partition keys).
• The data capacity of your stream is a function of the number of shards that you specify
for the stream.
Data Record Data Record Data Record Shard 1
Data Record Data Record Data Record Shard 2
Data Record Data Record Data Record Shard n
Data
Stream
Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
• The total capacity of
the stream is the sum
of the capacities of its
shards.
@arafkarsh arafkarsh
Kinesis Data Streams: Partition Keys
22
Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
Partition Key Data BLOB
• A partition key is used to group data by shard within a stream.
• Kinesis Data Streams segregates the data records belonging to a stream into
multiple shards.
• It uses the partition key that is associated with each data record to determine
which shard a given data record belongs to.
• Partition keys are Unicode strings, with a maximum length limit of 256 characters
for each key.
• An MD5 hash function is used to map partition keys to 128-bit integer values and
to map associated data records to shards using the hash key ranges of the shards.
• When an application puts data into a stream, it must specify a partition key.
@arafkarsh arafkarsh
Kinesis Data Streams: Sequence Number
23
• Each data record has a sequence number that is
unique per partition-key within its shard.
• Kinesis Data Streams assigns the sequence number
after you write to the stream
with client.putRecords or client.putRecord.
• Sequence numbers for the same partition key generally
increase over time.
• The longer the time period between write requests, the
larger the sequence numbers become.
Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
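A minimal sketch of writing a record, assuming the AWS SDK for Java v2 (the client.putRecord call above maps to KinesisClient.putRecord here); the stream name, partition key, and payload are placeholders:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.kinesis.model.PutRecordResponse;

public class PutCheckoutEvent {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            PutRecordResponse response = kinesis.putRecord(PutRecordRequest.builder()
                .streamName("checkout-events")   // placeholder stream name
                .partitionKey("customer-42")     // MD5(partitionKey) selects the shard
                .data(SdkBytes.fromUtf8String("{\"cartId\":\"c-1001\",\"amount\":59.90}"))
                .build());
            // Kinesis assigns the sequence number on write, per partition key per shard
            System.out.println("Shard: " + response.shardId()
                + ", sequence number: " + response.sequenceNumber());
        }
    }
}
```

The response carries the shard that the hashed partition key mapped to and the sequence number Kinesis assigned on write.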
@arafkarsh arafkarsh
Kinesis
Data Stream
Lambda Config
24
Example Source:
https://github.com/MetaArivu/Kinesis-Quickstart
You can Control the Stream
• Batch Size
• Batch Window in Seconds
• Max Retry
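The same three knobs can also be set programmatically when wiring the stream to the Lambda; a hedged sketch assuming the AWS SDK for Java v2, with a placeholder stream ARN and function name:

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.CreateEventSourceMappingRequest;
import software.amazon.awssdk.services.lambda.model.EventSourcePosition;

public class StreamLambdaMapping {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            // Wire the Kinesis stream to the Lambda and set batch size,
            // batch window, and max retries as shown above
            lambda.createEventSourceMapping(CreateEventSourceMappingRequest.builder()
                .eventSourceArn("arn:aws:kinesis:us-east-1:123456789012:stream/checkout-events") // placeholder
                .functionName("enrich-checkout-event")  // placeholder function name
                .startingPosition(EventSourcePosition.LATEST)
                .batchSize(100)                         // records per invocation
                .maximumBatchingWindowInSeconds(5)      // wait up to 5 s to fill a batch
                .maximumRetryAttempts(3)                // retries on function error
                .build());
        }
    }
}
```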
@arafkarsh arafkarsh
Kinesis
Data Stream
Lambda
25
Example Source:
https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
Multi Consumer Fan out
Source: https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html
26
@arafkarsh arafkarsh
Data Stream – On Demand Scaling
On-Demand
• Automatically provisions the
infrastructure
• Max 200 MiB per Second OR
• Max 200K Records per Second
27
@arafkarsh arafkarsh
Data Stream – Retention 1 Day
28
@arafkarsh arafkarsh
Data Stream – Retention 365 Days
Retention Days
• Min 1 Day
• Max 365 Days
29
@arafkarsh arafkarsh
Data Stream - Monitoring
30
@arafkarsh arafkarsh
Security
• Data is automatically encrypted before it's stored in the
Shard.
• Encryption is done using AWS KMS Customer Master
Key
Server-Side Encryption
31
@arafkarsh arafkarsh
Kinesis Video Streams
• Realtime using WebRTC
• Batch Mode
32
@arafkarsh arafkarsh
Kinesis
Video
Streams
Video Producer Library
1. Java
2. Android
3. C++
4. C
33
@arafkarsh arafkarsh
Kinesis Video Stream
Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html
A Producer can be any video-generating device, such as
• a security camera,
• a body-worn camera,
• a smartphone camera, or
• a dashboard camera.
A producer can also send non-video data, such
as audio feeds, images, or RADAR data.
A single producer can generate one or more video
streams. For example, a video camera can push
video data to one Kinesis video stream and audio
data to another.
Kinesis Video Streams Producer libraries
• Install and configure on your devices.
• Securely connect and reliably stream video in different ways:
in real time, after buffering it for a few seconds,
or as after-the-fact media uploads.
34
@arafkarsh arafkarsh
Kinesis Video Stream End Points
Examples: Sending Data to Kinesis Video Streams
• Example: Kinesis Video Streams Producer SDK GStreamer
Plugin: Shows how to build the Kinesis Video Streams
Producer SDK to use as a GStreamer destination.
• Run the GStreamer Element in a Docker Container: Shows
how to use a pre-built Docker image for sending RTSP
video from an IP camera to Kinesis Video Streams.
• Example: Streaming from an RTSP Source: Shows how to
build your own Docker image and send RTSP video from an
IP camera to Kinesis Video Streams.
• Example: Sending Data to Kinesis Video Streams Using the
PutMedia API: Shows how to use the Java Producer Library
to send data that is already in the Matroska (MKV)
container format to Kinesis Video Streams using
the PutMedia API.
GStreamer is a popular media
framework used by a multitude of
cameras and video sources to create
custom media pipelines by combining
modular plugins.
• RTSP Camera on Ubuntu
• USB Camera on Ubuntu
• Camera on Raspberry Pi
Source:
https://docs.aws.amazon.com/kine
sisvideostreams/latest/dg/examples
-gstreamer-plugin.html
35
@arafkarsh arafkarsh
Kinesis Video Stream
Kinesis video stream
• Transport live video data, optionally store it
• Data available for consumption both in real
time
and on a batch or ad hoc basis.
• A Kinesis video stream has only one producer
publishing data into it.
The stream can carry
• audio,
• video, and
• similar time-encoded data streams, such as
• depth sensing feeds,
• RADAR feeds, and more.
Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html
Kinesis Video Stream Consumer (App)
• Gets data, such as fragments and frames,
from a Kinesis video stream
• To view, process, or analyse it.
Kinesis Video Stream Parser Library
• To reliably get media from Kinesis video
streams in a low-latency manner.
• It parses the frame boundaries in the
media so that applications can focus on
processing and analysing the frames
themselves.
36
@arafkarsh arafkarsh
Kinesis Video Stream Parser Library
• StreamingMkvReader: This class reads specified MKV elements from a video stream.
• FragmentMetadataVisitor: This class retrieves metadata for fragments (media
elements) and tracks (individual data streams containing media information, such as
audio or subtitles).
• OutputSegmentMerger: This class merges consecutive fragments or chunks in a video
stream.
• KinesisVideoExample: This is a sample application that shows how to use the Kinesis
Video Stream Parser Library.
The library also includes tests that show how the tools are used.
37
@arafkarsh arafkarsh
Kinesis Data Firehose
38
2
Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
Kinesis Data Firehose
Store Data
• AWS S3
• AWS Redshift
• AWS Elastic
Search
• Splunk
• Kinesis Data Firehose is used to Store the Streaming data into Data Stores, Lakes etc.
• Firehose is used to Capture, Transform & Load Data into S3, Redshift etc.
Kinesis Data
Stream
Kinesis Data
Firehose
Data
Transformation
using Lambda
39
@arafkarsh arafkarsh
Kinesis Data Firehose – Transformation Lambda
40
recordId
• The record ID is passed from Kinesis Data Firehose to Lambda during the invocation.
• The transformed record must contain the same record ID.
• Any mismatch between the ID of the original record and the ID of the transformed record is treated
as a data transformation failure.
result
The status of the data transformation of the record. The possible values are:
• Ok (the record was transformed successfully),
• Dropped (the record was dropped intentionally by your processing logic), and
• ProcessingFailed (the record could not be transformed).
If a record has a status of Ok or Dropped, Kinesis Data Firehose considers it successfully processed.
Otherwise, Kinesis Data Firehose considers it unsuccessfully processed.
data
The transformed data payload, after base64-encoding.
Source: https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
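A minimal sketch of a transformation Lambda honoring this recordId / result / data contract, assuming a Java handler built on the aws-lambda-java-events library; the transformation itself (appending a newline) is illustrative only:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class FirehoseTransformer {
    public Map<String, Object> handleRequest(KinesisFirehoseEvent event, Context context) {
        List<Map<String, String>> out = new ArrayList<>();
        for (KinesisFirehoseEvent.Record rec : event.getRecords()) {
            String payload = StandardCharsets.UTF_8.decode(rec.getData()).toString();
            String transformed = payload.trim() + "\n";   // illustrative transformation
            Map<String, String> record = new HashMap<>();
            record.put("recordId", rec.getRecordId());    // must echo the original record ID
            record.put("result", "Ok");                   // Ok | Dropped | ProcessingFailed
            record.put("data", Base64.getEncoder()        // payload must be base64-encoded
                .encodeToString(transformed.getBytes(StandardCharsets.UTF_8)));
            out.add(record);
        }
        return Collections.singletonMap("records", out);
    }
}
```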
@arafkarsh arafkarsh
Kinesis
Firehose
Lambda Config
41
Example Source:
https://github.com/MetaArivu/Kinesis-Quickstart
You can Control the Stream
• Batch Size
• Batch Window in Seconds
• Max Retry
@arafkarsh arafkarsh
Kinesis
Firehose
Lambda
42
Example Source:
https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
Kinesis Data Firehose – S3
43
@arafkarsh arafkarsh
Kinesis Data Firehose – S3
44
@arafkarsh arafkarsh
Kinesis Data Firehose – Direct Input
45
@arafkarsh arafkarsh
Kinesis Data Analytics
46
Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
@arafkarsh arafkarsh
Kinesis Data Analytics
• Kinesis Data Analytics is used to Analyze the Streaming Data
• Reduces the complexity in building and deploying Analytics Applications
• Provides built-in Functions to Filter, Aggregate & Transform Streaming Data
• Serverless Architecture
• Under the hood it's Apache Flink (v1.13) – December 2021
INPUT
Kinesis Data Stream
Kinesis Data
Analytics
OUTPUT
Kinesis Data Stream
47
@arafkarsh arafkarsh
Kinesis Data Analytics – Architecture (Flink)
AWS Cloud
Kinesis Data Analytics
Elastic Kubernetes Service
Job Manager
Task Manager
Task Manager Task Manager
S3 Bucket
Auto Scaling
Zookeeper
Cloud Watch
Cloud Watch Logs
Flink Web UI
48
@arafkarsh arafkarsh
Kinesis Data Analytics
49
@arafkarsh arafkarsh
Kinesis Data Analytics
50
@arafkarsh arafkarsh
Apache Flink
Open-Source Stream Processing Framework
51
3
@arafkarsh arafkarsh
Apache Flink
Ease of Programming Stateful Processing
High Performance Strong Data Integrity
Flexible
APIs for
Programming
Low Latency &
Horizontally
Scalable
Stores
Application
States
Exactly Once
Processing &
Consistent State
52
Is an Open-Source Stream Processing Framework
@arafkarsh arafkarsh
What is Apache Flink
53
Stateful Computations over Data Streams
Batch
Processing
Process Static &
historic Data
Data Stream
Processing
Realtime Results
from Data Streams
Event Driven
Applications
Data Driven Actions
and Services
Instead of Spark + Hadoop
@arafkarsh arafkarsh
Use Case: Periodic ETL vs Streaming CTL
54
Traditional
Periodic ETL
• External Tool
Periodically
triggers ETL
Batch Job
Batch
Processing
Process Static &
historic Data
Data Stream
Processing
Realtime Results
from Data Streams
Continuous
Streaming Data
Pipeline
• Ingestion with
Low Latency
• No Artificial
Boundaries
Streaming
App
Ingest Append
Real Time Events
Event Logs
Batch
Process
Module
Read
Write
Transactional Data
Extract, Transform, Load Capture, Transform, Load
State
Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
@arafkarsh arafkarsh
Use Case: Data Analytics
55
• Great for Ad-Hoc Queries
• Queries change faster than data
Batch
Analytics
Stream
Analytics
Ingest
K-V Data Store
Real Time Events
Batch
Analytics
Read Write
Recorded Events
• High Performance Low Latency Result
• Data Changes faster than Queries
Analytics
App
State
State
Update
Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
@arafkarsh arafkarsh
Use Case: Event Driven Application
56
• Compute & Data Tier Architecture
• React to Process Events
• State is stored in (Remote) Database
Traditional
Application
Design
Event Driven
Application
• High Performance Low Latency Result
• Data Changes faster than Queries
Application
Read Write
Events
Trigger
Action
Ingest
Real Time Events
Application
State
Append
Periodically write
asynchronous checkpoints
in Remote Database
Event Logs
Event Logs
Trigger
Action
Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
@arafkarsh arafkarsh
Apache Flink Use Case Features
• Business, Operational, Technical App Metrics
• User Experience Metrics
Real-time Analytics
• Transform, Filter, Aggregate Streaming Data
• IoT and Application Log Analysis
Streaming ETL Applications
• Trigger Conditions and External Notifications
• Detecting Patterns / Anomaly
Stateful Event Processing
57
@arafkarsh arafkarsh
Apache Flink
Architecture
• Architecture
• Anatomy of the Flink Cluster
• Tasks, Slots & Operator Chains
• Anatomy of a Flink Program
• Flink API & Operators 58
@arafkarsh arafkarsh
Apache Flink Architecture
59
@arafkarsh arafkarsh
Deployment Model
60
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
The Job Manager distributes the work
onto the Task Managers, where the
actual operators such as
1. sources,
2. transformations and
3. sinks
are running.
Job Manager is the name
of the central work
coordination component
of Flink.
Task Managers are
the services actually
performing the work
of a Flink job.
@arafkarsh arafkarsh
Anatomy of the Flink Cluster
61
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/
Job Manager:
Resource Manager
It is responsible for resource de-/allocation and
provisioning in a Flink cluster — it manages task slots,
which are the unit of resource scheduling in a Flink cluster.
Dispatcher
It provides a REST interface to
submit Flink applications for
execution and starts a new Job
Master for each submitted job.
Job Master
It is responsible for managing the
execution of a single JobGraph.
Multiple jobs can run simultaneously
in a Flink cluster, each having its
Job Master.
@arafkarsh arafkarsh
Job Manager HA
62
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/overview/
Flink ships with two high availability service implementations:
• ZooKeeper: ZooKeeper HA services can be used with every
Flink cluster deployment. They require a running
ZooKeeper quorum.
• Kubernetes: Kubernetes HA services only work when
running on Kubernetes.
Flink’s high availability services encapsulate the required services
to make everything work:
• Leader election: Selecting a single leader out of a pool
of n candidates
• Service discovery: Retrieving the address of the current leader
• State persistence: Persisting state which is required for the
successor to resume the job execution (Job Graphs, user code
jars, completed checkpoints)
@arafkarsh arafkarsh
Deployment Modes
63
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
@arafkarsh arafkarsh
Task & Operator Chains
64
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/
• For distributed execution, Flink
chains operator subtasks
together into tasks
• Each task is executed by one
thread.
• Chaining operators together into
tasks is a useful optimization:
• it reduces the overhead of
thread-to-thread handover and
buffering,
• and increases overall
throughput while decreasing
latency.
@arafkarsh arafkarsh
Task Slots
65
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/
• Each worker (Task Manager) is a JVM process and may
execute one or more subtasks in separate threads.
• To control how many tasks a Task Manager accepts, it
has so-called task slots (at least one).
• Memory is divided equally across the slots.
• No CPU isolation across task slots.
• Having multiple slots means more subtasks share the
same JVM.
• Tasks in the same JVM share TCP connections (via
multiplexing) and heartbeat messages.
• They may also share data sets and data structures,
thus reducing the per-task overhead.
• Flink allows subtasks to share slots even if they are
subtasks of different tasks, so long as they are from
the same job.
@arafkarsh arafkarsh
Anatomy of a Flink Program
66
1. Obtain an execution
environment,
2. Load/create the initial data,
3. Specify transformations on this
data,
4. Specify where to put the
results of your computations,
5. Trigger the program execution.
Will be triggered on your local machine
or submit your program for execution on
a cluster.
[Diagram: Source → Transform → Transform → Sink, with the five steps numbered]
Each program consists of the same basic parts:
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/#anatomy-of-a-flink-program
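A minimal sketch of the five parts in the DataStream API (Java), assuming a local socket as a test source; a word count stands in for the transformations:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class AnatomyExample {
    public static void main(String[] args) throws Exception {
        // 1. Obtain an execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Load/create the initial data (source; assumed local test socket)
        env.socketTextStream("localhost", 9999)
            // 3. Specify transformations on this data
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split("\\s+")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))  // lambda needs explicit type info
            .keyBy(t -> t.f0)
            .sum(1)
            // 4. Specify where to put the results (sink)
            .print();

        // 5. Trigger the program execution
        env.execute("Anatomy of a Flink Program");
    }
}
```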
@arafkarsh arafkarsh
External Components
67
Feature Description Implementation
1
High Availability
Service Provider
Flink's Job Manager can be run in high availability mode
which allows Flink to recover from Job Manager faults. In
order to failover faster, multiple standby Job Managers can
be started to act as backups.
• Zookeeper
• Kubernetes HA
2
File Storage and
Persistency
For checkpointing (recovery mechanism for streaming jobs)
Flink relies on external file storage systems
See FileSystems page.
3
Resource
Provider
Flink can be deployed through different Resource Provider
Frameworks, such as Kubernetes, YARN or Mesos.
• Kubernetes
• YARN
• Mesos
4 Metrics Storage
Flink components report internal metrics and Flink jobs can
report additional, job specific metrics as well.
See Metrics
Reporter page.
5
Application-level
data sources and
sinks
While application-level data sources and sinks are not
technically part of the deployment of Flink cluster
components, they should be considered when planning a
new Flink production deployment. Colocating frequently
used data with Flink can have significant performance
benefits
For example:
• Apache Kafka
• Amazon S3
• Amazon Kinesis
• Elastic Search
See Connectors page.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
@arafkarsh arafkarsh
Flink Scale
68
@arafkarsh arafkarsh
Flink API
69
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/overview/
@arafkarsh arafkarsh
Apache Flink
DataStream API
• Data Source
• Operators
• Data Sink
• Generating Watermarks
70
@arafkarsh arafkarsh
DataStream
71
• A DataStream is similar to a
regular Java Collection in
terms of usage but is quite
different in some key ways.
• They are immutable,
meaning that once they are
created you cannot add or
remove elements.
• You also cannot simply
inspect the elements inside
but only work on them using
the DataStream API
operations, which are also
called transformations.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
Reading from Socket
@arafkarsh arafkarsh
Data Sources
72
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
File-based:
• readTextFile(path) - Reads text files, i.e. files that respect the TextInputFormat specification, line-by-
line and returns them as Strings.
• readFile(fileInputFormat, path) - Reads (once) files as dictated by the specified file input format.
• readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method called
internally by the two previous ones. It reads files in the path based on the given fileInputFormat.
Depending on the provided watchType, this source may periodically monitor (every interval ms) the
path for new data (FileProcessingMode.PROCESS_CONTINUOUSLY), or process once the data
currently in the path and exit (FileProcessingMode.PROCESS_ONCE). Using the pathFilter, the user
can further exclude files from being processed.
@arafkarsh arafkarsh
Data Sources
73
Socket-based:
• socketTextStream - Reads from a socket. Elements can be separated by a delimiter.
Collection-based:
• fromCollection(Collection) - Creates a data stream from the Java Java.util.Collection. All elements in
the collection must be of the same type.
• fromCollection(Iterator, Class) - Creates a data stream from an iterator. The class specifies the data
type of the elements returned by the iterator.
• fromElements(T ...) - Creates a data stream from the given sequence of objects. All objects must be
of the same type.
• fromParallelCollection(SplittableIterator, Class) - Creates a data stream from an iterator, in parallel.
The class specifies the data type of the elements returned by the iterator.
• generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.
Custom:
• addSource - Attach a new source function. For example, to read from Apache Kafka you can
use addSource(new FlinkKafkaConsumer<>(...)). See connectors for more details.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
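A hedged sketch of addSource with the Kafka connector named above (Flink 1.14's FlinkKafkaConsumer); the broker address, group id, and topic are placeholders:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "flink-demo");              // placeholder group

        // addSource attaches the custom source function from the Kafka connector
        DataStream<String> events = env.addSource(
            new FlinkKafkaConsumer<>("checkout-events", new SimpleStringSchema(), props));

        events.print();
        env.execute("Kafka Source Example");
    }
}
```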
@arafkarsh arafkarsh
Data Sources: Custom Connectors
74
1. Apache Kafka (source/sink)
2. Apache Cassandra (sink)
3. Amazon Kinesis Streams (source/sink)
4. Elasticsearch (sink)
5. FileSystem (Hadoop included) - Streaming only sink (sink)
6. FileSystem (Hadoop included) - Streaming and Batch sink (sink)
7. FileSystem (Hadoop included) - Batch source
(https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/) (source)
8. RabbitMQ (source/sink)
9. Google PubSub (source/sink)
10. Hybrid Source (source)
11. Apache NiFi (source/sink)
12. Apache Pulsar (source)
13. Twitter Streaming API (source)
14. JDBC (sink)
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/
Bundled Connectors
1. Apache ActiveMQ (source/sink)
2. Apache Flume (sink)
3. Redis (sink)
4. Akka (sink)
5. Netty (source)
Apache Bahir
@arafkarsh arafkarsh
Data Sink
75
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
• writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained
by calling the toString() method of each element.
• writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field
delimiters are configurable. The value for each field comes from the toString() method of the
objects.
• print() / printToErr() - Prints the toString() value of each element on the standard out / standard
error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This
can help to distinguish between different calls to print. If the parallelism is greater than 1, the
output will also be prepended with the identifier of the task which produced the output.
• writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs.
Supports custom object-to-bytes conversion.
• writeToSocket - Writes elements to a socket according to a SerializationSchema
• addSink - Invokes a custom sink function. Flink comes bundled with connectors to other systems
(such as Apache Kafka) that are implemented as sink functions.
Data sinks consume DataStreams and forward them to files, sockets, external
systems, or print them
@arafkarsh arafkarsh
Execution Mode – Batch / Streaming
76
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/execution_mode/
The execution mode can be configured via the execution.runtime-mode setting.
There are three possible values:
1. STREAMING: The classic DataStream execution mode (default)
2. BATCH: Batch-style execution on the DataStream API
3. AUTOMATIC: Let the system decide based on the boundedness of the sources
• The BATCH execution mode can only be used for Jobs/Flink Programs that are bounded.
• Boundedness is a property of a data source that tells us whether all the input coming from that source is
known before execution or whether new data will show up, potentially indefinitely.
• A job, in turn, is bounded if all its sources are bounded, and unbounded otherwise.
• STREAMING execution mode, on the other hand, can be used for both bounded and unbounded jobs.
• As a rule of thumb, you should be using BATCH execution mode when your program is bounded because
this will be more efficient.
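A sketch of selecting the mode in code; it can equally be passed at submission time, e.g. bin/flink run -Dexecution.runtime-mode=BATCH …:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ModeSelection {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH is only valid when every source is bounded; AUTOMATIC lets Flink decide
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    }
}
```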
@arafkarsh arafkarsh
Stream Processing: Operators
77
Map
Takes one element and produces one
element. A map function that doubles the
values of the input stream:
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Flat Map
Takes one element and produces zero, one,
or more elements. A flatmap function that
splits sentences to words:
Filter
Evaluates a Boolean function for each
element and retains those for which the
function returns true.
Key By
Logically partitions a stream into disjoint
partitions. All records with the same key are
assigned to the same partition.
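A small sketch chaining the operators above on a toy stream (Java DataStream API); the keys and values are illustrative:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
            .map(t -> Tuple2.of(t.f0, t.f1 * 2))          // Map: doubles each value
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // explicit types for the lambda
            .filter(t -> t.f1 > 2)                        // Filter: keep values > 2
            .keyBy(t -> t.f0)                             // Key By: same key, same partition
            .sum(1)                                       // rolling sum per key
            .print();

        env.execute("Operator Examples");
    }
}
```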
@arafkarsh arafkarsh
Stream Processing: Operators
78
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Reduce
A “rolling” reduce on a keyed data stream.
Combines the current element with the last
reduced value and emits the new value.
Union
Union of two or more data streams creating a
new stream containing all the elements from all
the streams.
Join Join two data streams on a given key and a
common window.
Join
Interval
Join two elements e1 and e2 of two keyed
streams with a common key over a given time
interval, so that e1.timestamp + lowerBound <=
e2.timestamp <= e1.timestamp + upperBound
@arafkarsh arafkarsh
Stream Processing: Operators
79
Window
All
Windows can be defined on regular Data
Streams. Windows group all the stream
events according to some characteristic.
Window
Apply
Applies a general function to the window
as a whole. Below is a function that
manually sums the elements of a window.
Window
Reduce
Applies a functional reduce function to
the window and returns the reduced
value.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Window
Windows can be defined on already
partitioned Keyed Streams. Windows
group the data in each key according to
some characteristic.
@arafkarsh arafkarsh
Watermarks
80
1. Watermarks are provided by the Data Source of Application
2. They are part of the Stream and carry a timestamp
3. A Watermark asserts that all earlier events have probably arrived
• Watermark w9 asserts that all the events with time < w9 have arrived.
• Watermark w15 asserts that all the events with time < w15 have arrived.
[Diagram: an event stream with event timestamps, watermarks w5, w9, w15, w21, and late events]
@arafkarsh arafkarsh
Goal: Count Events in 10 Seconds Windows
81
[Diagram: events bucketed into 0–10 s, 11–20 s and 21–30 s windows; watermarks w5, w9, w15, w21 fire event-time timers, producing results R1 and R2 per window, with late events updating R2]
@arafkarsh arafkarsh
Allowed Lateness
82
• Once a window is fired it’s
state is freed & all the late
events are dropped.
• You can avoid the dropping of
the late events by configuring
the max time to wait for the
late events.
• With Sufficient lateness
allowed Event [4] and [13] are
updated in the respective
window and result is updated
(R2)
stream.window(<window assigner>).allowedLateness(<timer>)
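A concrete version of the snippet above, as a fragment assuming stream is a keyed-able DataStream of a POJO with hypothetical userId and amount fields:

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

stream
    .keyBy(e -> e.userId)                                  // hypothetical key field
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .allowedLateness(Time.minutes(2))                      // keep window state 2 extra minutes
    .sum("amount");                                        // hypothetical numeric field
```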
@arafkarsh arafkarsh
Timers
83
Explicit
o TimerService timerService = context.timerService();
o timerService.registerEventTimeTimer(event.timestamp); // Time In Millis
o timerService.registerProcessingTimeTimer(event.timestamp); // Time In Millis
Implicit
o stream.window(TumblingEventTimeWindows.of(Time.seconds(7)))
o stream.window(TumblingProcessingTimeWindows.of(Time.seconds(7)))
o SELECT user, SUM(amount)
o FROM Orders
o GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), user
Source: Streaming Concepts & Introduction – Feb 1, 2021: https://www.youtube.com/watch?v=QVDJFZVHZ3c
@arafkarsh arafkarsh
Watermarks – In Order Events
84
Watermarks:
• To measure progress in event time.
• It flow as part of the data stream and carry a timestamp t.
• A Watermark(t) declares that event time has reached time t in that stream.
• Meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e.
events with timestamps older or equal to the watermark).
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
@arafkarsh arafkarsh
Watermarks – Out of Order Events
85
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
• A watermark is a declaration that by that point in the stream, all events up to a
certain timestamp should have arrived.
• Once a watermark reaches an operator, the operator can advance its internal event
time clock to the value of the watermark.
@arafkarsh arafkarsh
Watermarks – in Parallel Streams
86
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
• Watermarks are
generated at, or directly
after, source functions.
• Each parallel subtask of
a source function
usually generates its
watermarks
independently.
• These watermarks
define the event time at
that particular parallel
source.
@arafkarsh arafkarsh
Generating Watermarks
87
In order to work with event time, Flink needs to know the events' timestamps, meaning each element in the
stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp
from some field in the element by using a Timestamp Assigner.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
Specifying a Timestamp Assigner is optional, and, in most cases, you don’t actually want to specify one. For
example, when using Kafka or Kinesis you would get timestamps directly from the Kafka/Kinesis records.
Idle Input Source
If one of the input splits/partitions/shards does not carry events for a while this means that
the Watermark Generator also does not get any new information on which to base a watermark.
To deal with this, you can use a Watermark Strategy that will detect idleness and mark an input as idle.
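A hedged sketch of a Watermark Strategy with bounded out-of-orderness, a timestamp assigner, and idleness detection; CheckoutEvent and its timestamp field are hypothetical:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<CheckoutEvent> strategy = WatermarkStrategy
    .<CheckoutEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // tolerate 5 s disorder
    .withTimestampAssigner((event, recordTs) -> event.timestamp)   // hypothetical field
    .withIdleness(Duration.ofMinutes(1));   // idle partitions stop holding back the watermark

stream.assignTimestampsAndWatermarks(strategy);
```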
@arafkarsh arafkarsh
Watermark Strategies
88
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
There are two places in Flink applications where
a Watermark Strategy can be used:
1. directly on sources and (RECOMMENDED)
2. after non-source operations.
The first option is preferable, because
• it allows sources to exploit knowledge about
shards/partitions/splits in the watermarking logic.
• Sources can usually then track watermarks at a finer
level and
• the overall watermark produced by a source will be
more accurate.
The second option (setting a Watermark Strategy after
arbitrary operations) should only be used if you cannot
set a strategy directly on the source.
After
non-source
operation
@arafkarsh arafkarsh
Periodic Watermark Generator
89
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
A periodic generator observes stream events and generates watermarks periodically (possibly depending on the
stream elements, or purely based on processing time).
@arafkarsh arafkarsh
Punctuated Watermark Generator
90
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
A punctuated watermark generator will observe the stream of events and emit a watermark whenever it sees a
special element that carries watermark information.
@arafkarsh arafkarsh
Watermark Summary
91
• Flink Supports different Types of Time
• Event Time
• Processing Time
• With Event Time
• Events can be out of Order
• Expect Deterministic Results
• Event time Applications are Responsible for
• Providing Watermarks
• Deciding how to handle late events
• Streaming Applications must trade off Completeness for Latency
• Can wait longer to have more complete information before acting
• Can wait less to reduce latency
• Watermarks are the mechanism for managing this trade off
Source: https://www.youtube.com/watch?v=QVDJFZVHZ3c
@arafkarsh arafkarsh
Core
Building
Blocks
• Event Time
• Event Streams
• State
• Snapshots 92
@arafkarsh arafkarsh
Flink Core Building Blocks
93
Event Streams
Real-time &
hindsight
State
Complex
Business Logic
Consistency
with out-of-
order data &
Late data
Event Time Snapshots
Forking /
versioning /
Time Travel
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
@arafkarsh arafkarsh
Flink API Architecture (v1.14)
94
Table / SQL API
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
Relational Planner
DataStream API Stateful Functions
Internal Streams API
Runtime
@arafkarsh arafkarsh 95
Consistency
with out-of-
order data &
Late data
Event Time
@arafkarsh arafkarsh
Handling Time
96
[Diagram: an event producer writes to partitions in the messaging layer (Kafka / Kinesis Data Streams) — event time at the producer, broker/ingestion time at the messaging layer, processing time at the Flink window operator]
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/time/
@arafkarsh arafkarsh
Handling Event Time
97
• Can Ensure Ordering of Event
Time
• Increases Latency for Ordered
Event Time
• Flink Reconstructs the order
Event time:
Event time is the time that each individual event occurred on its producing device.
Processing time:
Processing time refers to the system time of the machine that is executing the
respective operation.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
@arafkarsh arafkarsh
Windows
98
• Windows are at the heart of processing infinite streams.
• Windows split the stream into “buckets” of finite size, over which we can apply computations.
• A window is created as soon as the first element that should belong to it arrives, and the
• window is completely removed when the time (event or processing time) passes its end
timestamp plus the user-specified allowed lateness.
• Flink guarantees removal only for time-based windows.
• 2 Categories of Windows – Keyed keyBy(…) and non-Keyed Windows windowAll(…)
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
Types of Windows
1. Time Windows
2. Count Windows
@arafkarsh arafkarsh
Window (Sliding, Tumbling, Hopping, Session)
99
Source: https://docs.microsoft.com/en-us/stream-analytics-query/windowing-azure-stream-analytics
Sliding Tumbling
Hopping
Session
@arafkarsh arafkarsh
Window – Tumbling
100
Tumbling windows have a fixed size and do not overlap.
• Without offsets, hourly tumbling windows are aligned with epoch, that is, you will get windows such as
• 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999 and so on.
• With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999.
• An important use case for offsets is to adjust windows to time zones other than UTC-0.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
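A fragment showing both variants, assuming input is a keyed-able DataStream; the key field is hypothetical:

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Epoch-aligned hourly windows: 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999, ...
input.keyBy(e -> e.key)
     .window(TumblingEventTimeWindows.of(Time.hours(1)));

// 15-minute offset: 1:15:00.000 - 2:14:59.999, ... (useful for non-UTC time zones)
input.keyBy(e -> e.key)
     .window(TumblingEventTimeWindows.of(Time.hours(1), Time.minutes(15)));
```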
@arafkarsh arafkarsh
Window – Sliding
101
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
You could have windows of size 10 minutes that
slide by 5 minutes. With this you get every 5
minutes a window that contains the events that
arrived during the last 10 minutes.
@arafkarsh arafkarsh
Window – Session
102
• The session window groups elements
by sessions of activity.
• Session windows do not overlap and do
not have a fixed start and end time.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
@arafkarsh arafkarsh
Window – Global
103
• This windowing scheme is only useful if you also specify a custom trigger.
• Otherwise, no computation will be performed, as the global window does not have a natural end
at which we could process the aggregated elements.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
@arafkarsh arafkarsh
Window Functions
104
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
Reduce Function
Aggregate Function
Process Window Function
Process Window
Function with
Incremental Aggregation
• Window functions are used to specify the computation that needs to happen on the window.
• This is done when a Window is ready for Processing.
• Triggers are used to determine when the Window is ready for Computation.
The window function can be one of Reduce
Function, Aggregate Function, or Process Window Function.
The Reduce Function, Aggregate Function can be executed
more efficiently because Flink can incrementally aggregate
the elements for each window as they arrive.
A Process Window Function gets an Iterable for all the
elements contained in a window and additional meta
information about the window to which the elements belong.
@arafkarsh arafkarsh
Reduce Function
105
A Reduce Function specifies how two elements from the input are combined to
produce an output element of the same type.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
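A fragment of a windowed reduce, assuming input is a DataStream<Tuple2<String, Integer>>; it sums the second field, and input and output types are the same, as a ReduceFunction requires:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Sum the second field per key and 10-second window; one type in, same type out
input.keyBy(t -> t.f0)
     .window(TumblingEventTimeWindows.of(Time.seconds(10)))
     .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
```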
@arafkarsh arafkarsh
Aggregate Function
106
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
An Aggregate Function is a generalized version of
a Reduce Function that has three types:
1. an input type (IN),
2. accumulator type (ACC),
3. and an output type (OUT).
The input type is the type of elements in the input
stream and the Aggregate Function has a method
for adding one input element to an accumulator.
The interface also has methods for
1. creating an initial accumulator,
2. for merging two accumulators into one
accumulator and for
3. extracting an output (of type OUT) from an
accumulator.
Same as with Reduce Function, Flink will
incrementally aggregate input elements of a
window as they arrive.
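A sketch of the three types in practice — an average over (key, value) pairs; this mirrors the shape of the interface described above:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// IN = Tuple2<String, Long>, ACC = Tuple2<Long, Long> (sum, count), OUT = Double (average)
public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L);                          // initial (sum, count)
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1L);  // add one input to the accumulator
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return acc.f1 == 0 ? 0.0 : ((double) acc.f0) / acc.f1;  // extract OUT from ACC
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);        // merge two accumulators
    }
}
```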
@arafkarsh arafkarsh
Process Function
107
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
A Process Window Function gets an Iterable containing all the elements of the window, and a
Context object with access to time and state information, which enables it to provide more
flexibility than other window functions.
@arafkarsh arafkarsh
Process Function with Incremental Aggregation
108
A Process Window Function can be combined with either a Reduce Function, or an Aggregate Function to
incrementally aggregate elements as they arrive in the window. When the window is closed, the Process
Window Function will be provided with the aggregated result.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
@arafkarsh arafkarsh
Trigger
109
• A Trigger determines when a window (as formed by the window assigner) is
ready to be processed by the window function.
• Each window assigner comes with a default Trigger.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
1. The onElement() method is called for each element that is added to a window.
2. The onEventTime() method is called when a registered event-time timer fires.
3. The onProcessingTime() method is called when a registered processing-time
timer fires.
4. The onMerge() method is relevant for stateful triggers and merges the states of
two triggers when their corresponding windows merge, e.g. when using session
windows.
5. The clear() method performs any action needed upon removal of the
corresponding window.
@arafkarsh arafkarsh
Evictor
110
Flink's windowing model allows specifying an optional Evictor in addition to the Window Assigner and
the Trigger. This can be done using the evictor(...) method (shown in the beginning of this document). The
evictor has the ability to remove elements from a window after the trigger fires and before and/or after the
window function is applied.
Flink comes with three pre-implemented evictors. These are:
• Count Evictor: keeps up to a user-specified number of elements from the window and discards the
remaining ones from the beginning of the window buffer.
• Delta Evictor: takes a Delta Function and a threshold, computes the delta between the last element in the
window buffer and each of the remaining ones, and removes the ones with a delta greater or equal to the
threshold.
• Time Evictor: takes as argument an interval in milliseconds and for a given window, it finds the maximum
timestamp max_ts among its elements and removes all the elements with timestamps smaller
than max_ts - interval.
• By default, all the pre-implemented evictors apply their logic before the window function
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
@arafkarsh arafkarsh
Handling Late Events
111
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/#allowed-lateness
• By default, the allowed lateness is set to 0.
• That is, elements that arrive behind the watermark will be dropped.
@arafkarsh arafkarsh
Late Events – Side Out
112
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
Using Flink’s side output feature you can get a stream of the data that was discarded
as late.
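A fragment wiring up the side output, assuming hypothetical Event / Result types and window function:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so Flink keeps the generic type information
final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Result> result = input
    .keyBy(e -> e.key)
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .sideOutputLateData(lateTag)            // late records go here instead of being dropped
    .process(new MyWindowFunction());       // hypothetical ProcessWindowFunction

DataStream<Event> lateStream = result.getSideOutput(lateTag);
```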
@arafkarsh arafkarsh 113
Event Streams
Real-time &
hindsight
State
Complex
Business Logic
@arafkarsh arafkarsh
Streams & Batch Processing
114
• Processes “unbounded” (stream) and “bounded” (batch) data
• Processes recorded (offline) and live (real-time) data
• Batch is just a special case of streaming data
[Diagram: an event log timeline from past to future; bounded streams cover fixed segments, unbounded streams run from a start point onward past "now"]
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
@arafkarsh arafkarsh
Stateful Event & Stream Processing
115
[Diagram: streaming data flow — Source → Transform → Window → Sink]
@arafkarsh arafkarsh
Stateful Event & Stream Processing
116
[Diagram: Source → Filter & Transform → keyBy() → Window (state read & write) → Sink; raw events are partitioned by key (R1, R3, R5 / R2, R4, R6) into scalable local state, aggregated in memory in parallel, and a new aggregated event is emitted; external storage holds the snapshots]
@arafkarsh arafkarsh 117
Snapshots
Forking /
versioning /
Time Travel
@arafkarsh arafkarsh
Storage for States
118
Processor
State
External
External Storage
Processor
State
Snapshots
Internal Storage
Internal
• Independent from Processing
• Low Performance due to remote Storage
• Hard to get ”Exactly-Once” guarantees
• Highly consistent distributed Snapshotting
• Faster access with Local Storage
• Stream processor needs to handle scaling and
storage
@arafkarsh arafkarsh
Checkpoint Barrier
119
[Diagram: a checkpoint barrier flows with the stream from the sources (record offsets in the shards/partitions) through keyBy() into the windowed operators (local state: aggregates, counts, etc.) and on to the sink; the Job Manager triggers checkpoints via RPC, and snapshots go to fault-tolerant storage such as HDFS, S3, or NFS]
@arafkarsh arafkarsh
Snapshot Alignment
120
[Diagram: the same pipeline; barriers from all input partitions are aligned at each operator before its local state is snapshotted to fault-tolerant storage]
@arafkarsh arafkarsh
Snapshots & Fault Tolerance
122
[Diagram: on failure, Flink reloads operator state from the latest snapshot in fault-tolerant storage (HDFS, S3, NFS…), resets the read positions in the input stream, and rolls back / re-processes the computation]
@arafkarsh arafkarsh
Configure Checkpoints – Local Storage
123
Processor
State
Snapshots
HashMap State Backend
• Store the state in Memory (HashMap)
• Faster access with Memory Storage
• Subject to Garbage Collection
Processor
State
Snapshots
RocksDB State Backend
• Stores the state in Local RocksDB
• Limited only by Local Disk Size
• Slower than Memory Storage (10x Slower)
• Serialize on write and Deserialize on read
RocksDB
Key Value
Storage
• Jobs with large state, long
windows, large key/value
states.
• All high-availability setups
• Jobs with very large state,
long windows, large
key/value states.
• All high-availability setups
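A sketch of configuring either backend plus durable checkpoint storage (Flink 1.14 APIs); the S3 URI and intervals are placeholders:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state on the JVM heap (fast, subject to GC):
        env.setStateBackend(new HashMapStateBackend());
        // Or working state in local RocksDB (larger-than-memory, slower access):
        // env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Snapshots go to durable storage; bucket URI is a placeholder
        env.enableCheckpointing(60_000);  // checkpoint every 60 s
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
    }
}
```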
@arafkarsh arafkarsh
Integration & Comparisons
• Integration
• Comparison with Spark
124
@arafkarsh arafkarsh
Integrations
125
• Event Logs
• Kafka, AWS Kinesis, Pulsar
• File Systems
• HDFS, NFS, S3, MapR FS…
• Databases
• JDBC, Hcatalog
• Encodings
• Avro, JSON, CSV, Parquet, ORC
• Key Value Stores
• Redis, Cassandra, Elastic Search
@arafkarsh arafkarsh
Apache Flink Vs Apache Spark
126
Features Flink Spark
1 Developed in Java Scala
2 Streaming Model Windowing & Checkpoints Micro batching
3 Real Time Processing Real time Processing Near Real time
4 Models Data Stream / Table SQL RDD
5 Performance High Medium
6 Supported Languages Java, Scala, Python, SQL Java, Scala, Python, R, SQL
7 SQL Analytics Yes Yes
8 Runs on Hadoop, Mesos, Kubernetes,
AWS Kinesis, ….
Hadoop, Mesos, Kubernetes
AWS EMR
9 Machine Learning Yes - FlinkML Yes
FlinkML: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/libs/ml/index.html
@arafkarsh arafkarsh
Flink Summary
127
1. Distributed and Fault Tolerant
2. Stateful, No DB Needed
3. Horizontally Scalable
4. Parallel Execution, No
Concurrency Issues
@arafkarsh arafkarsh
Case Studies
1. HP Ink Cartridge Manufacturing Process
2. Infor: Compliance Violation (Banking)
3. Biogen: Centralized Log Management
4. Viber: Massive Data Handling - 300K Msgs / Second
5. AWS: IoT Data using Firehose and Data Analytics
6. Nordstrom: Ledger with Multi Data Views
128
4
@arafkarsh arafkarsh
HP: Ink Cartridge Manufacturing Process
• From the Factory, Data comes into Kinesis
• Using Lambdas, Data is stored in
DynamoDB (Sequential Ops)
• Firehose stores Raw Data in S3
• Enriched Data is stored in
Aurora, Elasticsearch and S3
• Glue is used for Batch Processing
Source: https://www.youtube.com/watch?v=KM5ONS2fnG0
129
@arafkarsh arafkarsh
Infor: Compliance Violation Realtime / Batch
• Security & Tx Data is sent to the Kinesis Data
Stream
• Services in Fargate pick up the data from
KDS and send it to Aurora & S3
• Scheduler (5) invokes a service for EMR
processing.
• EMR fetches data from Aurora & S3 and
sends data to EventBridge
• EventBridge (10) sends data to SQS
• A Service in Fargate picks up the data from
SQS and sends out email.
Source: https://www.youtube.com/watch?v=0gNMEyei-co
130
@arafkarsh arafkarsh
Biogen: Centralized Log Management
• Application, Network and VPC Logs are
sent to Kinesis Firehose
• Firehose (4) sends data to Lambda
• Lambda (5) Enriches / Normalizes the
data and stores it in S3
• Lambda (7) picks up the data from
S3 and stores it in Elasticsearch
• Kibana is used for Data
Visualization.
Source: https://www.youtube.com/watch?v=m8xtR3-ZQs8
131
@arafkarsh arafkarsh
Viber: Massive Data Lakes 300k Msgs / Second
• From the Viber BE, events are batched
and sent to Kinesis.
• Using the KCL in Apache Storm, Events
are picked from Kinesis and, using
Firehose, Events are stored in S3
• Aggregated Data is sent to
another Kinesis Stream and, using
a Lambda, the event is sent to the
Viber BE based on Rules.
Source: https://www.youtube.com/watch?v=7i1tj59pvYw
EMR – Elastic Map Reduce
132
@arafkarsh arafkarsh
Nordstrom: Ledger with Multi Data Views
• Customer Data is stored in a
Kinesis Data Stream as Raw Data
(Ledger)
• Firehose (4) stores Raw Data in an
S3 Bucket
• Lambda (5.1-5.3) Transforms and
stores data in different DBs in
different formats for various
Read usages.
Source: https://www.youtube.com/watch?v=O7PTtm_3Os4
133
@arafkarsh arafkarsh
AWS: IoT Data – Firehose – Analytics – DynamoDB
• MQTT-based Data from IoT
• Firehose stores the data in S3
• Kinesis DA gets the data from
Firehose, analyzes it and sends
results to Firehose to store in S3
• Using Lambda, the data is
enriched and stored in
DynamoDB
• Using a Web-Based App, the user gets
the data from DynamoDB
Source: https://www.youtube.com/watch?v=uWUAcc68MWI
134
@arafkarsh arafkarsh 135
Design Patterns are
solutions to general
problems that
software developers
face during software
development.
Design Patterns
@arafkarsh arafkarsh 136
DREAM | AUTOMATE | EMPOWER
Araf Karsh Hamid :
India: +91.999.545.8627
http://www.slideshare.net/arafkarsh
https://www.linkedin.com/in/arafkarsh/
https://www.youtube.com/user/arafkarsh/playlists
http://www.arafkarsh.com/
@arafkarsh
arafkarsh
@arafkarsh arafkarsh 137
Source Code: https://github.com/MetaArivu Web Site: https://metarivu.com/ https://pyxida.cloud/
@arafkarsh arafkarsh 138
http://www.slideshare.net/arafkarsh
@arafkarsh arafkarsh
References
1. July 15, 2015 – Agile is Dead : GoTo 2015 By Dave Thomas
2. Apr 7, 2016 - Agile Project Management with Kanban | Eric Brechner | Talks at Google
3. Sep 27, 2017 - Scrum vs Kanban - Two Agile Teams Go Head-to-Head
4. Feb 17, 2019 - Lean vs Agile vs Design Thinking
5. Dec 17, 2020 - Scrum vs Kanban | Differences & Similarities Between Scrum & Kanban
6. Feb 24, 2021 - Agile Methodology Tutorial for Beginners | Jira Tutorial | Agile Methodology Explained.
Agile Methodologies
139
@arafkarsh arafkarsh
References
1. Vmware: What is Cloud Architecture?
2. Redhat: What is Cloud Architecture?
3. Cloud Computing Architecture
4. Cloud Adoption Essentials:
5. Google: Hybrid and Multi Cloud
6. IBM: Hybrid Cloud Architecture Intro
7. IBM: Hybrid Cloud Architecture: Part 1
8. IBM: Hybrid Cloud Architecture: Part 2
9. Cloud Computing Basics: IaaS, PaaS, SaaS
140
1. IBM: IaaS Explained
2. IBM: PaaS Explained
3. IBM: SaaS Explained
4. IBM: FaaS Explained
5. IBM: What is Hypervisor?
Cloud Architecture
@arafkarsh arafkarsh
References
Microservices
1. Microservices Definition by Martin Fowler
2. When to use Microservices By Martin Fowler
3. GoTo: Sep 3, 2020: When to use Microservices By Martin Fowler
4. GoTo: Feb 26, 2020: Monolith Decomposition Pattern
5. Thought Works: Microservices in a Nutshell
6. Microservices Prerequisites
7. What do you mean by Event Driven?
8. Understanding Event Driven Design Patterns for Microservices
141
@arafkarsh arafkarsh
References – Microservices – Videos
142
1. Martin Fowler – Micro Services : https://www.youtube.com/watch?v=2yko4TbC8cI&feature=youtu.be&t=15m53s
2. GOTO 2016 – Microservices at NetFlix Scale: Principles, Tradeoffs & Lessons Learned. By R Meshenberg
3. Mastering Chaos – A NetFlix Guide to Microservices. By Josh Evans
4. GOTO 2015 – Challenges Implementing Micro Services By Fred George
5. GOTO 2016 – From Monolith to Microservices at Zalando. By Rodrigue Scaefer
6. GOTO 2015 – Microservices @ Spotify. By Kevin Goldsmith
7. Modelling Microservices @ Spotify : https://www.youtube.com/watch?v=7XDA044tl8k
8. GOTO 2015 – DDD & Microservices: At last, Some Boundaries By Eric Evans
9. GOTO 2016 – What I wish I had known before Scaling Uber to 1000 Services. By Matt Ranney
10. DDD Europe – Tackling Complexity in the Heart of Software By Eric Evans, April 11, 2016
11. AWS re:Invent 2016 – From Monolithic to Microservices: Evolving Architecture Patterns. By Emerson Loureiro (Gilt) and Derek Chiles
12. AWS 2017 – An overview of designing Microservices based Applications on AWS. By Peter Dalbhanjan
13. GOTO Jun, 2017 – Effective Microservices in a Data Centric World. By Randy Shoup.
14. GOTO July, 2017 – The Seven (more) Deadly Sins of Microservices. By Daniel Bryant
15. Sept, 2017 – Airbnb, From Monolith to Microservices: How to scale your Architecture. By Melanie Cubula
16. GOTO Sept, 2017 – Rethinking Microservices with Stateful Streams. By Ben Stopford.
17. GOTO 2017 – Microservices without Servers. By Glynn Bird.
@arafkarsh arafkarsh
References
143
Domain Driven Design
1. Oct 27, 2012 What I have learned about DDD Since the book. By Eric Evans
2. Mar 19, 2013 Domain Driven Design By Eric Evans
3. Jun 02, 2015 Applied DDD in Java EE 7 and Open Source World
4. Aug 23, 2016 Domain Driven Design the Good Parts By Jimmy Bogard
5. Sep 22, 2016 GOTO 2015 – DDD & REST Domain Driven APIs for the Web. By Oliver Gierke
6. Jan 24, 2017 Spring Developer – Developing Micro Services with Aggregates. By Chris Richardson
7. May 17, 2017 DEVOXX – The Art of Discovering Bounded Contexts. By Nick Tune
8. Dec 21, 2019 What is DDD - Eric Evans - DDD Europe 2019. By Eric Evans
9. Oct 2, 2020 - Bounded Contexts - Eric Evans - DDD Europe 2020. By Eric Evans
10. Oct 2, 2020 - DDD By Example - Paul Rayner - DDD Europe 2020. By Paul Rayner
@arafkarsh arafkarsh
References
Event Sourcing and CQRS
1. IBM: Event Driven Architecture – Mar 21, 2021
2. Martin Fowler: Event Driven Architecture – GOTO 2017
3. Greg Young: A Decade of DDD, Event Sourcing & CQRS – April 11, 2016
4. Nov 13, 2014 GOTO 2014 – Event Sourcing. By Greg Young
5. Mar 22, 2016 Building Micro Services with Event Sourcing and CQRS
6. Apr 15, 2016 YOW! Nights – Event Sourcing. By Martin Fowler
7. May 08, 2017 When Micro Services Meet Event Sourcing. By Vinicius Gomes
144
@arafkarsh arafkarsh
References
145
Kafka
1. Understanding Kafka
2. Understanding RabbitMQ
3. IBM: Apache Kafka – Sept 18, 2020
4. Confluent: Apache Kafka Fundamentals – April 25, 2020
5. Confluent: How Kafka Works – Aug 25, 2020
6. Confluent: How to integrate Kafka into your environment – Aug 25, 2020
7. Kafka Streams – Sept 4, 2021
8. Kafka: Processing Streaming Data with KSQL – Jul 16, 2018
9. Kafka: Processing Streaming Data with KSQL – Nov 28, 2019
@arafkarsh arafkarsh
References
Databases: Big Data / Cloud Databases
1. Google: How to Choose the right database?
2. AWS: Choosing the right Database
3. IBM: NoSQL Vs. SQL
4. A Guide to NoSQL Databases
5. How does NoSQL Databases Work?
6. What is Better? SQL or NoSQL?
7. What is DBaaS?
8. NoSQL Concepts
9. Key Value Databases
10. Document Databases
11. Jun 29, 2012 – Google I/O 2012 - SQL vs NoSQL: Battle of the Backends
12. Feb 19, 2013 - Introduction to NoSQL • Martin Fowler • GOTO 2012
13. Jul 25, 2018 - SQL vs NoSQL or MySQL vs MongoDB
14. Oct 30, 2020 - Column vs Row Oriented Databases Explained
15. Dec 9, 2020 - How do NoSQL databases work? Simply Explained!
1. Graph Databases
2. Column Databases
3. Row Vs. Column Oriented Databases
4. Database Indexing Explained
5. MongoDB Indexing
6. AWS: DynamoDB Global Indexing
7. AWS: DynamoDB Local Indexing
8. Google Cloud Spanner
9. AWS: DynamoDB Design Patterns
10. Cloud Provider Database Comparisons
11. CockroachDB: When to use a Cloud DB?
146
@arafkarsh arafkarsh
References
Docker / Kubernetes / Istio
1. IBM: Virtual Machines and Containers
2. IBM: What is a Hypervisor?
3. IBM: Docker Vs. Kubernetes
4. IBM: Containerization Explained
5. IBM: Kubernetes Explained
6. IBM: Kubernetes Ingress in 5 Minutes
7. Microsoft: How Service Mesh works in Kubernetes
8. IBM: Istio Service Mesh Explained
9. IBM: Kubernetes and OpenShift
10. IBM: Kubernetes Operators
11. 10 Consideration for Kubernetes Deployments
Istio – Metrics
1. Istio – Metrics
2. Monitoring Istio Mesh with Grafana
3. Visualize your Istio Service Mesh
4. Security and Monitoring with Istio
5. Observing Services using Prometheus, Grafana, Kiali
6. Istio Cookbook: Kiali Recipe
7. Kubernetes: Open Telemetry
8. Open Telemetry
9. How Prometheus works
10. IBM: Observability vs. Monitoring
147
@arafkarsh arafkarsh
References
148
1. Feb 6, 2020 – An introduction to TDD
2. Aug 14, 2019 – Component Software Testing
3. May 30, 2020 – What is Component Testing?
4. Apr 23, 2013 – Component Test By Martin Fowler
5. Jan 12, 2011 – Contract Testing By Martin Fowler
6. Jan 16, 2018 – Integration Testing By Martin Fowler
7. Testing Strategies in Microservices Architecture
8. Practical Test Pyramid By Ham Vocke
Testing – TDD / BDD
@arafkarsh arafkarsh 149
1. Simoorg : LinkedIn’s own failure inducer framework. It was designed to be easy to extend, and most of the important components are pluggable.
2. Pumba : A chaos testing and network emulation tool for Docker.
3. Chaos Lemur : Self-hostable application to randomly destroy virtual machines in a BOSH-managed environment, as an aid to resilience testing of high-availability systems.
4. Chaos Lambda : Randomly terminate AWS ASG instances during business hours.
5. Blockade : Docker-based utility for testing network failures and partitions in distributed
applications.
6. Chaos-http-proxy : Introduces failures into HTTP requests via a proxy server.
7. Monkey-ops : Monkey-Ops is a simple service implemented in Go, which is deployed into an
OpenShift V3.X and generates some chaos within it. Monkey-Ops seeks some OpenShift
components like Pods or Deployment Configs and randomly terminates them.
8. Chaos Dingo : Chaos Dingo currently supports performing operations on Azure VMs and VMSS
deployed to an Azure Resource Manager-based resource group.
9. Tugbot : Testing in Production (TiP) framework for Docker.
Testing tools
@arafkarsh arafkarsh
References
CI / CD
1. What is Continuous Integration?
2. What is Continuous Delivery?
3. CI / CD Pipeline
4. What is CI / CD Pipeline?
5. CI / CD Explained
6. CI / CD Pipeline using Java Example Part 1
7. CI / CD Pipeline using Ansible Part 2
8. Declarative Pipeline vs Scripted Pipeline
9. Complete Jenkins Pipeline Tutorial
10. Common Pipeline Mistakes
11. CI / CD for a Docker Application
150
@arafkarsh arafkarsh
References
151
DevOps
1. IBM: What is DevOps?
2. IBM: Cloud Native DevOps Explained
3. IBM: Application Transformation
4. IBM: Virtualization Explained
5. What is DevOps? Easy Way
6. DevOps?! How to become a DevOps Engineer???
7. Amazon: https://www.youtube.com/watch?v=mBU3AJ3j1rg
8. NetFlix: https://www.youtube.com/watch?v=UTKIT6STSVM
9. DevOps and SRE: https://www.youtube.com/watch?v=uTEL8Ff1Zvk
10. SLI, SLO, SLA : https://www.youtube.com/watch?v=tEylFyxbDLE
11. DevOps and SRE : Risks and Budgets : https://www.youtube.com/watch?v=y2ILKr8kCJU
12. SRE @ Google: https://www.youtube.com/watch?v=d2wn_E1jxn4
@arafkarsh arafkarsh
References
152
1. Lewis, James, and Martin Fowler. “Microservices: A Definition of This New Architectural Term”, March 25, 2014.
2. Miller, Matt. “Innovate or Die: The Rise of Microservices”. The Wall Street Journal, October 5, 2015.
3. Newman, Sam. Building Microservices. O’Reilly Media, 2015.
4. Alagarasan, Vijay. “Seven Microservices Anti-patterns”, August 24, 2015.
5. Cockcroft, Adrian. “State of the Art in Microservices”, December 4, 2014.
6. Fowler, Martin. “Microservice Prerequisites”, August 28, 2014.
7. Fowler, Martin. “Microservice Tradeoffs”, July 1, 2015.
8. Humble, Jez. “Four Principles of Low-Risk Software Release”, February 16, 2012.
9. Zuul Edge Server, Ketan Gote, May 22, 2017
10. Ribbon, Hysterix using Spring Feign, Ketan Gote, May 22, 2017
11. Eureka Server with Spring Cloud, Ketan Gote, May 22, 2017
12. Apache Kafka, A Distributed Streaming Platform, Ketan Gote, May 20, 2017
13. Functional Reactive Programming, Araf Karsh Hamid, August 7, 2016
14. Enterprise Software Architectures, Araf Karsh Hamid, July 30, 2016
15. Docker and Linux Containers, Araf Karsh Hamid, April 28, 2015
@arafkarsh arafkarsh
References
153
16. MSDN – Microsoft https://msdn.microsoft.com/en-us/library/dn568103.aspx
17. Martin Fowler : CQRS – http://martinfowler.com/bliki/CQRS.html
18. Udi Dahan : CQRS – http://www.udidahan.com/2009/12/09/clarified-cqrs/
19. Greg Young : CQRS - https://www.youtube.com/watch?v=JHGkaShoyNs
20. Bertrand Meyer – CQS - http://en.wikipedia.org/wiki/Bertrand_Meyer
21. CQS : http://en.wikipedia.org/wiki/Command–query_separation
22. CAP Theorem : http://en.wikipedia.org/wiki/CAP_theorem
23. CAP Theorem : http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
24. CAP 12 years how the rules have changed
25. EBay Scalability Best Practices : http://www.infoq.com/articles/ebay-scalability-best-practices
26. Pat Helland (Amazon) : Life beyond distributed transactions
27. Stanford University: Rx https://www.youtube.com/watch?v=y9xudo3C1Cw
28. Princeton University: SAGAS (1987) Hector Garcia Molina / Kenneth Salem
29. Rx Observable : https://dzone.com/articles/using-rx-java-observable

  • 1. @arafkarsh arafkarsh ARAF KARSH HAMID Co-Founder / CTO MetaMagic Global Inc., NJ, USA @arafkarsh arafkarsh Microservice Architecture Series Building Cloud Native Apps Kinesis Data Steams Kinesis Firehose Kinesis Data Analytics Apache Flink Part 3 of 11
  • 2. @arafkarsh arafkarsh 2 Slides are color coded based on the topic colors. AWS Kinesis Video Streams Data Streams 1 AWS Kinesis Data Firehose Data Analytics 2 Apache Flink Streams Table / SQL 3 Kinesis Case Studies 4
  • 3. @arafkarsh arafkarsh Agile Scrum (4-6 Weeks) Developer Journey Monolithic Domain Driven Design Event Sourcing and CQRS Waterfall Optional Design Patterns Continuous Integration (CI) 6/12 Months Enterprise Service Bus Relational Database [SQL] / NoSQL Development QA / QC Ops 3 Microservices Domain Driven Design Event Sourcing and CQRS Scrum / Kanban (1-5 Days) Mandatory Design Patterns Infrastructure Design Patterns CI DevOps Event Streaming / Replicated Logs SQL NoSQL CD Container Orchestrator Service Mesh
  • 4. @arafkarsh arafkarsh Application Modernization – 3 Transformations 4 Monolithic SOA Microservice Physical Server Virtual Machine Cloud Waterfall Agile DevOps Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY Architecture Infrastructure Delivery
  • 5. @arafkarsh arafkarsh Application Modernization – 3 Transformations Monolithic SOA Microservice Physical Server Virtual Machine Cloud Waterfall Agile DevOps Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY Architecture Infrastructure Delivery Modernization 1 2 3 5
  • 6. @arafkarsh arafkarsh Microservices Principles…. 6 Components via Services Organized around Business Capabilities Products NOT Projects Smart Endpoints & Dumb Pipes Decentralized Governance & Data Management Infrastructure Automation Design for Failure Evolutionary Design
  • 7. @arafkarsh arafkarsh AWS Kinesis • Data Streams • Video Streams 7 1 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 8. @arafkarsh arafkarsh AWS Kinesis - Purpose 1. Collect 2. Process 3. Analyze Realtime 4. Streaming Data Ingest Realtime Data 1. Video 2. Audio 3. Application Logs 4. Website Click Streams IoT Telemetry Data 1. Analytics 2. Machine Learning 8
  • 9. @arafkarsh arafkarsh AWS Kinesis Kinesis Video Streams helps you to securely stream video from systems to AWS for processing such as Analytics, Machine Learning and others. Kinesis Data Streams are a highly Scalable, Durable, & Realtime data streaming service that can capture Gigabytes of data per second different data sources. Kinesis Data Firehose is used to Extract, Load, Transform (ETL) data streams into AWS stores like S3, Redshift, Open Search etc. for near Realtime data analytics. Kinesis Data Analytics is used to process the real-time streams in SQL or Java or Python. 9
  • 10. @arafkarsh arafkarsh Streaming Data • Continuously generated Data to be processed sequentially or incrementally • Data is sent record by record by thousands or over a sliding time windows of Data Sources Use Cases Gaming Stock Market Real Estate Transport Applications 10
  • 11. @arafkarsh arafkarsh Kinesis Video Streams Devices Processing • AWS Rekognition • AWS Sage Maker • Tensor Flow • HLS Playback • Custom Video Processing • Automatically scales the infrastructure needed for streaming video data from devices • Stream video from connected devices to AWS for Analytics, Machine Learning, Playback etc. • Stores, Encrypts and indexes video data and access the data using APIs HLS – HTTP Live Streaming INPUT Kinesis Video Stream 11
  • 12. @arafkarsh arafkarsh Kinesis Data Streams Applications Processing • Kinesis Data Analytics • Spark • AWS EC2 • AWS Lambda • Kinesis Data Streams are Highly Scalable and Durable Real-time streaming • Stream Data from connected devices to AWS for Analytics, Machine Learning. etc. INPUT Kinesis Data Stream 12
  • 13. @arafkarsh arafkarsh Kinesis Data Streams: Example Applications • Raw Events are coming from Cart Checkout • Using the Lambda, the Raw Event is Enriched and send to another Stream for further processing Event Producer Kinesis Data Stream Raw Events 13 Kinesis Data Stream Enriched Events Enrich the Checkout Event IN OUT Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 14. @arafkarsh arafkarsh Kinesis Data Firehose Store Data • AWS S3 • AWS Redshift • AWS Elastic Search • Splunk • Kinesis Data Firehose is to store the streaming data into Data Stores, Lakes etc. • Firehose is used to Capture, Transform and Load Data into S3, Redshift etc. Kinesis Data Stream Kinesis Data Firehose Data Transformation using Lambda 14
  • 15. @arafkarsh arafkarsh Kinesis Data Analytics • Kinesis Data Analytics is used to analyze the streaming Data • Reduces the complexity in building and deploying Analytics Applications • Provides built-in Functions to Filter, Aggregate and Transform Streaming Data • Serverless Architecture • Under the hood its Apache Flink (v1.13) INPUT Kinesis Data Stream Kinesis Data Analytics OUTPUT Kinesis Data Stream 15
  • 16. @arafkarsh arafkarsh AWS Kinesis – Summary 16 Kinesis Video Streams helps you to securely stream video from systems to AWS for processing such as Analytics, Machine Learning and others. Kinesis Data Streams are a highly Scalable, Durable, & Realtime data streaming service that can capture Gigabytes of data per second different data sources. Kinesis Data Firehose is used to Extract, Load, Transform (ETL) data streams into AWS stores like S3, Redshift, Open Search etc. for near Realtime data analytics. Kinesis Data Analytics is used to process the real-time streams in SQL or Java or Python.
  • 17. @arafkarsh arafkarsh Kinesis Data Streams Producers Consumers 17 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 18. @arafkarsh arafkarsh How it works Source: https://aws.amazon.com/kinesis/data-streams/ 18
  • 20. @arafkarsh arafkarsh Kinesis Data Streams Data Record The atomic unit of data in a Data Stream stored in Kinesis Data Stream Collection of Data Records streamed and stored in multiple shards. Data Record Data Record Data Record Data Record Data Stream Data Record Data Record Data Record Shard 1 Data Record Data Record Data Record Shard 2 Data Record Data Record Data Record Shard n Data Stream 20 Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html Producer puts the Data Records into the Shards and Consumer retrieves the data from the Shard.
  • 21. @arafkarsh arafkarsh Kinesis Data Streams: Shards 21 • A shard is a uniquely identified sequence of data records in a stream. • A stream is composed of one or more shards, each of which provides a fixed unit of capacity. • Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). • The data capacity of your stream is a function of the number of shards that you specify for the stream. Data Record Data Record Data Record Shard 1 Data Record Data Record Data Record Shard 2 Data Record Data Record Data Record Shard n Data Stream Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html • The total capacity of the stream is the sum of the capacities of its shards.
  • 22. @arafkarsh arafkarsh Kinesis Data Streams: Partition Keys 22 Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html Partition Key Data BLOB • A partition key is used to group data by shard within a stream. • Kinesis Data Streams segregates the data records belonging to a stream into multiple shards. • It uses the partition key that is associated with each data record to determine which shard a given data record belongs to. • Partition keys are Unicode strings, with a maximum length limit of 256 characters for each key. • An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards. • When an application puts data into a stream, it must specify a partition key.
  • 23. @arafkarsh arafkarsh Kinesis Data Streams: Sequence Number 23 • Each data record has a sequence number that is unique per partition-key within its shard. • Kinesis Data Streams assigns the sequence number after you write to the stream with client.putRecords or client.putRecord. • Sequence numbers for the same partition key generally increase over time. • The longer the time period between write requests, the larger the sequence numbers become. Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
  • 24. @arafkarsh arafkarsh Kinesis Data Stream Lambda Config 24 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart You can Control the Stream • Batch Size • Batch Window in Seconds • Max Retry
  • 25. @arafkarsh arafkarsh Kinesis Data Stream Lambda 25 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 26. @arafkarsh arafkarsh Multi Consumer Fan out Source: https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html 26
  • 27. @arafkarsh arafkarsh Data Stream – On Demand Scaling On-Demand • Automatically provisions the infrastructure • Max 200 MiB per Second OR • Max 200K Records per Second 27
  • 28. @arafkarsh arafkarsh Data Stream – Retention 1 Day 28
  • 29. @arafkarsh arafkarsh Data Stream – Retention 365 Days Retention Days • Min 1 Day • Max 365 Days 29
  • 31. @arafkarsh arafkarsh Security • Data is automatically encrypted before its stored in the Shard. • Encryption is done using AWS KMS Customer Master Key Server-Side Encryption 31
  • 32. @arafkarsh arafkarsh Kinesis Video Streams • Realtime using WebRTC • Batch Mode 32
  • 33. @arafkarsh arafkarsh Kinesis Video Streams Video Producer Library 1. Java 2. Android 3. C++ 4. C 33
  • 34. @arafkarsh arafkarsh Kinesis Video Stream Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html Producer can be any video-generating device, such as a • security camera, • a body-worn camera, • a smartphone camera, or a • Dashboard camera. • A producer can also send non-video data, such as audio feeds, images, or RADAR data. A single producer can generate one or more video streams. For example, a video camera can push video data to one Kinesis video stream and audio data to another. Kinesis Video Streams Producer libraries • Install and configure on your devices. • Securely connect and reliably stream video in different ways, • including in real time, after buffering it for a few seconds, • or as after-the-fact media uploads. 34
  • 35. @arafkarsh arafkarsh Kinesis Video Stream End Points Examples: Sending Data to Kinesis Video Streams • Example: Kinesis Video Streams Producer SDK GStreamer Plugin: Shows how to build the Kinesis Video Streams Producer SDK to use as a GStreamer destination. • Run the GStreamer Element in a Docker Container: Shows how to use a pre-built Docker image for sending RTSP video from an IP camera to Kinesis Video Streams. • Example: Streaming from an RTSP Source: Shows how to build your own Docker image and send RTSP video from an IP camera to Kinesis Video Streams. • Example: Sending Data to Kinesis Video Streams Using the PutMedia API: Shows how to use the Using the Java Producer Library to send data to Kinesis Video Streams that is already in a container format Matroska (MKV) using the PutMedia API. GStreamer is a popular media framework used by a multitude of cameras and video sources to create custom media pipelines by combining modular plugins. • RTSP Camera on Ubuntu • USB Camera on Ubuntu • Camera on Raspberry Pi Source: https://docs.aws.amazon.com/kine sisvideostreams/latest/dg/examples -gstreamer-plugin.html 35
  • 36. @arafkarsh arafkarsh Kinesis Video Stream Kinesis video stream • Transport live video data, optionally store it • Data available for consumption both in real time and on a batch or ad hoc basis. • A Kinesis video stream has only one producer publishing data into it. The stream can carry • audio, • video, and • similar time-encoded data streams, such as • depth sensing feeds, • RADAR feeds, and more. Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html Kinesis Video Stream Consumer (App) • Gets data, such as fragments and frames, from a Kinesis video stream • To view, process, or analyse it. Kinesis Video Stream Parser Library • To reliably get media from Kinesis video streams in a low-latency manner. • It parses the frame boundaries in the media so that applications can focus on processing and analysing the frames themselves. 36
  • 37. @arafkarsh arafkarsh Kinesis Video Stream Parser Library • StreamingMkvReader: This class reads specified MKV elements from a video stream. • FragmentMetadataVisitor: This class retrieves metadata for fragments (media elements) and tracks (individual data streams containing media information, such as audio or subtitles). • OutputSegmentMerger: This class merges consecutive fragments or chunks in a video stream. • KinesisVideoExample: This is a sample application that shows how to use the Kinesis Video Stream Parser Library. The library also includes tests that show how the tools are used. 37
  • 38. @arafkarsh arafkarsh Kinesis Data Firehose 38 2 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 39. @arafkarsh arafkarsh Kinesis Data Firehose Store Data • AWS S3 • AWS Redshift • AWS Elastic Search • Splunk • Kinesis Data Firehose is to Store the Streaming data into Data Stores, Lakes etc. • Firehose is used to Capture, Transform & Load Data into S3, Redshift etc. Kinesis Data Stream Kinesis Data Firehose Data Transformation using Lambda 39
  • 40. @arafkarsh arafkarsh Kinesis Data Firehose – Transformation Lambda 40 recordId • The record ID is passed from Kinesis Data Firehose to Lambda during the invocation. • The transformed record must contain the same record ID. • Any mismatch between the ID of the original record and the ID of the transformed record is treated as a data transformation failure. result The status of the data transformation of the record. The possible values are: • Ok (the record was transformed successfully), • Dropped (the record was dropped intentionally by your processing logic), and • ProcessingFailed (the record could not be transformed). If a record has a status of Ok or Dropped, Kinesis Data Firehose considers it successfully processed. Otherwise, Kinesis Data Firehose considers it unsuccessfully processed. data The transformed data payload, after base64-encoding. Source: https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
  • 41. @arafkarsh arafkarsh Kinesis Firehose Lambda Config 41 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart You can Control the Stream • Batch Size • Batch Window in Seconds • Max Retry
  • 43. @arafkarsh arafkarsh Kinesis Data Firehose – S3 43
  • 44. @arafkarsh arafkarsh Kinesis Data Firehose – S3 44
  • 45. @arafkarsh arafkarsh Kinesis Data Firehose – Direct Input 45
  • 46. @arafkarsh arafkarsh Kinesis Data Analytics • K 46 Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
  • 47. @arafkarsh arafkarsh Kinesis Data Analytics • Kinesis Data Analytics is used to Analyze the Streaming Data • Reduces the complexity in building and deploying Analytics Applications • Provides built-in Functions to Filter, Aggregate & Transform Streaming Data • Serverless Architecture • Under the hood its Apache Flink (v1.13) – December 2021 INPUT Kinesis Data Stream Kinesis Data Analytics OUTPUT Kinesis Data Stream 47
  • 48. @arafkarsh arafkarsh Kinesis Data Analytics – Architecture (Flink) AWS Cloud Kinesis Data Analytics Elastic Kubernetes Service Job Manager Task Manager Task Manager Task Manager S3 Bucket Auto Scaling Zookeeper Cloud Watch Cloud Watch Logs Flink Web UI 48
  • 51. @arafkarsh arafkarsh Apache Flink Open-Source Stream Processing Framework 51 3
  • 52. @arafkarsh arafkarsh Apache Flink Ease of Programming Stateful Processing High Performance Strong Data Integrity Flexible APIs for Programming Low Latency & Horizontally Scalable Stores Application States Exactly Once Processing & Consistent State 52 Is an Open-Source Stream Processing Framework
  • 53. @arafkarsh arafkarsh What is Apache Flink 53 Stateful Computations over Data Streams Batch Processing Process Static & historic Data Data Stream Processing Realtime Results from Data Streams Event Driven Applications Data Driven Actions and Services Instead of Spark + Hadoop
  • 54. @arafkarsh arafkarsh Use Case: Periodic ETL vs Streaming CTL 54 Traditional Periodic ETL • External Tool Periodically triggers ETL Batch Job Batch Processing Process Static & historic Data Data Stream Processing Realtime Results from Data Streams Continuous Streaming Data Pipeline • Ingestion with Low Latency • No Artificial Boundaries Streaming App Ingest Append Real Time Events Event Logs Batch Process Module Read Write Transactional Data Extract, Transform, Load Capture, Transform, Load State Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
  • 55. @arafkarsh arafkarsh Use Case: Data Analytics 55 • Great for Ad-Hoc Queries • Queries changes faster than data Batch Analytics Stream Analytics Ingest K-V Data Store Real Time Events Batch Analytics Read Write Recorded Events • High Performance Low Latency Result • Data Changes faster than Queries Analytics App State State Update Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
  • 56. @arafkarsh arafkarsh Use Case: Event Driven Application 56 • Compute & Data Tier Architecture • React to Process Events • State is stored in (Remote) Database Traditional Application Design Event Driven Application • High Performance Low Latency Result • Data Changes faster than Queries Application Read Write Events Trigger Action Ingest Real Time Events Application State Append Periodically write asynchronous checkpoints in Remote Database Event Logs Event Logs Trigger Action Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
  • 57. @arafkarsh arafkarsh Apache Flink Use Case Features • Business, Operational, Technical App Metrics • User Experience Metrics Real-time Analytics • Transform, Filter, Aggregate Streaming Data • IoT and Application Log Analysis Streaming ETL Applications • Trigger Conditions and External Notifications • Detecting Patterns / Anomaly Stateful Event Processing 57
  • 58. @arafkarsh arafkarsh Apache Flink Architecture • Architecture • Anatomy of the Flink Cluster • Tasks, Slots & Operator Chains • Anatomy of a Flink Program • Flink API & Operators 58
  • 60. @arafkarsh arafkarsh Deployment Model 60 Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/ The Job Manager distributes the work onto theTask Managers, where the actual operators such as 1. sources, 2. transformations and 3. sinks are running. Job Manager is the name of the central work coordination component of Flink. Task Managers are the services actually performing the work of a Flink job.
  • 61. @arafkarsh arafkarsh Anatomy of the Flink Cluster 61 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/ Job Manager: Resource Manager It is responsible for resource de-/allocation and provisioning in a Flink cluster — it manages task slots, which are the unit of resource scheduling in a Flink cluster. Dispatcher It provides a REST interface to submit Flink applications for execution and starts a new Job Master for each submitted job. Job Master It is responsible for managing the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, each having its Job Master.
  • 62. @arafkarsh arafkarsh Job Manager HA 62 Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/overview/ Flink ships with two high availability service implementations: • ZooKeeper: ZooKeeper HA services can be used with every Flink cluster deployment.They require a running ZooKeeper quorum. • Kubernetes: Kubernetes HA services only work when running on Kubernetes. Flink’s high availability services encapsulate the required services to make everything work: • Leader election: Selecting a single leader out of a pool of n candidates • Service discovery: Retrieving the address of the current leader • State persistence: Persisting state which is required for the successor to resume the job execution (Job Graphs, user code jars, completed checkpoints
  • 63. @arafkarsh arafkarsh Deployment Modes 63 Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
  • 64. @arafkarsh arafkarsh Task & Operator Chains 64 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/ • For distributed execution, Flink chains operator subtasks together into tasks • Each task is executed by one thread. • Chaining operators together into tasks is a useful optimization: • it reduces the overhead of thread-to-thread handover and buffering, • and increases overall throughput while decreasing latency. T1 T2 T3 T4 T5
  • 65. @arafkarsh arafkarsh Task Slots 65 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/ • Each worker (Task Manager) is a JVM process and may execute one or more subtasks in separate threads. • To control how many tasks a Task Manager accepts, it has so called task slots (at least one). • Memory is divided equally across the slots. • No CPU isolation across task slot. • Having multiple slots means more subtasks share the same JVM. • Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. • They may also share data sets and data structures, thus reducing the per-task overhead. • Flink allows subtasks to share slots even if they are subtasks of different tasks, so long as they are from the same job.
  • 66. @arafkarsh arafkarsh Anatomy of a Flink Program 66 1. Obtain an execution environment, 2. Load/create the initial data, 3. Specify transformations on this data, 4. Specify where to put the results of your computations, 5. Trigger the program execution. Will be triggered on your local machine or submit your program for execution on a cluster. Source Transform Transform Sink 1 2 3 5 4 Each program consists of the same basic parts: Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/#anatomy-of-a-flink-program
  • 67. @arafkarsh arafkarsh External Components 67 Feature Description Implementation 1 High Availability Service Provider Flink's Job Manager can be run in high availability mode which allows Flink to recover from Job Manager faults. In order to failover faster, multiple standby Job Managers can be started to act as backups. • Zookeeper • Kubernetes HA 2 File Storage and Persistency For checkpointing (recovery mechanism for streaming jobs) Flink relies on external file storage systems See FileSystems page. 3 Resource Provider Flink can be deployed through different Resource Provider Frameworks, such as Kubernetes, YARN or Mesos. • Kubernetes • YARN • Mesos 4 Metrics Storage Flink components report internal metrics and Flink jobs can report additional, job specific metrics as well. See Metrics Reporter page. 5 Application-level data sources and sinks While application-level data sources and sinks are not technically part of the deployment of Flink cluster components, they should be considered when planning a new Flink production deployment. Colocating frequently used data with Flink can have significant performance benefits For example: • Apache Kafka • Amazon S3 • Amazon Kinesis • Elastic Search See Connectors page. Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
  • 69. @arafkarsh arafkarsh Flink API 69 Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/overview/
  • 70. @arafkarsh arafkarsh Apache Flink DataStream API • Data Source • Operators • Data Sink • Generating Watermarks 70
  • 71. @arafkarsh arafkarsh DataStream 71 • A DataStream is similar to a regular Java Collection in terms of usage but is quite different in some keyways. • They are immutable, meaning that once they are created you cannot add or remove elements. • You can also not simply inspect the elements inside but only work on them using the DataStream API operations, which are also called transformations. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/ Reading from Socket
  • 72. @arafkarsh arafkarsh Data Sources 72 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/ File-based: • readTextFile(path) - Reads text files, i.e. files that respect the TextInputFormat specification, line-by- line and returns them as Strings. • readFile(fileInputFormat, path) - Reads (once) files as dictated by the specified file input format. • readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method called internally by the two previous ones. It reads files in the path based on the given fileInputFormat. Depending on the provided watchType, this source may periodically monitor (every interval ms) the path for new data (FileProcessingMode.PROCESS_CONTINUOUSLY), or process once the data currently in the path and exit (FileProcessingMode.PROCESS_ONCE). Using the pathFilter, the user can further exclude files from being processed.
  • 73. @arafkarsh arafkarsh Data Sources 73 Socket-based: • socketTextStream - Reads from a socket. Elements can be separated by a delimiter. Collection-based: • fromCollection(Collection) - Creates a data stream from the Java Java.util.Collection. All elements in the collection must be of the same type. • fromCollection(Iterator, Class) - Creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator. • fromElements(T ...) - Creates a data stream from the given sequence of objects. All objects must be of the same type. • fromParallelCollection(SplittableIterator, Class) - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator. • generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel. Custom: • addSource - Attach a new source function. For example, to read from Apache Kafka you can use addSource(new FlinkKafkaConsumer<>(...)). See connectors for more details. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
  • 74. @arafkarsh arafkarsh Data Sources: Custom Connectors 74 1. Apache Kafka (source/sink) 2. Apache Cassandra (sink) 3. Amazon Kinesis Streams (source/sink) 4. Elasticsearch (sink) 5. FileSystem (Hadoop included) - Streaming only sink (sink) 6. FileSystem (Hadoop included) - Streaming and Batch sink (sink) 7. [FileSystem (Hadoop included) - Batch source] (//nightlies.apache.org/flink/flink-docs- release1.14/docs/connectors/datastream/formats/) (source) 8. RabbitMQ (source/sink) 9. Google PubSub (source/sink) 10. Hybrid Source (source) 11. Apache NiFi (source/sink) 12. Apache Pulsar (source) 13. Twitter Streaming API (source) 14. JDBC (sink) Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/ Bundled Connectors 1. Apache ActiveMQ (source/sink) 2. Apache Flume (sink) 3. Redis (sink) 4. Akka (sink) 5. Netty (source) Apache Bahir
  • 75. @arafkarsh arafkarsh Data Sink 75 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/ • writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element. • writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects. • print() / printToErr() - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output. • writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion. • writeToSocket - Writes elements to a socket according to a SerializationSchema • addSink - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions. Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them
  • 76. @arafkarsh arafkarsh Execution Mode – Batch / Streaming 76 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/execution_mode/ The execution mode can be configured via the execution.runtime-mode setting. There are three possible values: 1. STREAMING: The classic DataStream execution mode (default) 2. BATCH: Batch-style execution on the DataStream API 3. AUTOMATIC: Let the system decide based on the boundedness of the sources • The BATCH execution mode can only be used for Jobs/Flink Programs that are bounded. • Boundedness is a property of a data source that tells us whether all the input coming from that source is known before execution or whether new data will show up, potentially indefinitely. • A job, in turn, is bounded if all its sources are bounded, and unbounded otherwise. • STREAMING execution mode, on the other hand, can be used for both bounded and unbounded jobs. • As a rule of thumb, you should be using BATCH execution mode when your program is bounded because this will be more efficient.
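As a sketch, the mode can be set either on the command line (which the documentation recommends, since it keeps the program itself mode-agnostic) or programmatically:

// On the command line when submitting the job:
//   bin/flink run -Dexecution.runtime-mode=BATCH <jarFile>

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Programmatic alternative; hard-codes the mode into the job
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements(1, 2, 3).map(x -> x * 2).print();
        env.execute("Runtime Mode Example");
    }
}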
• 77. @arafkarsh arafkarsh Stream Processing: Operators 77 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Map: Takes one element and produces one element. Example: a map function that doubles the values of the input stream.
FlatMap: Takes one element and produces zero, one, or more elements. Example: a flatMap function that splits sentences into words.
Filter: Evaluates a Boolean function for each element and retains those for which the function returns true.
KeyBy: Logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.
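The sketch below chains these four operators in the Java API; it is a minimal word-count variant written for this document, not code from the original deck:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BasicOperators {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Map: one element in, one element out (doubles each value)
        env.fromElements(1, 2, 3).map(x -> x * 2).print();

        env.fromElements("to be or not to be")
            // FlatMap: one line in, zero or more words out
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split("\\s+")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint needed for lambdas
            // Filter: keep only elements passing the predicate
            .filter(t -> !t.f0.isEmpty())
            // KeyBy: partition by word; the same key always lands in the same partition
            .keyBy(t -> t.f0)
            // Rolling sum of the count field per key
            .sum(1)
            .print();

        env.execute("Basic Operators");
    }
}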
• 78. @arafkarsh arafkarsh Stream Processing: Operators 78 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Reduce: A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
Union: Union of two or more data streams creating a new stream containing all the elements from all the streams.
Join: Join two data streams on a given key and a common window.
Interval Join: Join two elements e1 and e2 of two keyed streams with a common key over a given time interval, so that e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound
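A minimal sketch of union plus a rolling keyed reduce (a running sum per key); the sample data is invented for illustration:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReduceUnion {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> a = env.fromElements(Tuple2.of("x", 1L), Tuple2.of("x", 2L));
        DataStream<Tuple2<String, Long>> b = env.fromElements(Tuple2.of("y", 3L));

        // Union: one stream containing the elements of both inputs
        DataStream<Tuple2<String, Long>> all = a.union(b);

        // Rolling reduce on a keyed stream: emits a running sum per key
        all.keyBy(t -> t.f0)
           .reduce((v1, v2) -> Tuple2.of(v1.f0, v1.f1 + v2.f1))
           .print();

        env.execute("Reduce and Union");
    }
}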
• 79. @arafkarsh arafkarsh Stream Processing: Operators 79 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/
Window All: Windows can be defined on regular (non-keyed) DataStreams. Windows group all the stream events according to some characteristic.
Window Apply: Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.
Window Reduce: Applies a functional reduce function to the window and returns the reduced value.
Window: Windows can be defined on already partitioned Keyed Streams. Windows group the data in each key according to some characteristic.
• 80. @arafkarsh arafkarsh Watermarks 80
1. Watermarks are provided by the Data Source of the Application
2. They are part of the Stream and carry a timestamp
3. A Watermark asserts that all earlier events have probably arrived
• Watermark w9 asserts that all the events with time < w9 have arrived.
• Watermark w15 asserts that all the events with time < w15 have arrived.
[Figure: an event stream annotated with event timestamps, watermarks w5, w9, w15, w21, and late events]
• 81. @arafkarsh arafkarsh Goal: Count Events in 10 Seconds Windows 81
[Figure: the same event stream bucketed into 0-10s, 11-20s, and 21-30s windows; watermarks w9, w15, w21 fire the event-time timers, and late events [4] and [13] produce updated results (R1, then R2)]
• 82. @arafkarsh arafkarsh Allowed Lateness 82
• Once a window fires, its state is freed and all late events are dropped.
• You can avoid dropping late events by configuring the maximum time to wait for them.
• With sufficient allowed lateness, events [4] and [13] are added to their respective windows and an updated result (R2) is emitted.
stream.window(<window assigner>).allowedLateness(<timer>)
• 83. @arafkarsh arafkarsh Timers 83
Explicit
o TimerService timerService = context.timerService();
o timerService.registerEventTimeTimer(event.timestamp); // Time in Millis
o timerService.registerProcessingTimeTimer(event.timestamp); // Time in Millis
Implicit
o stream.window(TumblingEventTimeWindows.of(Time.seconds(7)))
o stream.window(TumblingProcessingTimeWindows.of(Time.seconds(7)))
o SELECT user, SUM(amount) FROM Orders GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), user
Source: Streaming Concepts & Introduction – Feb 1, 2021: https://www.youtube.com/watch?v=QVDJFZVHZ3c
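A sketch of the explicit style: a KeyedProcessFunction that registers an event-time timer 10 seconds after each element's timestamp and reacts when the watermark passes it. The (key, timestamp) tuple layout is an assumption for the example.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a message 10 seconds (event time) after each element arrives for a key
public class DelayedAlert extends KeyedProcessFunction<String, Tuple2<String, Long>, String> {

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<String> out) {
        // Register an event-time timer relative to the element's own timestamp
        ctx.timerService().registerEventTimeTimer(event.f1 + 10_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Fires when the watermark passes the registered time
        out.collect("Timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}

// Usage: stream.keyBy(e -> e.f0).process(new DelayedAlert())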
• 84. @arafkarsh arafkarsh Watermarks – In Order Events 84
Watermarks:
• Measure progress in event time.
• They flow as part of the data stream and carry a timestamp t.
• A Watermark(t) declares that event time has reached time t in that stream,
• meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark).
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
  • 85. @arafkarsh arafkarsh Watermarks – Out of Order Events 85 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/ • A watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. • Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
  • 86. @arafkarsh arafkarsh Watermarks – in Parallel Streams 86 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/ • Watermarks are generated at, or directly after, source functions. • Each parallel subtask of a source function usually generates its watermarks independently. • These watermarks define the event time at that particular parallel source.
• 87. @arafkarsh arafkarsh Generating Watermarks 87
In order to work with event time, Flink needs to know each event's timestamp, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element by using a Timestamp Assigner.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
Specifying a Timestamp Assigner is optional and, in most cases, you don't actually want to specify one. For example, when using Kafka or Kinesis you would get timestamps directly from the Kafka/Kinesis records.
Idle Input Source: If one of the input splits/partitions/shards does not carry events for a while, the Watermark Generator also does not get any new information on which to base a watermark. To deal with this, you can use a Watermark Strategy that will detect idleness and mark an input as idle.
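A sketch of a typical strategy, assuming a hypothetical MyEvent POJO carrying its own epoch-millis timestamp: bounded out-of-orderness watermarks, an explicit timestamp assigner, and idleness detection for quiet partitions.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class WatermarkExample {
    // Hypothetical event type with an embedded epoch-millis timestamp
    public static class MyEvent {
        public String key;
        public long timestamp;
    }

    public static DataStream<MyEvent> withWatermarks(DataStream<MyEvent> raw) {
        WatermarkStrategy<MyEvent> strategy =
            WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                // Timestamp assigner: extract event time from the element
                .withTimestampAssigner((event, recordTimestamp) -> event.timestamp)
                // Mark a partition idle after 1 minute without events
                .withIdleness(Duration.ofMinutes(1));
        return raw.assignTimestampsAndWatermarks(strategy);
    }
}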
• 88. @arafkarsh arafkarsh Watermark Strategies 88 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
There are two places in Flink applications where a Watermark Strategy can be used:
1. directly on sources (RECOMMENDED), and
2. after non-source operations.
The first option is preferable because
• it allows sources to exploit knowledge about shards/partitions/splits in the watermarking logic,
• sources can usually then track watermarks at a finer level, and
• the overall watermark produced by a source will be more accurate.
The second option (setting a Watermark Strategy after arbitrary operations) should only be used if you cannot set a strategy directly on the source.
  • 89. @arafkarsh arafkarsh Periodic Watermark Generator 89 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ A periodic generator observes stream events and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).
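A sketch modeled on the generator in the Flink documentation: it tracks the highest timestamp seen and, on each periodic emit, issues a watermark that lags it by a fixed out-of-orderness bound. MyEvent is the hypothetical type from the earlier sketch.

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// Periodic generator: assumes events are at most 3.5 seconds out of order
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<WatermarkExample.MyEvent> {

    private final long maxOutOfOrderness = 3500; // milliseconds
    private long currentMaxTimestamp = Long.MIN_VALUE + 3500 + 1;

    @Override
    public void onEvent(WatermarkExample.MyEvent event, long eventTimestamp, WatermarkOutput output) {
        // Just track the largest timestamp seen so far
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Called regularly (pipeline.auto-watermark-interval); emit the watermark
        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
    }
}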
  • 90. @arafkarsh arafkarsh Punctuated Watermark Generator 90 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ A punctuated watermark generator will observe the stream of events and emit a watermark whenever it sees a special element that carries watermark information.
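A sketch of the punctuated style, following the same documentation pattern; hasWatermarkMarker() and getWatermarkTimestamp() are hypothetical accessors on the event type, standing in for whatever marker your events actually carry.

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// Hypothetical event type that can carry a watermark marker
class MarkerEvent {
    boolean hasWatermarkMarker() { return false; }   // placeholder logic
    long getWatermarkTimestamp() { return 0L; }      // placeholder logic
}

public class PunctuatedAssigner implements WatermarkGenerator<MarkerEvent> {

    @Override
    public void onEvent(MarkerEvent event, long eventTimestamp, WatermarkOutput output) {
        // Emit a watermark only when a special marker element shows up
        if (event.hasWatermarkMarker()) {
            output.emitWatermark(new Watermark(event.getWatermarkTimestamp()));
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Nothing to do: watermarks are emitted in reaction to marker events only
    }
}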
  • 91. @arafkarsh arafkarsh Watermark Summary 91 • Flink Supports different Types of Time • Event Time • Processing Time • With Event Time • Events can be out of Order • Expect Deterministic Results • Event time Applications are Responsible for • Providing Watermarks • Deciding how to handle late events • Streaming Applications must trade off Completeness for Latency • Can wait longer to have more complete information before acting • Can wait less to reduce latency • Watermarks are the mechanism for managing this trade off Source: https://www.youtube.com/watch?v=QVDJFZVHZ3c
  • 92. @arafkarsh arafkarsh Core Building Blocks • Event Time • Event Streams • State • Snapshots 92
• 93. @arafkarsh arafkarsh Flink Core Building Blocks 93
Event Streams: Real-time & hindsight
State: Complex business logic
Event Time: Consistency with out-of-order data & late data
Snapshots: Forking / versioning / time travel
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
• 94. @arafkarsh arafkarsh Flink API Architecture (v1.14) 94
[Figure: API layers: Table / SQL API on a Relational Planner, the DataStream API, and Stateful Functions, all on top of the Internal Streams API and the Runtime]
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
• 95. @arafkarsh arafkarsh 95 Event Time: Consistency with out-of-order data & late data
• 96. @arafkarsh arafkarsh Handling Time 96
[Figure: an event producer writes to a messaging layer (Kafka / Kinesis Data Streams, 3 partitions); a Flink data source reads the events and a Flink window operator processes them. Event Time is set at the producer, Broker Time at the messaging layer, Ingestion Time at the Flink source, and Processing Time at the window operator.]
Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/time/
• 97. @arafkarsh arafkarsh Handling Event Time 97
• Can ensure ordering of event time
• Increases latency for ordered event time
• Flink reconstructs the order
Event time: Event time is the time that each individual event occurred on its producing device.
Processing time: Processing time refers to the system time of the machine that is executing the respective operation.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
• 98. @arafkarsh arafkarsh Windows 98
• Windows are at the heart of processing infinite streams.
• Windows split the stream into "buckets" of finite size, over which we can apply computations.
• A window is created as soon as the first element that should belong to it arrives, and it is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness.
• Flink guarantees removal only for time-based windows.
• 2 categories of windows: Keyed keyBy(…) and Non-Keyed windowAll(…)
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
Types of Windows: 1. Time Windows 2. Count Windows
• 99. @arafkarsh arafkarsh Window (Sliding, Tumbling, Hopping, Session) 99
[Figure: illustrations of Sliding, Tumbling, Hopping, and Session windows]
Source: https://docs.microsoft.com/en-us/stream-analytics-query/windowing-azure-stream-analytics
• 100. @arafkarsh arafkarsh Window – Tumbling 100 Tumbling windows have a fixed size and do not overlap. • Without offsets, hourly tumbling windows are aligned with epoch, that is you will get windows such as • 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999 and so on. • With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999. • An important use case for offsets is to adjust windows to time zones other than UTC-0. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
• 101. @arafkarsh arafkarsh Window – Sliding 101 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ You could have windows of size 10 minutes that slide by 5 minutes. With this, you get every 5 minutes a window that contains the events that arrived during the last 10 minutes.
• 102. @arafkarsh arafkarsh Window – Session 102 • Session windows group elements by sessions of activity. • Session windows do not overlap and do not have a fixed start and end time. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
  • 103. @arafkarsh arafkarsh Window – Global 103 • This windowing scheme is only useful if you also specify a custom trigger. • Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
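The four assigners side by side, as a sketch; keyed stands for a hypothetical KeyedStream, and the sizes and gaps are arbitrary choices for illustration.

import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class WindowAssigners {
    public static void examples(KeyedStream<String, String> keyed) {
        // Tumbling: fixed 1-hour windows, offset by 15 minutes (e.g. time-zone alignment)
        keyed.window(TumblingEventTimeWindows.of(Time.hours(1), Time.minutes(15)));

        // Sliding: 10-minute windows evaluated every 5 minutes
        keyed.window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(5)));

        // Session: a window closes after a 30-minute gap without events for the key
        keyed.window(EventTimeSessionWindows.withGap(Time.minutes(30)));

        // Global: one never-ending window per key; only fires with a custom trigger
        keyed.window(GlobalWindows.create()).trigger(CountTrigger.of(100));
    }
}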
• 104. @arafkarsh arafkarsh Window Functions 104 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
The window function can be one of: Reduce Function, Aggregate Function, Process Window Function, or Process Window Function with Incremental Aggregation.
• Window functions specify the computation that needs to happen on the window.
• This is done when a window is ready for processing.
• Triggers are used to determine when the window is ready for computation.
The Reduce Function and Aggregate Function can be executed more efficiently because Flink can incrementally aggregate the elements for each window as they arrive. A Process Window Function gets an Iterable for all the elements contained in a window and additional meta information about the window to which the elements belong.
  • 105. @arafkarsh arafkarsh Reduce Function 105 A Reduce Function specifies how two elements from the input are combined to produce an output element of the same type. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
  • 106. @arafkarsh arafkarsh Aggregate Function 106 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ An Aggregate Function is a generalized version of a Reduce Function that has three types: 1. an input type (IN), 2. accumulator type (ACC), 3. and an output type (OUT). The input type is the type of elements in the input stream and the Aggregate Function has a method for adding one input element to an accumulator. The interface also has methods for 1. creating an initial accumulator, 2. for merging two accumulators into one accumulator and for 3. extracting an output (of type OUT) from an accumulator. Same as with Reduce Function, Flink will incrementally aggregate input elements of a window as they arrive.
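A sketch along the lines of the documentation's average example: IN is a (key, value) tuple, ACC a running (sum, count) pair, OUT the average.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// IN = (key, value), ACC = (sum, count), OUT = average
public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L);
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}

// Usage: keyed.window(TumblingEventTimeWindows.of(Time.seconds(10))).aggregate(new AverageAggregate())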
  • 107. @arafkarsh arafkarsh Process Function 107 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ A Process Window Function gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions.
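A sketch that counts the elements per window and reads window metadata from the Context; the String key and tuple element types are assumptions for the example.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Counts elements per window; has access to window metadata via the Context
public class CountPerWindow
        extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {

    @Override
    public void process(String key, Context ctx,
                        Iterable<Tuple2<String, Long>> elements, Collector<String> out) {
        long count = 0;
        for (Tuple2<String, Long> ignored : elements) {
            count++;
        }
        out.collect("key=" + key + " windowEnd=" + ctx.window().getEnd() + " count=" + count);
    }
}

// Usage: keyed.window(TumblingEventTimeWindows.of(Time.minutes(5))).process(new CountPerWindow())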
  • 108. @arafkarsh arafkarsh Process Function with Incremental Aggregation 108 A Process Window Function can be combined with either a Reduce Function, or an Aggregate Function to incrementally aggregate elements as they arrive in the window. When the window is closed, the Process Window Function will be provided with the aggregated result. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
• 109. @arafkarsh arafkarsh Trigger 109
• A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function.
• Each window assigner comes with a default Trigger.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
1. The onElement() method is called for each element that is added to a window.
2. The onEventTime() method is called when a registered event-time timer fires.
3. The onProcessingTime() method is called when a registered processing-time timer fires.
4. The onMerge() method is relevant for stateful triggers and merges the states of two triggers when their corresponding windows merge, e.g. when using session windows.
5. The clear() method performs any action needed upon removal of the corresponding window.
• 110. @arafkarsh arafkarsh Evictor 110
Flink's windowing model allows specifying an optional Evictor in addition to the Window Assigner and the Trigger. This can be done using the evictor(...) method. The evictor has the ability to remove elements from a window after the trigger fires and before and/or after the window function is applied.
Flink comes with three pre-implemented evictors. These are:
• Count Evictor: keeps up to a user-specified number of elements from the window and discards the remaining ones from the beginning of the window buffer.
• Delta Evictor: takes a Delta Function and a threshold, computes the delta between the last element in the window buffer and each of the remaining ones, and removes the ones with a delta greater than or equal to the threshold.
• Time Evictor: takes as argument an interval in milliseconds and, for a given window, finds the maximum timestamp max_ts among its elements and removes all the elements with timestamps smaller than max_ts - interval.
• By default, all the pre-implemented evictors apply their logic before the window function.
Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
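A sketch combining a trigger and an evictor: fire every 10 elements over a global window, but keep at most the last 100 elements per key. The class name, counts, and reduce logic are all illustrative choices.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class TriggerEvictorExample {
    public static DataStream<Tuple2<String, Long>> lastOfEveryTen(
            KeyedStream<Tuple2<String, Long>, String> keyed) {
        return keyed
            .window(GlobalWindows.create())
            .trigger(CountTrigger.of(10))    // evaluate the window every 10 elements
            .evictor(CountEvictor.of(100))   // retain at most the last 100 elements
            .reduce((a, b) -> b);            // e.g. emit the latest element
    }
}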
  • 111. @arafkarsh arafkarsh Handling Late Events 111 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/#allowed-lateness • By default, the allowed lateness is set to 0. • That is, elements that arrive behind the watermark will be dropped.
  • 112. @arafkarsh arafkarsh Late Events – Side Out 112 Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ Using Flink’s side output feature you can get a stream of the data that was discarded as late.
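A sketch putting allowed lateness and the side output together; the tag name, window size, and lateness bound are illustrative.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataExample {
    public static void handleLate(KeyedStream<Tuple2<String, Long>, String> keyed) {
        // Anonymous subclass so the element type is captured for serialization
        final OutputTag<Tuple2<String, Long>> lateTag =
                new OutputTag<Tuple2<String, Long>>("late-events") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> result = keyed
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .allowedLateness(Time.minutes(1))  // keep window state one extra minute
            .sideOutputLateData(lateTag)       // anything later still ends up here
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

        // Process the dropped-late stream separately, e.g. just print it
        result.getSideOutput(lateTag).print();
    }
}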
• 113. @arafkarsh arafkarsh 113 Event Streams (real-time & hindsight) and State (complex business logic)
• 114. @arafkarsh arafkarsh Streams & Batch Processing 114
• Processes "unbounded" (stream) and "bounded" (batch) data
• Processes recorded (offline) and live (real-time) data
• Batch is just a special case of streaming data
[Figure: an event log from the start of the stream (past) to now and into the future; bounded streams cover finite slices of the log, while unbounded streams continue indefinitely]
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
• 115. @arafkarsh arafkarsh Stateful Event & Stream Processing 115
[Figure: two streaming data flows: Source -> Transform -> Transform -> Sink, and Source -> Transform -> Window -> Sink]
• 116. @arafkarsh arafkarsh Stateful Event & Stream Processing 116
[Figure: a streaming data flow (Source -> Filter & Transform -> keyBy() -> Window State Read & Write -> Sink). Raw events are partitioned with keyBy(), each parallel task keeps scalable local state for high-performance in-memory computing, new aggregated events are emitted to the sink, and external storage holds the state snapshots.]
• 117. @arafkarsh arafkarsh 117 Snapshots: Forking / versioning / Time Travel
• 118. @arafkarsh arafkarsh Storage for States 118
External Storage (state kept outside the processor):
• Independent from processing
• Low performance due to remote storage
• Hard to get "exactly-once" guarantees
Internal Storage (state kept local to the processor, with snapshots):
• Highly consistent distributed snapshotting
• Faster access with local storage
• Stream processor needs to handle scaling and storage
• 119. @arafkarsh arafkarsh Checkpoint Barrier 119
[Figure: a checkpoint barrier flows through the streaming data flow (Source -> Filter & Transform -> keyBy() -> Window State Read & Write -> Sink). The Job Manager triggers the checkpoint via RPC; sources record their offsets in the message shards/partitions, operators snapshot their local state (aggregates, counts, etc.), and the snapshots are written to fault-tolerant storage such as HDFS, S3, or NFS.]
• 120. @arafkarsh arafkarsh Snapshot Alignment 120
[Figure: the same checkpointing data flow; an operator with multiple input channels aligns the checkpoint barriers, waiting until the barrier has arrived on all inputs before snapshotting its local state to fault-tolerant storage (HDFS, S3, NFS, ...)]
• 122. @arafkarsh arafkarsh Snapshots & Fault Tolerance 122
Recovery from a failure rolls back the computation:
• Reload the last state snapshot from fault-tolerant storage (HDFS, S3, NFS, ...)
• Reset the read positions in the input stream
• Re-process the records since the snapshot
[Figure: the streaming data flow with keyBy() partitions and local storage, restoring state from the snapshot store]
• 123. @arafkarsh arafkarsh Configure Checkpoints – Local Storage 123
HashMap State Backend
• Stores the state in memory (HashMap)
• Faster access with memory storage
• Subject to Garbage Collection
• Recommended for: jobs with large state, long windows, large key/value states; all high-availability setups
RocksDB State Backend
• Stores the state in local RocksDB (a key/value store)
• Limited only by local disk size
• Slower than memory storage (roughly 10x slower)
• Serializes on write and deserializes on read
• Recommended for: jobs with very large state, long windows, large key/value states; all high-availability setups
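A configuration sketch for the two backends; the checkpoint path is a placeholder, and the RocksDB backend needs the flink-statebackend-rocksdb dependency. The same choices can also be made in flink-conf.yaml via state.backend and state.checkpoints.dir.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000); // snapshot every 60 seconds

        // Option 1: state on the JVM heap (fast, bounded by memory, GC pressure)
        env.setStateBackend(new HashMapStateBackend());

        // Option 2: state in local RocksDB (very large state, serialized access)
        // env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Snapshots go to durable, fault-tolerant storage (placeholder path)
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
    }
}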
  • 124. @arafkarsh arafkarsh Integration & Comparisons • Integration • Comparison with Spark 124
  • 125. @arafkarsh arafkarsh Integrations 125 • Event Logs • Kafka, AWS Kinesis, Pulsar • File Systems • HDFS, NFS, S3, MapR FS… • Databases • JDBC, Hcatalog • Encodings • Avro, JSON, CSV, Parquet, ORC • Key Value Stores • Redis, Cassandra, Elastic Search
• 126. @arafkarsh arafkarsh Apache Flink Vs Apache Spark 126
Features (Flink vs Spark):
1. Developed in: Java (Flink) vs Scala (Spark)
2. Streaming Model: Windowing & Checkpoints vs Micro-batching
3. Real-Time Processing: Real-time vs Near real-time
4. Models: DataStream / Table SQL vs RDD
5. Performance: High vs Medium
6. Supported Languages: Java, Scala, Python, SQL vs Java, Scala, Python, R, SQL
7. SQL Analytics: Yes vs Yes
8. Runs on: Hadoop, Mesos, Kubernetes, AWS Kinesis, ... vs Hadoop, Mesos, Kubernetes, AWS EMR
9. Machine Learning: Yes (FlinkML) vs Yes
FlinkML: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/libs/ml/index.html
  • 127. @arafkarsh arafkarsh Flink Summary 127 1. Distributed and Fault Tolerant 2. Stateful, No DB Needed 3. Horizontally Scalable 4. Parallel Execution, No Concurrency Issues
• 128. @arafkarsh arafkarsh Case Studies 1. HP Ink Cartridge Manufacturing Process 2. Infor: Compliance Violation (Banking) 3. Biogen: Centralized Log Management 4. Viber: Massive Data Handling - 300K Msgs / Second 5. AWS: IoT Data using Firehose and Data Analytics 6. Nordstrom: Ledger with Multi Data Views 128 4
• 129. @arafkarsh arafkarsh HP: Ink Cartridge Manufacturing Process
• Data from the factory comes into Kinesis
• Using Lambdas, data is stored in DynamoDB (sequential ops)
• Firehose stores raw data in S3
• Enriched data is stored in Aurora, Elasticsearch, and S3
• Glue is used for batch processing
Source: https://www.youtube.com/watch?v=KM5ONS2fnG0 129
• 130. @arafkarsh arafkarsh Infor: Compliance Violation Realtime / Batch
• Security & transaction data is sent to Kinesis Data Stream
• Services in Fargate pick up the data from KDS and send it to Aurora & S3
• A scheduler (5) invokes a service to start EMR processing
• EMR fetches data from Aurora & S3 and sends it to EventBridge
• EventBridge (10) sends the data to SQS
• A service in Fargate picks up the data from SQS and sends out email
Source: https://www.youtube.com/watch?v=0gNMEyei-co 130
• 131. @arafkarsh arafkarsh Biogen: Centralized Log Management
• Application, network, and VPC logs are sent to Kinesis Firehose
• Firehose (4) sends the data to Lambda
• Lambda (5) enriches / normalizes the data and stores it in S3
• Lambda (7) picks up the data from S3 and stores it in Elasticsearch
• Kibana is used for data visualization
Source: https://www.youtube.com/watch?v=m8xtR3-ZQs8 131
• 132. @arafkarsh arafkarsh Viber: Massive Data Lakes 300k Msgs / Second
• From the Viber BE, events are batched and sent to Kinesis
• Using the KCL in Apache Storm, events are picked up from Kinesis, and using Firehose, events are stored in S3
• Aggregated data is sent to another Kinesis stream, and using a Lambda the event is sent to the Viber BE based on rules
Source: https://www.youtube.com/watch?v=7i1tj59pvYw EMR – Elastic Map Reduce 132
• 133. @arafkarsh arafkarsh Nordstrom: Ledger with Multi Data Views
• Customer data is stored in Kinesis Data Stream as raw data (the ledger)
• Firehose (4) stores the raw data in an S3 bucket
• Lambdas (5.1-5.3) transform and store the data in different databases in different formats for various read usages
Source: https://www.youtube.com/watch?v=O7PTtm_3Os4 133
• 134. @arafkarsh arafkarsh AWS: IoT Data – Firehose – Analytics – DynamoDB
• MQTT-based data from IoT devices is ingested
• Firehose stores the data in S3
• Kinesis Data Analytics gets the data from Firehose, analyzes it, and sends it back to Firehose to be stored in S3
• Using Lambda, the data is enriched and stored in DynamoDB
• Using a web-based app, the user gets the data from DynamoDB
Source: https://www.youtube.com/watch?v=uWUAcc68MWI 134
• 135. @arafkarsh arafkarsh 135 Design Patterns are solutions to general problems that software developers face during software development. Design Patterns
  • 136. @arafkarsh arafkarsh 136 DREAM | AUTOMATE | EMPOWER Araf Karsh Hamid : India: +91.999.545.8627 http://www.slideshare.net/arafkarsh https://www.linkedin.com/in/arafkarsh/ https://www.youtube.com/user/arafkarsh/playlists http://www.arafkarsh.com/ @arafkarsh arafkarsh
  • 137. @arafkarsh arafkarsh 137 Source Code: https://github.com/MetaArivu Web Site: https://metarivu.com/ https://pyxida.cloud/

Editor's notes

1. HLS – HTTP Live Streaming
2. IMPORTANT NOTES: If the watchType is set to FileProcessingMode.PROCESS_CONTINUOUSLY, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to all of its contents being re-processed. If the watchType is set to FileProcessingMode.PROCESS_ONCE, the source scans the path once and exits, without waiting for the readers to finish reading the file contents. Of course, the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.