https://www.alldaydevops.com/addo-speakers/timothy-spann
Timothy Spann
StreamNative
MODERN INFRASTRUCTURE
Session Name: FLiP Stack for Cloud Data Lakes
An all-Apache stack of Apache Flink, Apache Pulsar, and Apache NiFi for rapid data lake population and querying. We can quickly stream data to and from any data lake, lakehouse, database, or data mart, regardless of cloud or size. FLiP allows Java and Python developers to build scalable solutions that span messaging and streaming in a cloud-native fashion with full monitoring.
Speaker Bio:
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData, and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, Pulsar, NiFi, the blockchain, and Spark.
2. TRACK: MODERN INFRASTRUCTURE
Tim Spann, Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT and more.
○ Today, he helps grow the Pulsar community by sharing rich technical knowledge and experience at global conferences and through individual conversations.
3. TRACK: MODERN INFRASTRUCTURE
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar,
Apache NiFi, Apache Spark and open
source friends.
https://bit.ly/32dAJft
7. TRACK: MODERN INFRASTRUCTURE
Pulsar Benefits
Unified Messaging Model
Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the same cluster, either via access control, or in entirely different namespaces.
Scalability
Decoupled data computing and storage enable horizontal scaling to handle data scale and management complexity.
Geo-replication
Support for multi-datacenter replication with both asynchronous and synchronous replication for built-in disaster recovery.
Tiered storage
Enable historical data to be offloaded to cloud-native storage and store event streams for indefinite periods of time.
9. TRACK: MODERN INFRASTRUCTURE
Pulsar: Unified Messaging + Data Streaming
Messaging
Ideal for work queues that do not require tasks to be performed in a particular order, for example sending one email message to many recipients.
RabbitMQ and Amazon SQS are examples of popular queue-based message systems.
... and Streaming
Works best in situations where the order of messages is important, for example data ingestion.
Kafka and Amazon Kinesis are examples of messaging systems that use streaming semantics for consuming messages.
11. TRACK: MODERN INFRASTRUCTURE
Pulsar’s Publish-Subscribe model
[Diagram: Producer 1 and Producer 2 publish to a Topic on the Broker; a Subscription delivers the messages to Consumer 1, Consumer 2 and Consumer 3.]
● Producers send messages.
● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
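The roles listed above can be sketched with a small in-memory model. This is a teaching sketch only, not the Pulsar client API: the `Broker` and `Topic` classes, the `receive` method, and the topic and subscription names are all hypothetical. It shows one idea from the slide: a topic keeps messages in order, and each named subscription tracks its own position in that order.

```python
from collections import defaultdict


class Topic:
    """An ordered, named channel: messages are kept in publish order."""

    def __init__(self, name):
        self.name = name
        self.messages = []               # append-only, ordered log
        self.cursors = defaultdict(int)  # subscription name -> next index

    def publish(self, payload):
        """Producer side: append an arbitrary payload to the topic."""
        self.messages.append(payload)

    def receive(self, subscription):
        """Consumer side: return the next undelivered message for this
        subscription, or None when the subscription is caught up."""
        position = self.cursors[subscription]
        if position >= len(self.messages):
            return None
        self.cursors[subscription] = position + 1
        return self.messages[position]


class Broker:
    """Hands out topics by name so producers and consumers never meet."""

    def __init__(self):
        self.topics = {}

    def topic(self, name):
        if name not in self.topics:
            self.topics[name] = Topic(name)
        return self.topics[name]


broker = Broker()
sensors = broker.topic("sensors")      # hypothetical topic name
sensors.publish(b"reading-1")
sensors.publish(b"reading-2")

# Each subscription has its own cursor, so both see the first message.
print(sensors.receive("analytics"))
print(sensors.receive("archive"))
```

A real broker also handles acknowledgements, persistence and network connections; none of that is modeled here.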
12. TRACK: MODERN INFRASTRUCTURE
Pulsar Subscription Modes
Different subscription modes have different semantics:
● Exclusive/Failover - guaranteed order, single active consumer
● Shared - multiple active consumers, no order
● Key_Shared - multiple active consumers, order for given key
[Diagram: Producer 1 and Producer 2 publish to a Pulsar topic that has four subscriptions.]
● Subscription A (Exclusive): Consumer A.
● Subscription B (Failover): Consumer B-1 and Consumer B-2, with Consumer B-2 used in case of failure in Consumer B-1.
● Subscription C (Shared): Consumer C-1 and Consumer C-2. The messages are split as <K1,V10>, <K2,V21>, <K1,V12> and <K2,V20>, <K1,V11>, <K2,V22>, so keys are mixed across consumers.
● Subscription D (Key_Shared): Consumer D-1 and Consumer D-2. The messages are split as <K1,V10>, <K1,V11>, <K1,V12> and <K2,V20>, <K2,V21>, <K2,V22>, so each key stays with one consumer.
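The delivery semantics of the subscription modes can be illustrated with a small simulation. This is a sketch under stated assumptions, not Pulsar's dispatcher: the `dispatch` function and its `failed` parameter are hypothetical, round-robin stands in for Shared delivery, and a CRC32 hash of the key stands in for whatever key-to-consumer mapping the broker really uses.

```python
from collections import defaultdict
from itertools import cycle
from zlib import crc32


def dispatch(messages, consumers, mode, failed=()):
    """Return {consumer: [(key, value), ...]} for one subscription.

    Teaching simulation only; `failed` lists consumers that are down.
    """
    alive = [c for c in consumers if c not in failed]
    if not alive:
        raise RuntimeError("no consumer is available")
    delivered = defaultdict(list)
    if mode in ("Exclusive", "Failover"):
        # Single active consumer: every message, in order, goes to it.
        # With Failover the next consumer in line takes over on failure.
        delivered[alive[0]] = list(messages)
    elif mode == "Shared":
        # Round-robin across consumers: ordering per key is not preserved.
        for consumer, message in zip(cycle(alive), messages):
            delivered[consumer].append(message)
    elif mode == "Key_Shared":
        # Hash the key so all messages for a key reach the same consumer.
        for key, value in messages:
            index = crc32(key.encode()) % len(alive)
            delivered[alive[index]].append((key, value))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(delivered)


stream = [("K1", "V10"), ("K2", "V20"), ("K2", "V21"),
          ("K1", "V11"), ("K1", "V12"), ("K2", "V22")]

print(dispatch(stream, ["C-1", "C-2"], "Shared"))
print(dispatch(stream, ["D-1", "D-2"], "Key_Shared"))
print(dispatch(stream, ["B-1", "B-2"], "Failover", failed={"B-1"}))
```

With this input order, the Shared call gives C-1 and C-2 the same interleaved split shown for Subscription C. The Key_Shared call only guarantees that a key never appears on two consumers; which consumer owns which key depends on the hash.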
13. TRACK: MODERN INFRASTRUCTURE
Producer-Consumer
[Diagram: Producer, Topic, Consumer]
● Producer: the publisher sends data and doesn't know about the subscribers or their status.
● All interactions go through Pulsar, which handles all communication.
● Consumer: the subscriber receives data from the publisher and never directly interacts with it.
14. TRACK: MODERN INFRASTRUCTURE
Schema Registry
[Diagram: producers and consumers each keep a local cache for schemas and exchange schema ID + data; the Schema Registry holds schema-1, schema-2 and schema-3 (value=Avro/Protobuf/JSON).]
Producers
● Send (register) schema (if not in local cache).
● Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID.
Consumers
● Get schema by ID (if not in local cache).
● Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID.
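The register-once, look-up-once flow in the schema registry slide can be sketched in memory. Everything here is hypothetical and for illustration: the `SchemaRegistry`, `Producer` and `Consumer` classes, the dictionary used as a "schema", and the JSON encoding are stand-ins, not Pulsar's schema API or wire format. The point is the role of the local caches: each side contacts the registry only the first time it meets a schema.

```python
import json


class SchemaRegistry:
    """In-memory stand-in that maps schema IDs to schemas."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.lookups = 0  # counts get() calls, to show the cache working

    def register(self, schema):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        self.lookups += 1
        return self._by_id[schema_id]


class Producer:
    def __init__(self, registry):
        self.registry = registry
        self.cache = {}  # canonical schema -> schema ID

    def send(self, schema, record):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self.cache:  # register only on a cache miss
            self.cache[canonical] = self.registry.register(schema)
        return {"schema_id": self.cache[canonical],
                "data": json.dumps(record).encode()}


class Consumer:
    def __init__(self, registry):
        self.registry = registry
        self.cache = {}  # schema ID -> schema

    def read(self, message):
        schema_id = message["schema_id"]
        if schema_id not in self.cache:  # fetch only on a cache miss
            self.cache[schema_id] = self.registry.get(schema_id)
        record = json.loads(message["data"])
        missing = [f for f in self.cache[schema_id]["fields"] if f not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        return record


registry = SchemaRegistry()
producer, consumer = Producer(registry), Consumer(registry)
schema = {"name": "airquality", "fields": ["aqi", "reportingArea"]}  # made up

first = producer.send(schema, {"aqi": 42, "reportingArea": "Area A"})
second = producer.send(schema, {"aqi": 58, "reportingArea": "Area A"})
print(consumer.read(first), consumer.read(second), registry.lookups)
```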
17. TRACK: MODERN INFRASTRUCTURE
Why Apache NiFi?
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Over 300 sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control
23. TRACK: MODERN INFRASTRUCTURE
Apache NiFi Toolkit Setup
● Download NiFi Toolkit
● Copy keystore and truststore information from your NiFi conf/nifi.properties
● Create a nifi.properties file linked to the cli.sh
baseUrl=https://nvidia-desktop:8443
keystore=/home/nvidia/nvme/nifi-1.15.3/conf/keystore.p12
keystoreType=PKCS12
keystorePasswd=5325343412efaab3123c6892d93
keyPasswd=53134eee99da9dbe9349123aa17c6892d93
truststore=/home/nvidia/nvme/nifi-1.15.3/conf/truststore.p12
truststoreType=PKCS12
truststorePasswd=93498Dfdjfhujdhure8d8hfd84j3n43jd
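Before pointing cli.sh at a properties file like the one above, it can help to check that no entry was lost while copying values over. The following is a minimal sketch, not part of the NiFi Toolkit: `parse_properties` and `missing_keys` are hypothetical helpers, the list of expected keys is simply the set of keys shown on the slide, and the sample values are placeholders.

```python
# Keys that appear in the example properties file on the slide.
EXPECTED_KEYS = (
    "baseUrl", "keystore", "keystoreType", "keystorePasswd",
    "keyPasswd", "truststore", "truststoreType", "truststorePasswd",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not props.get(key)]


# Placeholder content; a real file would hold your own paths and passwords.
sample = """
baseUrl=https://localhost:8443
keystore=/path/to/keystore.p12
keystoreType=PKCS12
keystorePasswd=changeit
truststore=/path/to/truststore.p12
truststoreType=PKCS12
"""

print(missing_keys(parse_properties(sample)))
```

For the placeholder sample this reports the two entries that were left out, `keyPasswd` and `truststorePasswd`.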
24. TRACK: MODERN INFRASTRUCTURE
Why Apache Flink?
● Unified computing engine
● Batch processing is a special case of stream processing
● Stateful processing
● Massive Scalability
● Flink SQL for queries, inserts against Pulsar Topics
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
25. TRACK: MODERN INFRASTRUCTURE
SQL
select aqi, parameterName, dateObserved, hourObserved, latitude, longitude, localTimeZone, stateCode, reportingArea
from airquality;
select max(aqi) as MaxAQI, parameterName, reportingArea
from airquality
group by parameterName, reportingArea;
select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount, parameterName, reportingArea
from airquality
group by parameterName, reportingArea;
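To see what the last aggregate query returns without a Flink cluster, it can be run against a few rows in SQLite. This is only a stand-in: SQLite evaluates the query once over a fixed table, whereas Flink SQL would run it continuously over the stream. The table holds just the three columns the query needs, the rows are made-up values for illustration, and the `order by` clause is added here only to make the printed order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table airquality (aqi integer, parameterName text, reportingArea text)"
)

# Made-up sample rows, not real air quality readings.
rows = [
    (42, "O3", "Area A"),
    (58, "O3", "Area A"),
    (17, "PM2.5", "Area A"),
    (33, "PM2.5", "Area B"),
]
conn.executemany("insert into airquality values (?, ?, ?)", rows)

query = """
select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI,
       count(aqi) as RowCount, parameterName, reportingArea
from airquality
group by parameterName, reportingArea
order by parameterName, reportingArea
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

Each output row is one (parameterName, reportingArea) group; the two O3 readings for Area A collapse into a single row with a maximum of 58, a minimum of 42 and a count of 2.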
26. TRACK: MODERN INFRASTRUCTURE
[Architecture diagram: Apache Flink - Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores]
● Streaming Data Gateway protocols: Pulsar, KoP, MoP, Websocket, HTTP
● Unified Batch and Stream COMPUTING (Batch + Stream)
● Unified Batch and Stream STORAGE (Queuing + Streaming), with Offload to Tiered Storage
● Pulsar Sink, Micro Service, Data to Cloud Data Lake
● StreamNative Hub, StreamNative Cloud
29. TRACK: MODERN INFRASTRUCTURE
Metrics: Broker
Broker metrics are exposed under "/metrics" at port 8080. You can change the port by updating webServicePort in the broker.conf configuration file.
All the metrics exposed by a broker are labeled with cluster=${pulsar_cluster}, where ${pulsar_cluster} is the name of the Pulsar cluster as configured in the broker.conf file.
These metrics are available for brokers:
● Namespace metrics
○ Replication metrics
● Topic metrics
○ Replication metrics
● ManagedLedgerCache metrics
● ManagedLedger metrics
● LoadBalancing metrics
○ BundleUnloading metrics
○ BundleSplit metrics
● Subscription metrics
● Consumer metrics
● ManagedLedger bookie client metrics
For more information: https://pulsar.apache.org/docs/en/reference-metrics/#broker
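As a rough sketch of how the cluster label can be used, the snippet below builds the metrics URL from the port and path named above and parses a small sample of Prometheus-style text. The helper names `metrics_url` and `parse_metrics` are hypothetical, the sample lines and their values are illustrative rather than captured from a real broker, and the parser handles only simple `name{labels} value` lines.

```python
import re

# Illustrative sample, assuming Prometheus text exposition format.
SAMPLE = '''\
# TYPE pulsar_topics_count gauge
pulsar_topics_count{cluster="standalone",namespace="public/default"} 3
pulsar_rate_in{cluster="standalone",namespace="public/default"} 12.5
pulsar_rate_in{cluster="other",namespace="public/default"} 1
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def metrics_url(host, port=8080):
    """Broker metrics endpoint; 8080 is the default port named on the slide."""
    return f"http://{host}:{port}/metrics"


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def for_cluster(samples, cluster):
    """Keep only the samples labeled with the given cluster name."""
    return [s for s in samples if s[1].get("cluster") == cluster]


print(metrics_url("localhost"))
for name, labels, value in for_cluster(parse_metrics(SAMPLE), "standalone"):
    print(name, labels["namespace"], value)
```

In practice the text would come from an HTTP GET against the URL, and a Prometheus server would normally do the scraping and label filtering for you.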
30. TRACK: MODERN INFRASTRUCTURE
Let’s Keep in Touch!
Tim Spann
Developer Advocate
@PaaSDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw