All Day DevOps - FLiP Stack for Cloud Data Lakes

TRACK: MODERN INFRASTRUCTURE
NOVEMBER 10, 2022
Timothy Spann, StreamNative

Tim Spann, Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink,
Spark, NiFi, Big Data, Cloud, MXNet, IoT and more.
○ Today, he helps grow the Pulsar community by sharing technical knowledge
and experience both at global conferences and through individual conversations.

FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft

Apache Pulsar
Serverless computing framework.
Unbounded storage, multi-tiered architecture, and tiered storage.
Streaming & Pub/Sub messaging semantics.
Multi-protocol support.
Open Source
Cloud-Native

Why Apache Pulsar?
Unified Messaging Platform
Guaranteed Message Delivery
Resiliency
Infinite Scalability

Pulsar Benefits
Unified Messaging Model
Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the same cluster, either via access control or in entirely different namespaces.
Scalability
Decoupled data computing and storage enable horizontal scaling to handle data scale and management complexity.
Geo-replication
Support for multi-datacenter replication with both asynchronous and synchronous replication for built-in disaster recovery.
Tiered storage
Enable historical data to be offloaded to cloud-native storage and store event streams for indefinite periods of time.

Pulsar: Unified Messaging + Data Streaming
Messaging
Ideal for work queues that do not require tasks to be performed in a particular order, for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems.
... and Streaming
Works best in situations where the order of messages is important, for example, data ingestion. Kafka and Amazon Kinesis are examples of messaging systems that use streaming semantics for consuming messages.

Pulsar Cluster
[Figure: a Pulsar cluster divided into tenants (Compliance, Data Services, Marketing). Each tenant contains namespaces (Microservices, ETL, Campaigns, Risk Assessment), and each namespace contains topics (Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection).]
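
The tenant / namespace / topic hierarchy is visible in topic names such as persistent://conf/ete/first, which appears in the monitoring commands later in this deck. The following is a minimal Python sketch of splitting such a name into its parts, assuming the usual domain://tenant/namespace/topic form. The TopicName type and the parse_topic helper are hypothetical names used for illustration and are not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


t = parse_topic("persistent://conf/ete/first")
print(t.tenant, t.namespace, t.topic)
```

Here "conf" is the tenant, "ete" the namespace and "first" the topic, which is also the order in which the cleanup commands at the end of the deck delete them in reverse.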

Pulsar’s Publish-Subscribe model
[Figure: Producer 1 and Producer 2 publish to a topic on the broker; a subscription on that topic delivers messages to Consumer 1, Consumer 2 and Consumer 3.]
● Producers send messages.
● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.

Pulsar Subscription Modes
Different subscription modes have different semantics:
Exclusive/Failover - guaranteed order, single active consumer
Shared - multiple active consumers, no order
Key_Shared - multiple active consumers, order for given key
[Figure: Producer 1 and Producer 2 publish to a Pulsar topic that has four subscriptions.
Subscription A (Exclusive): Consumer A.
Subscription B (Failover): Consumer B-1, with Consumer B-2 taking over in case of failure in Consumer B-1.
Subscription C (Shared): Consumers C-1 and C-2; the messages <K1,V10>, <K2,V21>, <K1,V12>, <K2,V20>, <K1,V11>, <K2,V22> are spread across both consumers regardless of key.
Subscription D (Key_Shared): Consumers D-1 and D-2; <K1,V10>, <K1,V11>, <K1,V12> are grouped together and <K2,V20>, <K2,V21>, <K2,V22> are grouped together.]
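
To make the difference between the modes concrete, here is a toy Python model of how one subscription might hand messages to its consumers. It is a sketch under simplifying assumptions, not Pulsar's implementation: Shared is modeled as plain round-robin, and Key_Shared assigns each new key to the next consumer in turn, whereas a real broker uses key hashing. The dispatch function and the consumer names c1 and c2 are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for one subscription (toy model)."""
    received = defaultdict(list)
    if mode in ("Exclusive", "Failover"):
        # Single active consumer: everything goes to the first one, in order.
        received[consumers[0]] = list(messages)
    elif mode == "Shared":
        # Round-robin across consumers; per-key order is not preserved.
        for msg, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(msg)
    elif mode == "Key_Shared":
        # The same key always lands on the same consumer, so per-key order holds.
        owner = {}
        for key, value in messages:
            if key not in owner:
                owner[key] = consumers[len(owner) % len(consumers)]
            received[owner[key]].append((key, value))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(received)


msgs = [("K1", "V10"), ("K1", "V11"), ("K2", "V20"),
        ("K1", "V12"), ("K2", "V21"), ("K2", "V22")]

for mode in ("Exclusive", "Shared", "Key_Shared"):
    print(mode, dispatch(msgs, ["c1", "c2"], mode))
```

In this model the Shared run splits the K1 messages between c1 and c2, while the Key_Shared run keeps all K1 messages on one consumer and all K2 messages on the other.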

Producer-Consumer
[Figure: Producer -> Topic -> Consumer]
Producer: the publisher sends data and doesn't know about the subscribers or their status.
Pulsar: all interactions go through Pulsar, and it handles all communication.
Consumer: the subscriber receives data from the publisher and never interacts with it directly.

Schema Registry
[Figure: the Schema Registry holds schema-1 (value=Avro/Protobuf/JSON), schema-2 and schema-3. Producers and consumers each keep a local cache of schemas, and messages carry the schema ID together with the data.]
Producers
● Send (register) schema (if not in local cache).
● Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID.
Consumers
● Get schema by ID (if not in local cache).
● Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID.
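
The register-once, look-up-once flow on this slide can be sketched in a few lines of Python. This is an in-memory illustration only: the SchemaRegistry, Producer and Consumer classes are hypothetical stand-ins, the schema is just the string "schema-1", and JSON is used as the payload encoding for simplicity. It does not reflect the actual Pulsar client or broker code.

```python
import json


class SchemaRegistry:
    """Toy in-memory registry: assigns an ID to each distinct schema."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.lookups = 0  # counts "get schema by ID" calls

    def register(self, schema: str) -> int:
        if schema not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[schema] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[schema]

    def get(self, schema_id: int) -> str:
        self.lookups += 1
        return self._by_id[schema_id]


class Producer:
    def __init__(self, registry):
        self.registry, self.cache = registry, {}

    def send(self, schema: str, record: dict) -> tuple:
        # Register only if the schema is not in the local cache.
        if schema not in self.cache:
            self.cache[schema] = self.registry.register(schema)
        return self.cache[schema], json.dumps(record).encode()


class Consumer:
    def __init__(self, registry):
        self.registry, self.cache = registry, {}

    def read(self, message: tuple) -> tuple:
        schema_id, payload = message
        # Fetch the schema by ID only if it is not in the local cache.
        if schema_id not in self.cache:
            self.cache[schema_id] = self.registry.get(schema_id)
        return self.cache[schema_id], json.loads(payload)


registry = SchemaRegistry()
producer, consumer = Producer(registry), Consumer(registry)
for aqi in (41, 57):
    print(consumer.read(producer.send("schema-1", {"aqi": aqi})))
print("registry lookups:", registry.lookups)
```

Two messages are read but the registry is consulted only once, because the second read is served from the consumer's local cache.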

Use Pulsar to Stream to Lakehouses

Use Pulsar to Stream from Lakehouses

Why Apache NiFi?
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Over 300 sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control

Apache NiFi Pulsar Connector
https://github.com/david-streamlio/pulsar-nifi-bundle

Apache NiFi - Data Lineage / Provenance

Apache NiFi DevOps
https://www.datainmotion.dev/2021/01/automating-starting-services-in-apache.html
https://nipyapi.readthedocs.io/en/latest/
nifi-toolkit/bin/cli.sh nifi list-param-contexts -u http://edge2ai-1.dim.local:8080
nifi-toolkit/bin/cli.sh nifi pg-list -u http://edge2ai-1.dim.local:8080
nifi-toolkit/bin/cli.sh nifi pg-set-param-context -u http://edge2ai-1.dim.local:8080 ...

Apache NiFi DevOps
https://dev.to/tspannhw/automating-starting-services-in-apache-nifi-and-applying-parameters-5h4n
https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh
nifi pg-list
nifi pg-status
nifi pg-get-services
nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root
nifi pg-start -u http://edge2ai-1.dim.local:8080 -pgid LOOKTHISUP
nifi list-param-contexts -u http://edge2ai-1.dim.local:8080 -verbose
nifi create-reporting-task -u http://edge2ai-1.dim.local:8080 -verbose -i

Apache NiFi Toolkit Setup
Download NiFi Toolkit
Copy keystore and truststore information from your NiFi conf/nifi.properties
Create a nifi.properties file linked to the cli.sh
baseUrl=https://nvidia-desktop:8443
keystore=/home/nvidia/nvme/nifi-1.15.3/conf/keystore.p12
keystoreType=PKCS12
keystorePasswd=5325343412efaab3123c6892d93
keyPasswd=53134eee99da9dbe9349123aa17c6892d93
truststore=/home/nvidia/nvme/nifi-1.15.3/conf/truststore.p12
truststoreType=PKCS12
truststorePasswd=93498Dfdjfhujdhure8d8hfd84j3n43jd
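
The toolkit properties file on this slide is a plain key=value file, so a quick check before running cli.sh can catch a missing entry. The Python sketch below assumes the eight keys shown on the slide are the ones needed; parse_properties and missing_keys are hypothetical helper names, and the sample paths are placeholders rather than real credentials.

```python
# Keys taken from the slide; treated here as the required set (an assumption).
REQUIRED_KEYS = [
    "baseUrl", "keystore", "keystoreType", "keystorePasswd",
    "keyPasswd", "truststore", "truststoreType", "truststorePasswd",
]


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not props.get(k)]


sample = """\
# placeholder values, not real credentials
baseUrl=https://nvidia-desktop:8443
keystore=/path/to/keystore.p12
keystoreType=PKCS12
truststore=/path/to/truststore.p12
truststoreType=PKCS12
"""
print(missing_keys(parse_properties(sample)))
```

With the sample above, the three password entries are reported as missing.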

Why Apache Flink?
● Unified computing engine
● Batch processing is a special case of stream processing
● Stateful processing
● Massive Scalability
● Flink SQL for queries, inserts against Pulsar Topics
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite

SQL
select aqi, parameterName, dateObserved, hourObserved, latitude, longitude, localTimeZone, stateCode, reportingArea
from airquality;

select max(aqi) as MaxAQI, parameterName, reportingArea
from airquality
group by parameterName, reportingArea;

select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount, parameterName, reportingArea
from airquality
group by parameterName, reportingArea;
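
To see what the last aggregate query returns, it can be run against a small in-memory table. The Python sketch below uses sqlite3 purely as a stand-in for Flink SQL, with a reduced airquality table and made-up rows (the values and area names are invented for illustration). An ORDER BY is added only to make the printed order stable. Unlike a continuous Flink query over a Pulsar topic, this runs once over a bounded table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table airquality (aqi integer, parameterName text, reportingArea text)"
)
# Made-up sample rows, for illustration only.
rows = [
    (40, "O3", "Area A"),
    (60, "O3", "Area A"),
    (20, "PM2.5", "Area A"),
    (30, "PM2.5", "Area B"),
]
conn.executemany("insert into airquality values (?, ?, ?)", rows)

query = """
select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI,
       count(aqi) as RowCount, parameterName, reportingArea
from airquality
group by parameterName, reportingArea
order by parameterName, reportingArea
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Each output row is one (parameterName, reportingArea) group with its maximum, minimum, average and count of aqi.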

Apache Flink - Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores
[Figure: a streaming data gateway accepts the Pulsar, KoP, MoP, WebSocket and HTTP protocols. Pulsar provides unified batch and stream storage (queuing + streaming) with offload to tiered storage, and unified batch and stream computing (batch + stream) runs on top of it. Pulsar sinks and a microservice move data to the cloud data lake. StreamNative Hub and StreamNative Cloud are also shown.]

Monitoring and Metrics Check
curl http://localhost:8080/admin/v2/persistent/conf/ete/first/stats | python3 -m json.tool
bin/pulsar-admin topics stats-internal persistent://conf/ete/first
curl http://pulsar1:8080/metrics/
bin/pulsar-admin topics peek-messages --count 5 --subscription ete-reader persistent://conf/ete/first
bin/pulsar-admin topics subscriptions persistent://conf/ete/first
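
The curl call above follows a simple pattern: the topic name persistent://conf/ete/first maps onto the path /admin/v2/persistent/conf/ete/first/stats. The Python sketch below builds that URL and, when a broker is reachable, fetches the same JSON. The stats_url and fetch_stats helpers are hypothetical names, and the default base URL is the localhost address used on the slide.

```python
import json
from urllib.request import urlopen


def stats_url(topic: str, base: str = "http://localhost:8080") -> str:
    """Map domain://tenant/namespace/topic to the admin v2 stats URL."""
    domain, _, path = topic.partition("://")
    return f"{base}/admin/v2/{domain}/{path}/stats"


def fetch_stats(topic: str, base: str = "http://localhost:8080") -> dict:
    """Fetch and decode topic stats; requires a reachable broker."""
    with urlopen(stats_url(topic, base), timeout=10) as resp:
        return json.load(resp)


print(stats_url("persistent://conf/ete/first"))
```

Only the URL is printed here; calling fetch_stats("persistent://conf/ete/first") against a running broker should return the document that the curl command pipes into json.tool.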

Cleanup
bin/pulsar-admin topics delete persistent://conf/ete/first
bin/pulsar-admin namespaces delete conf/ete
bin/pulsar-admin tenants delete conf

Metrics: Broker
Broker metrics are exposed under "/metrics" at port 8080.
You can change the port by updating webServicePort to a different port in the broker.conf configuration file.
All the metrics exposed by a broker are labeled with cluster=${pulsar_cluster}.
The name of the Pulsar cluster is the value of ${pulsar_cluster}, configured in the broker.conf file.
These metrics are available for brokers:
● Namespace metrics
○ Replication metrics
● Topic metrics
○ Replication metrics
● ManagedLedgerCache metrics
● ManagedLedger metrics
● LoadBalancing metrics
○ BundleUnloading metrics
○ BundleSplit metrics
● Subscription metrics
● Consumer metrics
● ManagedLedger bookie client metrics
For more information: https://pulsar.apache.org/docs/en/reference-metrics/#broker
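
Because every broker metric carries a cluster label, output from the /metrics endpoint can be filtered by cluster after parsing. The Python sketch below parses Prometheus-style text lines with a simple regular expression and keeps only one cluster. The sample text, its values and the cluster names "standalone" and "other" are invented for illustration, and the parser is deliberately minimal (it does not handle escaped quotes inside label values).

```python
import re

# name{label="value",...} value   (labels are optional)
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str):
    """Yield (name, labels, value) from Prometheus text-format lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            yield m.group("name"), labels, float(m.group("value"))


# Illustrative sample, not captured from a real broker.
sample = '''\
# TYPE pulsar_topics_count gauge
pulsar_topics_count{cluster="standalone",namespace="conf/ete"} 1
pulsar_rate_in{cluster="standalone",namespace="conf/ete"} 12.5
pulsar_rate_in{cluster="other",namespace="conf/ete"} 3
'''

for name, labels, value in parse_metrics(sample):
    if labels.get("cluster") == "standalone":
        print(name, labels["namespace"], value)
```

The same loop could be fed the body returned by the curl http://pulsar1:8080/metrics/ command shown earlier.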

Let’s Keep in Touch!
Tim Spann
Developer Advocate
@PaaSDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw
