Open Source Bristol 30 March 2022
https://www.meetup.com/Open-Source-Bristol/events/284198269/
18:35 // 'Building a Scalable Event Streaming and Messaging Platform using Apache Pulsar for Fintech' // Tim Spann and John Kinson
Today, companies are adopting Apache Pulsar, an open-source messaging and event streaming platform. Pulsar’s scalability and cloud-native capabilities make it uniquely positioned to meet a range of emerging business needs, including AdTech, fraud detection, IoT analytics, microservices development, and payment processing.
Tim Spann and John Kinson will share insights into the modern data streaming landscape, how Apache Pulsar fits into it, and how it can be used for Fintech. John will also talk about the origins of StreamNative as a Commercial Open Source Software company, and how that has shaped the go-to-market strategy.
2. streamnative.io
Tim Spann
Developer Advocate
StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Apache
Pulsar, Apache Flink, Apache Spark, Apache NiFi, Big Data, Cloud,
Trino, Aerospike, IoT and more.
John Kinson
Head of Sales, EMEA
StreamNative
● Startup, Scale-up and Large Enterprise expert
● Building the StreamNative Sales function in EMEA
● Experience:
○ 25+ years of building and selling distributed and embedded systems in
the telecoms, digital media and cloud enterprise software industries
3. Agenda
01 Welcome
02 Introduction to Messaging + Data Streaming
03 Introduction to Apache Pulsar
04 Why Open Source
05 Resources
06 Q&A
4. Messaging
➔ Asynchronous messages triggered by events
➔ Consuming messages regardless of language, system or sender
➔ Queueing
➔ Routing
➔ Work queues
➔ AMQP (originated at JPMorgan Chase)
5. Data Streaming
➔ Perform in Real-Time
➔ Process Events as They Happen
➔ Joining Streams with SQL
➔ Find Anomalies Immediately
➔ Ordering and Arrival Semantics
➔ Continuous Streams of Data
6. When are Messaging and Streaming used?
Accessing historical as well as real-time data
Pub/sub model enables event streams to be sent from multiple producers, and consumed by multiple consumers
To process large amounts of data in a highly scalable way
7. Industry trends
Banking: Transforming from siloed systems to combined data streams
Insurance: Provide faster claim processing, fraud detection and system integration
IoT: Handle huge volumes of data from sensors
8. Apache Pulsar is a Cloud-Native Messaging
and Event-Streaming Platform.
9. Pulsar: Unified Messaging + Data Streaming
Messaging
Ideal for work queues that do not require tasks to be performed in a particular order, for example, sending one email message to many recipients.
RabbitMQ and Amazon SQS are examples of popular queue-based message systems.
10. Messaging
.. and Streaming
Works best in situations where the
order of messages is important—for
example, data ingestion.
Kafka and Amazon Kinesis are
examples of messaging systems that
use streaming semantics for
consuming messages.
14. Using Pulsar with Fintech
Low latency
Geo-replication
Data integrity
High availability
Durability
Multi-tenancy
Multiple data consumers:
Transactions, payment
processing, alerts,
analytics, KYC, fraud
detection with ML & AI
Large data volumes,
high scalability
Financial event
messaging
Many topics, producers,
consumers
15. Why Open
Source Pulsar?
Sijie Guo
ASF Member
Pulsar/BookKeeper PMC
Founder and CEO
Jia Zhai
Pulsar/BookKeeper PMC
Co-Founder
Matteo Merli
ASF Member
Pulsar/BookKeeper PMC
CTO
16. Why Open Source Pulsar?
Our bets and early decisions
● We would get many benefits from an open source model
○ Other companies would help develop the product
○ Better security, code escrow, longevity
● We would keep the core features in the OSS version
● We could build commercial offerings and services around the core product
23. streamnative.io
Industry trends
Notable industries and sectors using data streaming:
Banking - transforming from siloed systems to combined data streams
○ Typical applications of event streaming include banking sector processing of
financial transactions, with multiple customer touchpoints, notifications, and
support for mobile devices
○ Banking data (transactions and metadata) can be streamed in parallel for
fraud detection using ML and AI in near real-time
Insurance - building a single view from multiple data sources to provide faster claim
processing, fraud detection and system integration
IoT - handling huge volumes of data from sensors
24. Pulsar Adoption Use Cases
Adopted Pulsar to replace Kafka in their DSP (Data Streaming Platform).
● 1.5-2x lower in capex cost
● 5-50x improvement in latency
● 2-3x lower in opex due to layered architecture
● Process 10 petabytes/day
Adopted Pulsar to power their billing platform, Midas, which processes hundreds of billions of financial transactions daily. Adoption then expanded to Tencent’s Federated Learning Platform and Tencent Gaming.
Applied Materials is one of the biggest semiconductor hardware and software suppliers in the industry. They adopted Pulsar to build a message bus that ties all of their data together. They previously used Tibco.
25. Agenda
Welcome
Introduction to Messaging + Data Streaming
● What is messaging and data streaming?
● When is it used?
● What are the industry trends?
Introduction to Apache Pulsar
● What it is
● What it enables
● Who uses it today?
● Using Apache Pulsar in FinTech applications
Why Open Source
● Why open source Apache Pulsar?
● What have been the benefits and challenges?
Resources
Q&A
27. Pulsar Adoption Spreads
Tencent serves billions of users and over a million merchants.
Use Case #1: Payments
Early 2019, Tencent
adopts Pulsar to power
their billing platform,
Midas, processing
hundreds of billions of
financial transactions
daily.
Use Case #2: ML/AI
Pulsar adoption
spreads to Tencent’s
Federated Learning
Platform where it
supports trillions of
concurrent federated
learnings every day.
Use Case #3: Gaming
Tencent’s Gaming
Department replaces
Kafka with Pulsar for
its logging pipeline.
28. Founded By The
Creators Of Apache Pulsar
Sijie Guo
ASF Member
Pulsar/BookKeeper PMC
Founder and CEO
Jia Zhai
Pulsar/BookKeeper PMC
Co-Founder
Matteo Merli
ASF Member
Pulsar/BookKeeper PMC
CTO
Data veterans with extensive industry experience
29. Messages - the basic unit of Pulsar
● Value / data payload: The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
● Key: Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
● Properties: An optional key/value map of user-defined properties.
● Producer name: The name of the producer that produces the message. If you do not specify a producer name, the default name is used. Used for message de-duplication.
● Sequence ID: Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Used for message de-duplication.
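The producer name and sequence ID are the two fields that make de-duplication possible. The following is a minimal in-memory sketch, not Pulsar's implementation: `Message` and `DedupTopic` are hypothetical names, and the rule used here (drop any message whose sequence ID is not higher than the last one stored for the same producer) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    """Illustrative stand-in for the message components listed above."""
    value: bytes                       # raw payload bytes
    producer_name: str
    sequence_id: int
    key: Optional[str] = None          # optional key (partitioning, compaction)
    properties: Dict[str, str] = field(default_factory=dict)


class DedupTopic:
    """Toy topic that drops re-sent messages (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self.log: List[Message] = []
        # producer name -> highest sequence ID stored so far
        self._last_seq: Dict[str, int] = {}

    def publish(self, msg: Message) -> bool:
        last = self._last_seq.get(msg.producer_name, -1)
        if msg.sequence_id <= last:
            return False               # duplicate, e.g. a retry after a timeout
        self._last_seq[msg.producer_name] = msg.sequence_id
        self.log.append(msg)
        return True


topic = DedupTopic()
results = [
    topic.publish(Message(b"pay:100", "payments-app", 0)),
    topic.publish(Message(b"pay:250", "payments-app", 1)),
    topic.publish(Message(b"pay:250", "payments-app", 1)),  # retry of the same send
]
```

Here the retried send is rejected, so the payment is stored only once.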
30. Producer-Consumer
Producer: The publisher sends data and doesn't know about the subscribers or their status.
Consumer: The subscriber receives data from the publisher and never directly interacts with it.
All interactions go through the Pulsar topic, and Pulsar handles all communication.
31. Pulsar’s Publish-Subscribe model
[Diagram: Producer 1 and Producer 2 publish to a Topic on the Broker; a Subscription delivers messages to Consumers 1, 2 and 3.]
● Producers send messages.
● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
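The roles above can be sketched with a toy in-memory topic. This is an illustration under stated assumptions, not the Pulsar client API: the `Topic` class, its method names, and the subscription names are hypothetical, acknowledgements are omitted, and each subscription is assumed to start reading from the beginning of the topic.

```python
from typing import Dict, List, Optional


class Topic:
    """Toy named channel: an append-only log plus one cursor per subscription."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[bytes] = []
        self.cursors: Dict[str, int] = {}   # subscription name -> next position

    def send(self, payload: bytes) -> None:
        # Producers only append; they know nothing about consumers.
        self.log.append(payload)

    def subscribe(self, subscription: str) -> None:
        # Assumption: a new subscription starts at the earliest message.
        self.cursors.setdefault(subscription, 0)

    def receive(self, subscription: str) -> Optional[bytes]:
        pos = self.cursors[subscription]
        if pos >= len(self.log):
            return None                      # nothing new for this subscription
        self.cursors[subscription] = pos + 1
        return self.log[pos]


topic = Topic("payments")
topic.subscribe("fraud-detection")
topic.subscribe("analytics")
topic.send(b"tx-1")   # producer 1
topic.send(b"tx-2")   # producer 2

fraud = [topic.receive("fraud-detection"), topic.receive("fraud-detection")]
analytics = [topic.receive("analytics")]
```

Each subscription keeps its own position, so the analytics subscription still has `tx-2` waiting after the fraud-detection subscription has read both messages.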
32. Pulsar Subscription Modes
Different subscription modes
have different semantics:
Exclusive/Failover - guaranteed
order, single active consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
[Diagram: Producer 1 and Producer 2 publish to a Pulsar topic with four subscriptions.]
Subscription A (Exclusive): Consumer A
Subscription B (Failover): Consumer B-1, with Consumer B-2 taking over in case of failure in Consumer B-1
Subscription C (Shared): Consumers C-1 and C-2; keys are mixed across consumers: <K1,V10>, <K2,V21>, <K1,V12> and <K2,V20>, <K1,V11>, <K2,V22>
Subscription D (Key_Shared): Consumers D-1 and D-2; each key stays with one consumer: <K1,V10>, <K1,V11>, <K1,V12> and <K2,V20>, <K2,V21>, <K2,V22>
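The difference between the modes can be sketched as a dispatch rule. This is a simplified model, not Pulsar's dispatcher: the `dispatch` function is hypothetical, Shared is approximated by round-robin, and Key_Shared uses a CRC32 hash of the key as a stand-in for Pulsar's own key hashing.

```python
import zlib
from typing import Dict, List, Tuple

Msg = Tuple[str, int]  # (key, value)


def dispatch(messages: List[Msg], consumers: List[str], mode: str) -> Dict[str, List[Msg]]:
    """Assign each message to a consumer according to the subscription mode."""
    out: Dict[str, List[Msg]] = {c: [] for c in consumers}
    for i, (key, value) in enumerate(messages):
        if mode in ("exclusive", "failover"):
            target = consumers[0]                      # single active consumer
        elif mode == "shared":
            target = consumers[i % len(consumers)]     # round-robin, ignores keys
        elif mode == "key_shared":
            # same key -> same consumer, so per-key order is preserved
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        out[target].append((key, value))
    return out


MESSAGES: List[Msg] = [("K1", 10), ("K1", 11), ("K2", 20), ("K1", 12), ("K2", 21), ("K2", 22)]

exclusive = dispatch(MESSAGES, ["A"], "exclusive")
shared = dispatch(MESSAGES, ["C-1", "C-2"], "shared")
key_shared = dispatch(MESSAGES, ["D-1", "D-2"], "key_shared")
```

In the shared result, messages for one key end up on both consumers, so no cross-consumer order holds. In the key_shared result, every message for a given key lands on a single consumer in its original order.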
33. Messaging
Ordering Guarantees
Topic Ordering Guarantees:
● Messages sent to a single topic or
partition DO have an ordering
guarantee.
● Messages sent to different partitions
DO NOT have an ordering guarantee.
Subscription Mode Guarantees:
● A single consumer can receive
messages from the same partition in
order using an exclusive or failover
subscription mode.
● Multiple consumers can receive
messages from the same key in order
using the key_shared subscription
mode.
35. Unified Messaging Model
[Diagram: Producer 1 and Producer 2 publish messages m0-m4 to a Pulsar topic/partition, consumed through four subscriptions.]
Streaming:
Subscription A (Exclusive): Consumer A-0 receives m0-m4; Consumer A-1 is marked X
Subscription B (Failover): Consumer B-0 receives m0-m4; Consumer B-1 takes over in case of failure in Consumer B-0
Messaging:
Subscription C (Shared): m0-m4 are distributed across Consumers C-1, C-2 and C-3
Subscription D (Key_Shared): <k1,v0>, <k2,v1>, <k3,v2>, <k2,v3>, <k1,v4> are distributed across Consumers D-1, D-2 and D-3 by key
38. Schema Registry
[Diagram: Producers and Consumers each keep a local cache of schemas; the Schema Registry stores schema-1, schema-2 and schema-3 (value=Avro/Protobuf/JSON), each as schema data plus an ID.]
Producers:
● Send (register) the schema if it is not in the local cache
● Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID
Consumers:
● Get the schema by ID if it is not in the local cache
● Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID
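The register-once, look-up-once flow on this slide can be sketched as follows. Everything here is an illustrative assumption rather than Pulsar's schema API: the class names, the integer IDs, the `lookups` counter, and the use of plain JSON as the serialization format are all hypothetical.

```python
import json
from typing import Dict, Tuple


class SchemaRegistry:
    """Toy registry mapping schema definitions to numeric IDs."""

    def __init__(self) -> None:
        self._by_id: Dict[int, str] = {}
        self._ids: Dict[str, int] = {}
        self.lookups = 0                      # counts "get schema by ID" calls

    def register(self, schema: str) -> int:
        if schema not in self._ids:
            new_id = len(self._by_id) + 1
            self._ids[schema] = new_id
            self._by_id[new_id] = schema
        return self._ids[schema]

    def get(self, schema_id: int) -> str:
        self.lookups += 1
        return self._by_id[schema_id]


class Producer:
    def __init__(self, registry: SchemaRegistry) -> None:
        self.registry = registry
        self.cache: Dict[str, int] = {}       # local cache: schema -> ID

    def send(self, schema: str, record: dict) -> Tuple[int, bytes]:
        if schema not in self.cache:          # register only on a cache miss
            self.cache[schema] = self.registry.register(schema)
        return self.cache[schema], json.dumps(record).encode()


class Consumer:
    def __init__(self, registry: SchemaRegistry) -> None:
        self.registry = registry
        self.cache: Dict[int, str] = {}       # local cache: ID -> schema

    def read(self, schema_id: int, payload: bytes) -> Tuple[str, dict]:
        if schema_id not in self.cache:       # fetch only on a cache miss
            self.cache[schema_id] = self.registry.get(schema_id)
        return self.cache[schema_id], json.loads(payload)


PAYMENT_SCHEMA = '{"name": "Payment", "fields": ["id", "amount"]}'  # hypothetical schema

registry = SchemaRegistry()
producer, consumer = Producer(registry), Consumer(registry)
wire = [producer.send(PAYMENT_SCHEMA, {"id": i, "amount": 10 * i}) for i in (1, 2)]
decoded = [consumer.read(schema_id, payload) for schema_id, payload in wire]
```

Only the schema ID travels with each message, and the consumer contacts the registry once even though it reads two messages.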
39. Pulsar Functions
A serverless event streaming framework
● Lightweight computation similar to AWS Lambda.
● Specifically designed to use Apache Pulsar as a message bus.
● Function runtime can be located within Pulsar Broker.
40. Pulsar Functions
● Consume messages from one or more Pulsar topics.
● Apply user-supplied processing logic to each message.
● Publish the results of the computation to another topic.
● Support multiple programming languages (Java, Python, Go)
● Can leverage 3rd-party libraries to support the execution of ML models on the edge.
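As a sketch of the consume, process, publish pattern: Pulsar's language-native Python interface accepts a plain `process(input)` function, with the input and output topics supplied when the function is deployed rather than in the code. The message format (`account,amount`), the threshold, and the status labels below are hypothetical and chosen only to illustrate a fintech-style check.

```python
THRESHOLD = 10_000.0  # hypothetical review threshold, not from the talk


def process(input):
    """Tag each payment; the returned string is what gets published downstream.

    Assumes each incoming message is a string of the form "account,amount".
    """
    account, amount = input.split(",")
    status = "REVIEW" if float(amount) > THRESHOLD else "OK"
    return f"{account},{amount},{status}"


# Local check without a broker: call the function directly.
flagged = process("acct-1,25000")
passed = process("acct-2,12.50")
```

Because the logic is an ordinary function, it can be unit-tested locally before it is deployed.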
41. Moving Data In and Out of Pulsar
IO/Connectors are a simple way to integrate with external systems and move
data in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/
● Built on top of Pulsar Functions
● Built-in connectors - hub.streamnative.io
Source Sink
45. Review: Key Pulsar Terminology
● Producer is a process that publishes messages to a topic.
● Consumer is a process that establishes a subscription to a topic
and processes messages published to that topic.
● Subscription: A subscription is a named configuration rule that
determines how messages are delivered to consumers. Four
subscription modes are available in Pulsar: exclusive, shared,
failover, and key-shared.
● Brokers handle the connections and route messages.
● Topics are named channels for transmitting messages from
producers to consumers. Partitioned Topics are “virtual” topics
composed of multiple topics.
● Messages belong to a topic and contain an arbitrary payload.
● Instance is a group of clusters that
act together as a single unit.
● Cluster is a set of Pulsar brokers,
ZooKeeper quorum, and an
ensemble of BookKeeper bookies.
● Tenants are the administrative unit
for allocating capacity and enforcing
an authentication/authorization
scheme.
● Namespaces are a grouping
mechanism for related topics.
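Tenants, namespaces and topics come together in the topic name, which in Pulsar takes the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). A minimal sketch of splitting such a name into its parts; the `parse_topic` helper and the example tenant and namespace names are hypothetical.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into domain, tenant, namespace, topic."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in: {name!r}")
    return TopicName(domain, *parts)


parsed = parse_topic("persistent://fintech/payments/transactions")
```

In this example `fintech` would be the tenant, `payments` the namespace grouping related topics, and `transactions` the topic itself.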
46. The Need For Real-Time Data
Hybrid and multi-cloud
strategies with native
geo-replication
Seamlessly build
microservice architectures
with support for streaming
and messaging workloads
Built for Kubernetes
CloudNative
migrations with tools
360 degree customer data
multi-tenancy, infinite
retention, and extensive
connector ecosystem
48. Background
● Provides a data platform
for the cloud
● Customers include 92 of
the Fortune 100
● Core use cases include
real-time monitoring,
interactive applications,
log processing & analytics,
IoT analytics, streaming
data transformation,
real-time analytics &
event-driven workflows
Why Pulsar
● Scalability
● Durability
● Fault Tolerance
● High Availability
● Sharing & Isolation
● Messaging Models
● Persistence
● Client Languages
● Deployment in k8s
● Operability
● Disaster Recovery
● TCO
● Community & Adoption
Benefits
● 1.5-2x lower in capex
cost
● 5-50x improvement in
latency
● 2-3x lower in opex due to
layered architecture
● Processes billions of
messages/day in
production
49. Background
● The third-largest payment
provider in China behind
Alipay and WeChat
Payment
● 500 million registered users
and 41.9 million active users
● Need to improve the
efficiency of fraud detection
for mobile payments
● Current lambda architecture
of Kafka + Hive is complex
and difficult to maintain
Benefits
● Reduce complexity by 33%
(clusters reduced from six to
four)
● Improve production
efficiency by 11 times
● Higher stability due to the
unified architecture
Why Pulsar
● Cloud-native architecture
and segment-centric
storage
● Pulsar is able to do both
streaming and batch
processing
● Able to build a unified
data processing stack
with Pulsar and Spark,
streamlining messy
operations problems
50. StreamNative Customer Spotlight:
Background
● Flipkart is the largest
e-commerce company
in India with $6B+ in
annual revenue
● Company-wide
messaging platform,
supporting different
types of streaming use
cases, including:
payment processing,
order tracking,
warehouse, logistics, etc.
Why StreamNative
● Work with the original
developers of Pulsar and
top Pulsar engineers
● Experience operating
large scale,
geo-replicated
messaging systems
● 24 x 7 support to
support mission-critical
business applications
Benefits
● Able to handle spikes in
traffic without manual
rebalancing or system failure
● Reduced operational
complexity and total cost of
ownership
● Support the move to cloud
51. StreamNative Customer Spotlight:
Background
● Narvar provides
e-commerce supply chain
management software,
powering 300 retailers and
650 brands
● Core use case:
asynchronous processing
to distribute tasks between
the various systems,
including individual
retailers’ ordering and
warehouse management
applications
Why StreamNative
● Work with the original
developers of Pulsar and
top Pulsar engineers
● “Before we began working
with StreamNative, Sijie
Guo and his team helped us
work out some production
issues. We were very
impressed by how quickly
they solved our problems
and their willingness to
help.” - Ankush Goyal
Benefits
● Accelerate application
development
● Able to handle spikes in
traffic without manual
rebalancing or system failure
● Reduced customer issues
52. streamnative.io
Passionate and dedicated team.
Founded by the original developers of
Apache Pulsar.
StreamNative helps teams to capture,
manage, and leverage data using Pulsar’s
unified messaging and streaming
platform.