Yahoo recently open-sourced Pulsar, a highly scalable, low-latency pub-sub messaging system that runs on commodity hardware. It provides simple pub-sub messaging semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication. Pulsar is used across various Yahoo applications for large-scale data pipelines. This talk covers Pulsar's architecture and use cases.
Speaker:
Matteo Merli, Pulsar team at Yahoo
2. What is Pulsar?
▪ Hosted multi-tenant pub/sub messaging platform
▪ Simple messaging model
▪ Horizontally scalable - Topics, Message throughput
▪ Ordering, durability & delivery guarantees
▪ Geo-replication
▪ Easy to operate (Add capacity, replace machines)
▪ A few numbers from production usage:
› 1.5 years — 1.4 M topics — 100 B msg/day — Zero data loss
› Average publish latency < 5 ms, 99th percentile < 15 ms
› 80+ applications onboarded — Self-serve provisioning
› Presence in 8 data centers
Pulsar
3. Common use cases
▪ Application integration
› Server-to-server control, status propagation, notifications
▪ Persistent queue
› Stream processing, buffering, feed ingestion, tasks dispatcher
▪ Message bus for large scale data stores
› Durable log
› Replication within and across geo-locations
4. Main features
▪ REST / Java / Command-line administrative APIs
› Provision users / grant permissions
› User self-administration
› Metrics for topic / broker usage
▪ Multi-tenancy
› Authentication / Authorization
› Storage quota management
› Tenant isolation policies
› Message TTL
› Backlog and subscriptions management tools
▪ Message retention and replay
› Rollback to redeliver already acknowledged messages
5. Why build a new system?
▪ No existing solution to satisfy requirements
› Multi tenant — 1M topics — Low latency — Durability — Geo replication
▪ Kafka doesn’t scale well with many topics:
› Storage model based on individual directory per topic partition
› Enabling durability kills performance
▪ Ability to manage large backlogs
▪ Operations are not very convenient
› e.g. replacing a server requires manual commands to copy data and involves clients
› Giving clients access to ZK clusters is not desirable
▪ No scalable support to keep consumer position
6. Messaging Model
[Diagram: Producer-X and Producer-Y publish to Topic-T. Subscription-A (Exclusive) feeds Consumer-A1; Subscription-B (Shared) feeds Consumer-B1, Consumer-B2, and Consumer-B3. Consumer-A1 receives all messages published on T; B1, B2, B3 each receive one third.]
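The dispatch rule in the diagram above can be sketched with plain JDK code. This is a toy model of the two subscription types, not the Pulsar client API: an Exclusive subscription delivers every message to its single consumer, while a Shared subscription round-robins messages across all attached consumers. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Toy dispatcher illustrating Exclusive vs. Shared subscription semantics.
class SubscriptionSketch {
    enum Type { EXCLUSIVE, SHARED }

    final Type type;
    final List<List<String>> consumers = new ArrayList<>(); // one inbox per consumer
    private int next = 0; // round-robin cursor used by SHARED

    SubscriptionSketch(Type type, int numConsumers) {
        this.type = type;
        for (int i = 0; i < numConsumers; i++) consumers.add(new ArrayList<>());
    }

    void dispatch(String msg) {
        if (type == Type.EXCLUSIVE) {
            consumers.get(0).add(msg);          // the single consumer gets everything
        } else {
            consumers.get(next).add(msg);       // spread messages across consumers
            next = (next + 1) % consumers.size();
        }
    }
}
```

With one exclusive consumer and three shared consumers, publishing nine messages reproduces the split from the diagram: A1 holds all nine, B1/B2/B3 hold three each.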
7. Client API
Producer

    PulsarClient client = PulsarClient.create(
            "http://broker.usw.example.com:8080");

    Producer producer = client.createProducer(
            "persistent://my-prop/us-west/my-ns/my-topic");

    // Handles retries in case of failure
    producer.send("my-message".getBytes());

    // Async version:
    producer.sendAsync("my-message".getBytes())
            .thenAccept(msgId -> {
                // Message was persisted
            });

Consumer

    PulsarClient client = PulsarClient.create(
            "http://broker.usw.example.com:8080");

    Consumer consumer = client.subscribe(
            "persistent://my-prop/us-west/my-ns/my-topic",
            "my-subscription-name");

    while (true) {
        // Wait for a message
        Message msg = consumer.receive();

        // Process the message ...

        // Acknowledge the message so that
        // the broker can delete it
        consumer.acknowledge(msg);
    }
8. Main client library features
▪ Sync / Async operations
▪ Partitioned topics
▪ Transparent batching of messages
▪ Compression
▪ End-to-end checksum
▪ TLS encryption
▪ Individual and cumulative acknowledgment
▪ Client side stats
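The end-to-end checksum in the list above means the producer attaches a checksum to each payload and it is verified after transport, so corruption anywhere along the path is detected. A minimal sketch of the idea using the JDK's CRC32 (this is only an illustration of the concept, not Pulsar's actual wire format; the class and method names are made up):

```java
import java.util.zip.CRC32;

// Sketch of an end-to-end payload checksum: compute at the producer,
// carry alongside the message, verify at the receiving end.
class ChecksumSketch {
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    static boolean verify(byte[] payload, long expected) {
        return checksum(payload) == expected;
    }
}
```

A flipped byte anywhere in the payload makes verification fail, which is exactly the class of transport or storage corruption an end-to-end checksum is meant to catch.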
9. Architecture
Separate layers between brokers and storage (bookies)
‣ Brokers and bookies can be added independently
‣ Traffic can be shifted very quickly across brokers
‣ New bookies will ramp up on traffic quickly
[Diagram: a Pulsar cluster in which Producer and Consumer connect to Brokers 1-3; the brokers store data on Bookies 1-5, with ZK (ZooKeeper) coordinating the cluster.]
11. BookKeeper
▪ Replicated log service
▪ Offers consistency and durability
▪ Why is it a good choice for Pulsar?
› Very efficient storage for sequential data
› Each topic's data is stored as multiple ledgers created over time
› Very good distribution of IO across all bookies
› Isolation of writes and reads
› Flexible model for quorum writes with different tradeoffs
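The quorum model mentioned above is parameterized by three numbers: the ensemble size (bookies available to the ledger), the write quorum (copies written per entry), and the ack quorum (acks needed before the write is reported successful). A stdlib-only sketch of the idea, with entries striped round-robin across the ensemble; the class and method names are illustrative, not the BookKeeper API:

```java
// Sketch of BookKeeper-style quorum writes: each entry goes to
// writeQuorum bookies out of the ensemble, and the write succeeds
// once ackQuorum of those bookies respond.
class QuorumSketch {
    final int ensembleSize;   // E:  bookies available for the ledger
    final int writeQuorum;    // Qw: copies written per entry
    final int ackQuorum;      // Qa: acks needed before success (Qa <= Qw <= E)

    QuorumSketch(int e, int qw, int qa) {
        if (!(qa <= qw && qw <= e)) throw new IllegalArgumentException();
        ensembleSize = e; writeQuorum = qw; ackQuorum = qa;
    }

    // Bookies chosen for entry `id`: striped round-robin across the
    // ensemble, which spreads IO over all bookies.
    int[] bookiesFor(long id) {
        int[] b = new int[writeQuorum];
        for (int i = 0; i < writeQuorum; i++)
            b[i] = (int) ((id + i) % ensembleSize);
        return b;
    }

    // Is the write successful given how many of the Qw writes acked?
    boolean acked(int responses) { return responses >= ackQuorum; }
}
```

The tradeoff is visible in the parameters: a larger write quorum buys durability at the cost of write amplification, while a smaller ack quorum lowers latency at the cost of weaker guarantees at ack time.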
12. BookKeeper - Storage
▪ A single bookie can serve and store thousands of ledgers
▪ Writes go to the journal; reads come from the ledger device:
› Avoids read activity impacting write latency
› Writes are added to an in-memory write cache and committed to the journal
› The write cache is flushed in the background to a separate ledger device
▪ Entries are sorted to allow for mostly sequential reads
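The write path described above can be sketched with JDK collections standing in for the two devices. This is a simplified model of the flow, not actual bookie code, and all names are illustrative: an entry is appended to the journal for durability and kept in an in-memory write cache; a background flush moves cached entries, sorted by (ledger, entry), to the ledger device, so reads never touch the journal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a bookie's storage: journal for durability,
// in-memory write cache for fresh entries, sorted ledger device for reads.
class BookieStorageSketch {
    final List<Long> journal = new ArrayList<>();               // append-only
    final TreeMap<Long, byte[]> writeCache = new TreeMap<>();
    final TreeMap<Long, byte[]> ledgerDevice = new TreeMap<>();

    private static long key(long ledgerId, long entryId) {
        return (ledgerId << 32) | entryId;  // sortable (ledger, entry) key
    }

    void addEntry(long ledgerId, long entryId, byte[] data) {
        journal.add(key(ledgerId, entryId));     // committed to journal first
        writeCache.put(key(ledgerId, entryId), data);
    }

    void flush() {                               // background flush
        ledgerDevice.putAll(writeCache);         // TreeMap keeps entries sorted
        writeCache.clear();
    }

    byte[] read(long ledgerId, long entryId) {   // reads never hit the journal
        byte[] d = writeCache.get(key(ledgerId, entryId));
        return d != null ? d : ledgerDevice.get(key(ledgerId, entryId));
    }
}
```

Keeping the journal append-only is what makes write latency independent of read activity: the journal device only ever does sequential writes, while reads are absorbed by the cache and the sorted ledger device.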
14. Final Remarks
• Check out the code and docs at github.com/yahoo/pulsar
• Give feedback or ask for more details on mailing lists:
• Pulsar-Users
• Pulsar-Dev