Streaming Millions of Contact Center Interactions in (Near) Real-Time with Pulsar - Pulsar Summit NA 2021

Lessons learned in how to configure Pulsar on Kubernetes to handle millions of topics.
Slide 1: Streaming millions of Contact Center interactions in (near) real-time with Pulsar
Frank Kelly, Principal Engineer, Cogito Corp
Slack: https://apache-pulsar.slack.com/
A panoply of parameters
Slide 2: Overview
● Cogito & what we do
● Architecture & use-cases
● Challenges
● Initial lessons learned
● Kubernetes lessons learned
● Performance & scaling settings
● Results
● Q&A
Intended audience: those who understand the main APIs and components but who may not be familiar with all the configuration settings, or with how to optimize the system for high write throughput and/or millions of topics.
Slide 3: Cogito: Who we are and what we do
Formed in 2007 out of MIT, based out of Boston, now with a global engineering footprint.
Vision: Elevating the human connection in real time.
Product: Call center AI solution that analyzes the human voice and provides real-time guidance to enhance emotional intelligence and customer service.
Slide 4: Architecture
Slide 5: Use-Cases for Pulsar
● Streaming: real-time audio and analytic results from our AI/ML models
● We break each customer call into separate logical units called “intervals”
● Each interval is backed by two Pulsar topics (see the sketch after this slide)
○ Real-time Audio Topic
○ Real-time Analytics Topic
● Splitting binary formats into discrete messages → deduplication is VERY important!
● With 15,000 concurrent users we estimate 1.5m to 2m topics per day
● Each topic has moderate throughput ~ 32 Kb/s
● Also Messaging: work-queue events
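A minimal sketch of how one call interval could map onto its two topics. The tenant/namespace (cogito/streaming), topic names, interval id and service URL are illustrative assumptions, not the actual naming scheme:

    // Sketch: one producer per interval topic. Names and URLs are assumptions.
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class IntervalTopics {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            String intervalId = "call-1234-interval-0";   // hypothetical interval id

            // Each interval is backed by two topics: raw audio and analytics results.
            Producer<byte[]> audio = client.newProducer(Schema.BYTES)
                    .topic("persistent://cogito/streaming/audio-" + intervalId)
                    .create();
            Producer<byte[]> analytics = client.newProducer(Schema.BYTES)
                    .topic("persistent://cogito/streaming/analytics-" + intervalId)
                    .create();

            audio.send("...pcm frame...".getBytes());
            analytics.send("{\"sentiment\":0.8}".getBytes());

            audio.close();
            analytics.close();
            client.close();
        }
    }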
Slide 6: The Challenges
● Streaming use-case
○ Lots of throughput ~ 10 Gbps
○ Message ordering & deduplication are critical
○ Near real-time requirements (< 250 ms)
■ Think about timeouts/retries/failover
● Challenges
○ ZooKeeper stores all the topics for a namespace under one ZNode
○ Brokers require more memory
● Alternatives considered
○ Using Key_Shared would require us to disable batching in the producer (not a huge deal; sketch after this slide)
○ Risk: message dispatch will stop if there is a subscription/consumer that has built up a backlog of messages in its hash range
○ Filtering on the client side
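A sketch of the Key_Shared alternative considered above, assuming a hypothetical shared topic, key and subscription name. Batching is disabled on the producer and ordering holds per key; a consumer with a backlog can stall dispatch for its hash range:

    // Sketch of the Key_Shared alternative; names are illustrative assumptions.
    import org.apache.pulsar.client.api.*;
    import java.util.concurrent.TimeUnit;

    public class KeySharedAlternative {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // Key_Shared requires batching to be disabled on the producer.
            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://cogito/streaming/all-intervals")
                    .enableBatching(false)
                    .create();
            producer.newMessage()
                    .key("call-1234-interval-0")          // ordering is per key
                    .value("...frame...".getBytes())
                    .send();

            // Each consumer in the subscription owns a hash range of keys;
            // a slow consumer with a backlog can stall dispatch for its range.
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://cogito/streaming/all-intervals")
                    .subscriptionName("analytics-workers")
                    .subscriptionType(SubscriptionType.Key_Shared)
                    .subscribe();

            Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
            if (msg != null) {
                consumer.acknowledge(msg);
            }
            client.close();
        }
    }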
Slide 7: Initial Lessons on the Basics
● Processing real-time binary streams
○ Consumer: SubscriptionInitialPosition.Earliest
○ Broker configuration: brokerDeduplicationEnabled: "true"
● Client performance (see the client sketch after this slide)
○ Producer: sendAsync() ~ 10x improvement
○ Producer: blockIfQueueFull(true)
○ Batching: enabled, but the throughput per producer is so low it rarely helps
● Default timeouts
○ For our real-time system the default connection/operation timeout of 30 s is too high
● Persistent vs. non-persistent
○ We support both use-cases (some customers want zero persistence)
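A sketch of these client settings in the Pulsar Java client. The topic and subscription names and the 5-second timeouts are assumptions chosen for illustration:

    // Sketch of the client-side lessons above; names and timeout values are assumptions.
    import org.apache.pulsar.client.api.*;
    import java.util.concurrent.TimeUnit;

    public class RealTimeClientSettings {
        public static void main(String[] args) throws Exception {
            // Tighter connection/operation timeouts than the 30 s defaults.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .connectionTimeout(5, TimeUnit.SECONDS)
                    .operationTimeout(5, TimeUnit.SECONDS)
                    .build();

            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://cogito/streaming/audio-example")
                    .blockIfQueueFull(true)      // back-pressure instead of a queue-full error
                    .create();

            // Non-blocking sends for throughput.
            producer.sendAsync("...frame...".getBytes())
                    .thenAccept(msgId -> System.out.println("persisted as " + msgId));
            producer.flush();

            // Consumers start from the earliest message so no part of the stream is lost.
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://cogito/streaming/audio-example")
                    .subscriptionName("stream-processor")
                    .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                    .subscribe();

            client.close();
        }
    }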
Slide 8: Disk Space Challenges
● 15k users ⇒ ingress of 5 Gbps audio data ⇒ 20 TB in a 12-hour window
● Open subscriptions keep the topic data from being deleted
○ Code: pulsarAdmin.namespaces().setSubscriptionExpirationTime() (sketch after this slide)
○ Broker deduplication has its own subscription
■ brokerDeduplicationEntriesInterval: "50" (default: 1000)
■ brokerDeduplicationProducerInactivityTimeoutMinutes: "15" (default: 360)
● Bookie compaction thresholds (delete more and do it more frequently)
○ majorCompactionInterval / majorCompactionThreshold
○ minorCompactionInterval / minorCompactionThreshold
○ compactionRate
● Tiered storage
○ Although we use some tiered storage, over time there would still be too many topics in ZK
○ Created our own stream offload that stores the S3 location in an RDS database
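A sketch of expiring stale subscriptions through the admin API so that open subscriptions do not pin topic data indefinitely. The namespace, admin URL and the 30-minute expiry are assumptions:

    // Sketch: expire inactive subscriptions on a namespace; values are assumptions.
    import org.apache.pulsar.client.admin.PulsarAdmin;

    public class ExpireStaleSubscriptions {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080")
                    .build();

            // Subscriptions idle for 30 minutes are removed, so retention and
            // compaction can reclaim the backing ledgers.
            admin.namespaces().setSubscriptionExpirationTime("cogito/streaming", 30);

            admin.close();
        }
    }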
Slide 9: Kubernetes Lessons
● Which Helm chart?
○ Apache Pulsar (“Official”) vs. StreamNative (also “Official”) vs. Kafkaesque
● GC settings
○ Java ergonomics: -XX:+PrintFlagsFinal
○ GC settings tied to pod memory: -Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g
○ resources.requests.memory = heap + direct memory + some buffer
○ Looking forward to modern JVM settings, e.g. -XX:MaxRAMPercentage=75
● Most Helm charts set requests but not limits; we set requests == limits
○ JVM memory is not elastic
○ CPU is, but we experienced a lot of throttling from the K8s scheduler
● Istio service mesh
○ Integration with Istio for mTLS and service-level authorization took a chunk of time
Slide 10: Passive Monitoring with Prometheus / Grafana
● Config: exposeTopicLevelMetricsInPrometheus: "false"
Slide 11: Active Monitoring with Prometheus Alerts
● Integration with Prometheus Alerting to Slack / PagerDuty
Slide 12: Real-Time / Scaling Journey Lessons
● Namespace bundles (per-namespace admin sketch after this slide)
○ For 15 brokers: defaultNumberOfNamespaceBundles: "128" (default: 4)
● Pulsar load balancer
○ Bundle split disabled due to https://github.com/apache/pulsar/issues/5510
○ loadBalancerAutoBundleSplitEnabled: "false"
● Balancing throughput, durability and reliability across bookies
○ managedLedgerDefaultEnsembleSize: "N"
○ managedLedgerDefaultWriteQuorum: "2"
○ managedLedgerDefaultAckQuorum: "1"
○ Striping is great for write throughput but adds cost for read throughput
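A sketch of applying equivalent settings per namespace through the admin API rather than as broker-wide defaults. The namespace, admin URL and the ensemble size of 3 are assumptions for illustration:

    // Sketch: bundle count and write/ack quorums per namespace; values are assumptions.
    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.common.policies.data.PersistencePolicies;

    public class NamespaceScalingPolicies {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080")
                    .build();

            // More bundles than the default of 4 so topics spread across all brokers.
            admin.namespaces().createNamespace("cogito/streaming", 128);

            // Ensemble of 3 bookies, 2 copies written, ack after 1 (per-namespace
            // counterpart of the managedLedgerDefault* broker settings above).
            admin.namespaces().setPersistence("cogito/streaming",
                    new PersistencePolicies(3, 2, 1, 0.0));

            admin.close();
        }
    }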
Slide 13: Bookie EIO Error
● Error
○ PerChannelBookieClient - Add for failed on bookie bookkeeper-2:3181 code EIO
● Root cause: at peak load the write cache is not big enough to hold accumulated data while waiting on the second cache flush
Slide 14: Bookie EIO Error
● Key Prometheus metrics
○ Bookie
■ bookie_throttled_write_requests
■ bookie_rejected_write_request
○ Broker
■ pulsar_ml_cache_hits_rate
■ pulsar_ml_cache_misses_rate
● Key lesson: the more we read from the broker cache, the less we use the bookie ledger disk (enabling faster flush of write cache → ledger)
Slide 15: Key Scaling Settings
● EBS drives for journal & ledger
○ GP3 with max settings: 16,000 IOPS, 1,000 MB/s
● Broker cache
○ managedLedgerCacheEvictionTimeThresholdMillis: "5000" (default: 1000)
○ managedLedgerCacheSizeMB: "512" (default: 20% of total direct memory)
● Bookie
○ dbStorage_writeCacheMaxSizeMb: "3072" (default: 25% of total direct memory)
○ dbStorage_rocksDB_blockCacheSize: "1073741824" (default: 10% of total direct memory)
○ journalMaxGroupWaitMSec: "10" (default: 1 ms)
● Scaling approach
○ Scale out bookies
○ Scale up and scale out brokers
Slide 16: Latest Results
We’re not at millions yet, but we’re seeing a trend:
1) Simulated 300 users for about 18 hours with artificially short 1-minute calls
2) 500k topics created (250k Audio / 250k Signal Analytics)
Slide 17: Observations: ZooKeeper
● ZK JVM heap demands increasing
Slide 18: Observations: ZooKeeper
● ZK disk usage increasing
Suppressed: java.io.IOException: No space left on device
    at org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:135) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:312) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:406) ~[org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    ...
[Snapshot Thread] ERROR org.apache.zookeeper.server.ZooKeeperServer - Severe unrecoverable error, exiting
Slide 19: Observations: ZooKeeper
● ZK 99th-percentile response times increasing
Slide 20: Observations: Broker
● Broker heap increasing
● Topic metadata lives here as well as in ZK
Slide 21: Implications
1) ZooKeeper
   a) More heap
   b) More CPU for GC (and to avoid throttling during GC)
   c) Watch ZooKeeper disk space (/pulsar/data)
2) Broker
   a) More heap
   b) Maybe more CPU for GC (and to avoid throttling during GC)
   c) Watch for Broker → ZK latency issues
      i) zooKeeperSessionTimeoutMillis: "60000" (default: 30000)
      ii) zooKeeperOperationTimeoutSeconds: "60" (default: 30)
Slide 22: Recap: Key Metrics for our Streaming Use-Case
Slide 23: Thanks
Cogito: Bruce, Hamid, Andy, Jimmy, George, Gibby, Kyle, Matt, Amanda, John, Ian, Mihai, Luis, Anthony, Karl and many more
Pulsar Community: Addison, Sijie, Matteo, Joshua etc.
Slide 24: Thank you!
Slide 25: References
● Benchmarking Pulsar and Kafka - A More Accurate Perspective on Pulsar’s Performance
○ https://streamnative.io/en/blog/tech/2020-11-09-benchmark-pulsar-kafka-performance#maximum-throughput-test
● Taking a Deep-Dive into Apache Pulsar Architecture for Performance Tuning
○ https://streamnative.io/en/blog/tech/2021-01-14-pulsar-architecture-performance-tuning
● Understanding How Apache Pulsar Works
○ https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works