Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
At Dropbox we are currently handling approximately 10,000,000 messages per second at peak across our handful of Kafka clusters. The largest of these has hit throughputs of 7,000,000 messages per second (~30 Gbps) on only 20 nodes. We'll walk you through the steps we took to get where we are, the designs that work for us, and those that didn't. We'll talk about the tooling we had to build and the tooling we want to see exist.
We'll dive deeper into configuration and provide a blueprint you can follow. We'll talk about the trials and tribulations of using Kafka, including ways we've set our clusters on fire, ways we've lost data, ways we've turned our hair gray, and ways we've heroically saved the day for our users. Finally, we'll spend time on some of the work we're doing to handle consumer coordination across our many different systems and to integrate Kafka into a well-established corporate infrastructure (i.e., making Kafka "play nice" with everybody).

Deploying Kafka at Dropbox
Alternately: how to handle 10,000,000 QPS in one cluster (but don't)
The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans

Your Speakers
• Mark Smith <zorkian@dropbox.com>
  formerly of Google, Bump, StumbleUpon, etc.
  likes small airplanes and not getting paged
• Sean Fellows <fellows@dropbox.com>
  formerly of Google
  likes corgis and distributed systems

The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans

Dropbox
• Over 500 million signups
• Exabyte-scale storage system
• Multiple hardware locations + AWS

Log Events
• Wide distribution (1,000 categories)
• Several do >1M QPS each + long tail
• About 200TB/day (raw)
• Payloads range from empty to 15MB JSON blobs

Current System
• Existing system based on Scribe + HDFS
• Aggregate to single destination for analytics
• Powers Hive and standard map-reduce type analytics
Want: real-time stream processing!

The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans
Initial Design
• One big cluster
• 20 brokers: 96GB RAM, 16x2TB disk, JBOD config
• ZK ensemble run separately (5 members)
• Kafka 0.8.2 from GitHub
• LinkedIn configuration recommendations
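A broker laid out like this implies a `server.properties` shaped roughly as follows. This is a sketch only: the broker id, paths, and hostnames are illustrative, since the slide gives no actual values.

```properties
# Sketch of a 0.8.2-era JBOD broker config; id, paths, and hostnames
# are assumptions, not Dropbox's actual configuration.
broker.id=1
# JBOD: one log directory per spindle (16 entries for a 16x2TB box)
log.dirs=/srv/kafka/d01,/srv/kafka/d02,/srv/kafka/d03,/srv/kafka/d04
# Separately run 5-member ZooKeeper ensemble
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181
```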
The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans

Unexpected Catastrophes
• Disk failures or disks reaching 100% capacity
• Repair is manual; data won't expire unless caught up
• Crash looping, controller load
• Simultaneous restarts
• Even graceful, recovery is sometimes very bad (even in 0.9!)
• Rebalancing is dangerous
• Saturates disks; partitions fall out of ISRs, go offline, etc.

System Errors
• Controller issues
• Sometimes goes AWOL with e.g. big rebalances
• Can have multiple controllers (during serial operations)
• Cascading OOMs
• Too many connections

Lack of Tooling
• Usually left to the reader
• Few best practices
• But we love Kafka Manager
• More to come later!

Newer Clients
• State of Go/Python clients
• Bad behavior at scale
• Laserbeam, retries, backoff
• Too many connections == OOM
• Good clients take time
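The retry/backoff bullet is the heart of the "bad behavior at scale" problem: when a broker hiccups, thousands of clients retrying in lockstep (laserbeaming one box) can push it into OOM. A minimal sketch of the usual fix, capped exponential backoff with full jitter; the function name and defaults are illustrative, not from any Dropbox client:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Capped exponential backoff with full jitter.

    attempt 0 -> up to 0.1s, attempt 1 -> up to 0.2s, ..., capped at 30s.
    Randomizing over the full window spreads reconnects out so
    thousands of clients do not hit a recovering broker at once.
    """
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, window)
```

The full-jitter variant trades a slightly longer average wait for much better de-synchronization than plain exponential backoff.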
Bad Configs
• Many, many tunables -- lots of rope
• Unclean leader election
• Preferred leader automation
• Disk threads (thanks Gwen!)
• Little modern documentation on running at scale
• Todd Palino helped us out early, though, so thank you!
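For reference, the three tunables called out above correspond to real broker settings; the values shown are common conservative choices, not necessarily what Dropbox ran:

```properties
# Never elect an out-of-sync replica as leader (risks silent data loss)
unclean.leader.election.enable=false
# "Preferred leader automation": periodic leadership rebalancing; handy,
# but can move a lot of leadership at once on a big cluster
auto.leader.rebalance.enable=true
# "Disk threads": parallel log recovery per data dir, which speeds up
# restart recovery on JBOD boxes with many partitions
num.recovery.threads.per.data.dir=8
```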
The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans

Hardware
• Hardware RAID 10
• ~25TB usable/box (spinning rust)
• During broker replacement
• 200ms p99 commit latency down to 10ms!
• Failure tolerance, full disk protection
• Canary cluster

Monitoring
• MPS vs QPS (metadata reqs!)
• Bad Stuff graph
• Disk utilization/latency
• Heap usage
• Number of controllers
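The MPS-vs-QPS distinction matters because metadata requests inflate request counts without carrying any messages, so QPS alone overstates produce load. A hedged sketch of deriving both rates from two samples of monotonically increasing counters; the counter names here are placeholders, not real broker metric names:

```python
def rates(prev, curr, interval_s):
    """Per-second rates from two samples of monotonically increasing
    counters (e.g. scraped from brokers at a fixed interval).

    Distinguishes MPS (messages actually produced) from QPS (all
    requests, including metadata fetches that carry no messages).
    """
    mps = (curr["messages_in"] - prev["messages_in"]) / interval_s
    qps = (curr["total_requests"] - prev["total_requests"]) / interval_s
    return {"mps": mps, "qps": qps}
```

A large gap between the two (QPS far above MPS) is a hint that clients are hammering the cluster with metadata requests rather than doing useful work.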
Tooling
• Rolling restarter (health checks!)
• Rate-limited partition rebalancer (MPS)
• Config verifier/enforcer
• Coordinated consumption (pre-0.9)
• Auditing framework
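A rolling restarter of the kind listed above reduces to a small, testable policy: restart one broker, block until a health check passes (e.g. zero under-replicated partitions), then move on. This is a sketch under the assumption that the restart and health-check actions are injected as callables; it is not Dropbox's actual tool:

```python
import time

def rolling_restart(brokers, restart, healthy,
                    poll_s=1.0, timeout_s=600.0, sleep=time.sleep):
    """Restart brokers one at a time, waiting for the cluster to pass
    a health check before touching the next broker.

    restart(broker) and healthy() are injected so the policy can be
    tested; a real tool would shell out or call an admin API.
    """
    for broker in brokers:
        restart(broker)
        deadline = time.monotonic() + timeout_s
        while not healthy():
            if time.monotonic() > deadline:
                raise RuntimeError(f"{broker} unhealthy after restart")
            sleep(poll_s)
```

Serializing the restarts is what keeps recovery survivable: the "simultaneous restarts" failure mode from the Iterations of Woe slides is exactly what this loop prevents.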
Customer Culture
• Topics : organization :: partitions : scale
• Do not hash to partitions
• No ordering requirements
• Namespaces and ownership are required
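The "do not hash to partitions" rule can be illustrated with a toy simulation: hashing by key pins every message for a hot key onto one partition, while key-free round-robin stays even regardless of key skew (and survives partition-count changes, which remap all keyed assignments). Illustrative code only, not a real Kafka partitioner:

```python
from collections import Counter
from itertools import cycle

def keyed_assignment(keys, num_partitions):
    """Hash-based assignment: all messages for one key land on one
    partition, so a hot key produces a hot partition."""
    return Counter(hash(k) % num_partitions for k in keys)

def round_robin_assignment(n_messages, num_partitions):
    """Key-free assignment: messages spread evenly no matter how
    skewed the key distribution is."""
    parts = cycle(range(num_partitions))
    return Counter(next(parts) for _ in range(n_messages))
```

With a workload that is 90% one key, the keyed scheme concentrates at least 90% of messages on a single partition, while round-robin keeps partition counts within one message of each other.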
Success!
• Kafka goes fast (18M+ MPS on 20 brokers)
• Multiple parallel consumption
• Low latency (at high produce rates)
• 0.9 is leaps ahead of 0.8.2 (upgrade!)
• Supportable by a small team (at our scale)

The Plan
• Welcome
• Use Case
• Initial Design
• Iterations of Woe
• Current Setup
• Future Plans

The Future
• Big is fun but has problems
• Open source our tooling
• Moving towards replication
• Automatic up-partitioning and rebalancing
• Expanding auditing to clients
• Low volume latencies

Deploying Kafka at Dropbox
• Mark Smith <zorkian@dropbox.com>
• Sean Fellows <fellows@dropbox.com>
We would love to talk with other people who are running Kafka at similar scales. Email us! And... questions! (If we have time.)
