Deploying Kafka at Dropbox, Mark Smith, Sean Fellows

  1. Deploying Kafka at Dropbox Alternatively: how to handle 10,000,000 QPS in one cluster (but don't)
  2. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  3. Your Speakers • Mark Smith <zorkian@dropbox.com>, formerly of Google, Bump, StumbleUpon, etc.; likes small airplanes and not getting paged • Sean Fellows <fellows@dropbox.com>, formerly of Google; likes corgis and distributed systems
  4. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  5. Dropbox • Over 500 million signups • Exabyte scale storage system • Multiple hardware locations + AWS
  6. Log Events • Wide distribution (1,000 categories) • Several do >1M QPS each + long tail • About 200TB/day (raw) • Payloads range from empty to 15MB JSON blobs
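For a sense of scale, here is a quick back-of-the-envelope conversion of the volumes quoted above. It assumes perfectly flat traffic (real log traffic peaks well above average), and the 20-broker figure comes from the initial design described later.

```python
# Rough arithmetic for the volumes quoted above; assumes perfectly flat
# traffic, so real peaks are higher than these averages.
RAW_BYTES_PER_DAY = 200e12   # ~200 TB/day of raw log data
SECONDS_PER_DAY = 86_400
BROKERS = 20                 # initial cluster size described below

avg_ingest = RAW_BYTES_PER_DAY / SECONDS_PER_DAY
print(f"average ingest: {avg_ingest / 1e9:.2f} GB/s")           # ~2.31 GB/s
print(f"per broker:     {avg_ingest / BROKERS / 1e6:.0f} MB/s")  # ~116 MB/s
```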
  7. Current System • Existing system based on Scribe + HDFS • Aggregate to single destination for analytics • Powers Hive and standard map-reduce type analytics • Want: real-time stream processing!
  8. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  9. Initial Design • One big cluster • 20 brokers: 96GB RAM, 16x2TB disk, JBOD config • ZK ensemble run separately (5 members) • Kafka 0.8.2 from Github • LinkedIn configuration recommendations
  10. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  11. Unexpected Catastrophes • Disk failures or disks reaching 100% • Repair is manual, and data won't expire unless caught up • Crash looping, controller load • Simultaneous restarts • Even graceful restarts sometimes recover very badly (even on 0.9!) • Rebalancing is dangerous • Saturates disks; partitions fall out of ISRs, go offline, etc.
  12. System Errors • Controller issues • Sometimes goes AWOL with e.g. big rebalances • Can have multiple controllers (during serial operations) • Cascading OOMs • Too many connections
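One symptom above, "can have multiple controllers", is checkable from ZooKeeper: Kafka records the elected controller's broker id as JSON in the /controller znode, and exactly one broker should hold it at a time. A minimal sketch using the third-party kazoo client; the ZooKeeper addresses are placeholders.

```python
# Minimal sketch: read the active controller from ZooKeeper's /controller
# znode. ZooKeeper hosts are placeholders.
import json
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
try:
    data, _stat = zk.get("/controller")
    controller = json.loads(data.decode("utf-8"))
    print("active controller broker id:", controller["brokerid"])
except NoNodeError:
    print("no /controller znode -- controller may be AWOL")
finally:
    zk.stop()
```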
  13. Lack of Tooling • Usually left to the reader • Few best practices • But we love Kafka Manager • More to come later!
  14. Newer Clients • State of Go/Python clients • Bad behavior at scale • Laserbeam, retries, backoff • Too many connections == OOM • Good clients take time
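The client problems above largely come down to retry and connection behavior. A hedged sketch of a better-behaved producer using the kafka-python client: bounded retries, explicit backoff, and batching so brokers see fewer, larger requests. Broker addresses, the topic name, and the specific values are illustrative, not Dropbox's settings.

```python
# Sketch of a producer that backs off instead of hammering the brokers.
# Broker addresses, topic, and values are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    acks=1,                  # leader ack only; "all" trades latency for safety
    retries=3,               # bounded retries instead of retrying forever
    retry_backoff_ms=500,    # back off between attempts
    linger_ms=50,            # batch for up to 50 ms before sending
    batch_size=64 * 1024,    # larger batches -> fewer requests per broker
)

producer.send("analytics_events", value=b'{"event": "example"}')
producer.flush()
producer.close()
```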
  15. Bad Configs • Many, many tunables -- lots of rope • Unclean leader election • Preferred leader automation • Disk threads (thanks Gwen!) • Little modern documentation on running at scale • Todd Palino helped us out early, though, so thank you!
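The talk doesn't publish its server.properties, but the tunables named above correspond to standard broker settings. An illustrative sketch follows; the property names are standard Kafka broker configs, but the values are examples only, not Dropbox's configuration.

```python
# Illustrative broker overrides for the tunables called out above.
# Standard Kafka property names; example values only.
broker_overrides = {
    # Never elect an out-of-sync replica as leader; protects against data loss.
    "unclean.leader.election.enable": "false",
    # Disable automatic preferred-leader rebalancing; move leaders deliberately.
    "auto.leader.rebalance.enable": "false",
    # Threads used to recover log segments per data directory on startup.
    "num.recovery.threads.per.data.dir": "4",
    # Replica fetcher threads; more parallelism when followers catch up.
    "num.replica.fetchers": "4",
}

# e.g. render as server.properties lines
print("\n".join(f"{k}={v}" for k, v in broker_overrides.items()))
```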
  16. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  17. Hardware • Hardware RAID 10 • ~25TB usable/box (spinning rust) • During broker replacement, 200ms p99 commit latency dropped to 10ms! • Failure tolerance, full disk protection • Canary cluster
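A canary cluster is only useful if something continuously exercises it. A minimal produce-then-consume round trip using kafka-python; the broker addresses and the canary topic name are placeholders.

```python
# Minimal produce/consume round trip against a canary cluster.
# Broker addresses and the canary topic are placeholders.
import uuid
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = ["canary-kafka1:9092"]
TOPIC = "canary"

token = uuid.uuid4().hex.encode()

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, acks="all")
producer.send(TOPIC, value=token)
producer.flush()
producer.close()

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,   # give up if nothing arrives in 10 s
)
found = any(msg.value == token for msg in consumer)
consumer.close()
print("canary OK" if found else "canary FAILED")
```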
  18. Monitoring • MPS vs QPS (metadata reqs!) • Bad Stuff graph • Disk utilization/latency • Heap usage • Number of controllers
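Disk utilization is the easiest of these checks to sketch with nothing but the standard library; the log-directory paths and the 85% alert threshold here are assumptions, not values from the talk.

```python
# Stdlib-only sketch of a disk-utilization check for broker log directories.
# Paths and threshold are assumptions.
import shutil

LOG_DIRS = ["/data/kafka1", "/data/kafka2"]   # placeholder mount points
THRESHOLD = 0.85

for path in LOG_DIRS:
    usage = shutil.disk_usage(path)
    used_frac = (usage.total - usage.free) / usage.total
    status = "ALERT" if used_frac >= THRESHOLD else "ok"
    print(f"{path}: {used_frac:.0%} used [{status}]")
```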
  19. Tooling • Rolling restarter (health checks!) • Rate limited partition rebalancer (MPS) • Config verifier/enforcer • Coordinated consumption (pre-0.9) • Auditing framework
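The key part of a rolling restarter is the health gate: don't touch the next broker while any partition is under-replicated. A sketch using the stock kafka-topics.sh tool's --under-replicated-partitions flag (ZooKeeper-based, as in the 0.8/0.9-era tooling described here); the tool path, ZooKeeper address, broker list, and restart command are all placeholders.

```python
# Sketch of a rolling restarter's health gate: wait for zero
# under-replicated partitions before restarting the next broker.
import subprocess
import time

KAFKA_TOPICS = "/opt/kafka/bin/kafka-topics.sh"   # placeholder path
ZOOKEEPER = "zk1:2181"                            # placeholder ensemble

def under_replicated_partitions() -> int:
    out = subprocess.run(
        [KAFKA_TOPICS, "--describe", "--under-replicated-partitions",
         "--zookeeper", ZOOKEEPER],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())

def wait_until_healthy(poll_seconds: int = 30) -> None:
    while (n := under_replicated_partitions()) > 0:
        print(f"{n} under-replicated partitions, waiting...")
        time.sleep(poll_seconds)

for broker in ["kafka1", "kafka2", "kafka3"]:   # restart one broker at a time
    wait_until_healthy()
    subprocess.run(["ssh", broker, "sudo", "systemctl", "restart", "kafka"],
                   check=True)
```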
  20. Customer Culture • Topics : organization :: partitions : scale • Do not hash to partitions • No ordering requirements • Namespaces and ownership are required
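The "do not hash to partitions" guidance, in code terms: send without a key and let the client's partitioner spread load across partitions, rather than pinning hot keys to single partitions and implying an ordering contract. Topic and broker addresses below are placeholders.

```python
# Illustrates the partitioning guidance above with kafka-python.
# Topic and broker addresses are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["kafka1:9092"])

# Preferred: no key. The partitioner spreads messages across partitions,
# so adding partitions scales throughput without hot spots.
producer.send("web_events", value=b'{"page": "/home"}')

# Discouraged at this scale: a key pins all of that key's traffic to one
# partition and implies an ordering guarantee the pipeline does not offer.
producer.send("web_events", key=b"user_42", value=b'{"page": "/home"}')

producer.flush()
producer.close()
```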
  21. Success! • Kafka goes fast (18M+ MPS on 20 brokers) • Multiple parallel consumption • Low latency (at high produce rates) • 0.9 is leaps ahead of 0.8.2 (upgrade!) • Supportable by a small team (at our scale)
  22. The Plan • Welcome • Use Case • Initial Design • Iterations of Woe • Current Setup • Future Plans
  23. The Future • Big is fun but has problems • Open source our tooling • Moving towards replication • Automatic up-partitioning and rebalancing • Expanding auditing to clients • Low volume latencies
  24. Deploying Kafka at Dropbox • Mark Smith <zorkian@dropbox.com> • Sean Fellows <fellows@dropbox.com> We would love to talk with other people who are running Kafka at similar scales. Email us! And... questions! (If we have time.)