SlideShare a Scribd company logo
1 of 49
Download to read offline
Cruise Control: Effortless Management of Kafka Clusters
Adem Efe Gencer
Senior Software Engineer
LinkedIn
Kafka: A Distributed Stream Processing Platform
: High throughput & low latency
: Message persistence on partitioned data
: Total ordering within each partition
2
Key Concepts: Brokers, Topics, Partitions, and Replicas
: Broker-0
1 2 11 21
: Broker-1 : Broker-2
: A Replica of Partition-1 of Blue Topic1
Kafka Cluster
3
Key Concepts: Leaders and Followers
: Broker-0
: The Leader Replica
1 2 11 21
: Broker-1 : Broker-2
1
: A Follower Replica1
4
Key Concepts: Producers
Producer-1 Producer-2
: Broker-0
1 2 11 21
: Broker-1 : Broker-2
5
Key Concepts: Consumers
Consumer-1 Consumer-2
: Broker-0
1 2 11 21
: Broker-1 : Broker-2
6
Key Concepts: Failover via Leadership Transfer
: Broker-0
2
: Broker-2
✗
1 211 1
7
Key Concepts: Failover via Leadership Transfer
: Broker-0
1 2 11 21
: Broker-2
✗
8
Kafka Incurs Management Overhead
: Large deployments – e.g. @ : 3K+ Brokers,
100K+ Topics, 5M Partitions, 5T Messages / day
: Frequent hardware failures
: Load skew among brokers
: Kafka cluster expansion and reduction
✗
“Elephant” (CC0): https://pixabay.com/en/elephant-safari-animal-defence-1421167, "Seesaw at Evelle" by Rachel Coleman (CC BY-SA 2.0): https://www.flickr.com/photos/rmc28/4862153119, “Inflatable Balloons” (Public Domain): https://commons.wikimedia.org/wiki/File:InflatableBalloons.jpg
9
Alleviating the Management Overhead
1
2
3
Admin Operations for Cluster Maintenance
Real-Time Monitoring of Kafka Clusters
Anomaly Detection with Self-Healing
10
Admin Operations for Cluster Maintenance1
: Dynamically balance the cluster load
: Add / remove brokers
: Demote brokers – i.e. remove leadership of all replicas
+-
: Trigger preferred leader election
: Fix offline replicas
11
Admin Operations for Cluster Maintenance1
: Dynamically balance the cluster load
: Add / remove brokers
: Demote brokers – i.e. remove leadership of all replicas
+-
: Trigger preferred leader election
: Fix offline replicas
12
Dynamically Balance the Cluster Load
: Never exceed the capacity of broker resources
– e.g. disk, CPU, network bandwidth
: Guarantee rack-aware distribution of replicas
Must satisfy hard goals, including:
: Enforce operational requirements – e.g. maximum
replica count per broker
13
Satisfy soft goals as much as possible – i.e. best effort
: Balance replica and leader distribution
: Balance potential outbound network load
: Balance distribution of partitions from the same topic
: Balance disk, CPU, inbound/outbound network
traffic utilization of brokers
✗
✗
Dynamically Balance the Cluster Load
14
Anomaly Detection with Self-Healing2
: Goal violation – rebalance cluster
: Broker failure – decommission broker(s)
: Metric anomaly – demote broker(s)
✗
15
: Disk failure – fix offline replica(s)
✗
Real-Time Monitoring of Kafka Clusters3
: Check the health of brokers, disks, and user tasks
: Examine the replica, leader, and load distribution
: Identify under-replicated, under-min-ISR, and offline
partitions
16
Building Blocks of Management: Moving Replicas
: Broker-0
1 2
1 2 1
: Broker-1
Replica Move
17
Building Blocks of Management: Moving Replicas
: Broker-0
1 2
1 2
: Broker-1
Replica Move
1
* Replica swap: Bidirectional reassignments of distinct partition replicas among brokers
Broader impact, but expensive
• Requires data transfer*
18
Building Blocks of Management: Moving Leadership
: Broker-0
12
1 2 1
: Broker-1
Leadership Move
19
Building Blocks of Management: Moving Leadership
: Broker-0
12
1 2 1
: Broker-1
Leadership Move
Cheap, but has limited impact
• Affects network bytes out and CPU
20
A Multi-Objective Optimization Problem
Achieve conflicting cluster management goals
while minimizing the impact of required
operations on user traffic
21
ARCHITECTURE
“Joy Oil Gas Station Blueprints” (Public Domain): https://commons.wikimedia.org/wiki/File:Joy_Oil_gas_station_blueprints.jpg
Cruise Control Architecture
REST API
Monitor
Executor
Kafka
Cluster
Metrics
Reporter
Sample
Store
Metric
Sampler
Goal(s)
Analyzer
Throttled
Proposal
Execution
Reported
Metrics
Backup and Recovery
Metrics
Reporter T.
Load
History T.
Broker
Failures
Pluggable
Component
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Internal Topic
Capacity Resolver
• Implements a public interface
• Accepts custom user code
• Created and used by Cruise
Control and its metrics reporter
23
Metrics Reporter
Kafka
Cluster
Metrics
Reporter
Metrics
Reporter T.
Load
History T.Produces selected Kafka cluster metrics to the
configured metrics reporter topic with the
configured frequency
24
Monitor
Monitor
Sample
Store
Metric
Sampler
Capacity Resolver
Generates a model ( ) to describe the cluster
25
Monitor: Cluster Model ( )
…
Monitoring windows
disk
cpu
nw-in
nw-out
time latest
utilization
: Load – current and historical utilization of brokers and replicas
: Topology – rack, host, and broker distribution
: Placement – replica, leadership, and partition distribution
26
Monitor: Metric Sampler
Monitor
Kafka
Cluster
Sample
Store
Metric
Sampler
Reported
Metrics
Metrics
Reporter T.
Capacity Resolver
• Periodically (e.g. every 5 min) consumes the reported
metrics to model the load on brokers and partitions
27
Monitor: Sample Store
Monitor
Kafka
Cluster
Sample
Store
Metric
Sampler
Capacity Resolver
•
• Produces broker and partition models to load history
topic, and uses the stored data to recover upon failure
Backup and Recovery Load
History T.
28
Monitor: Capacity Resolver
Monitor
Sample
Store
Metric
Sampler
Capacity Resolver
•
•
• Gathers the broker capacities from a pluggable resolver
29
Analyzer
Goal(s)
Analyzer
Generates proposals to achieve goals via a fast and near-optimal
heuristic solution
30
Analyzer: Goals
Goal(s)
Analyzer
Generates proposals to achieve goals via a fast and near-optimal
heuristic solution
: Priorities – custom order of optimization
: Strictness – hard (e.g. rack awareness) or soft (e.g. resource
utilization balance) optimization demands
: Modes – e.g. kafka-assigner (https://github.com/linkedin/kafka-tools)
31
Analyzer: Proposals
Goal(s)
Analyzer
Generates proposals to achieve goals via a fast and near-optimal
heuristic solution
+ =Goal(s)
Proposals – in order of priority:
• Leadership move > Replica move > Replica swap
32
Executor
Executor
Kafka
Cluster
Throttled
Proposal
Execution
Proposal execution:
• Dynamically controls the maximum number of
concurrent leadership / replica reassignments
• Ensures only one execution at a time
• Enables graceful cancellation of ongoing executions
• Integration with replication quotas (KIP-73)
33
Anomaly Detector
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Identifies, notifies, and fixes (self-healing):
• Violation of anomaly detection goals
• Broker failures
• Metric anomalies
• Disk failures (JBOD)
Violation of desired replication factor
34
: Faulty vs. Healthy Cluster
: Reactive vs. Proactive Mitigation
Anomaly Detector: Goal Violations and Self-Healing
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Checks for the violation of the anomaly
detection goals
• Identifies fixable and unfixable goal violations
• Self-healing triggers a cluster rebalance operation
• Avoids false positives due to broker failure, upgrade,
restart, or release certification
35
HealthyFaulty ProactiveReactive
Anomaly Detector: Broker Failures
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Kafka
Cluster
Broker
Failures
Concerned with whether brokers are responsive:
• Ignores the internal state deterioration of brokers
• Identifies fail-stop failures
36
HealthyFaulty ProactiveReactive
Anomaly Detector: Broker Failures and Self-Healing
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Kafka
Cluster
Broker
Failures
Checks for broker failures:
• Enables a grace period to lower false positives – e.g.
due to upgrade, restart, or release certification
• Self-healing triggers a remove operation for failed
brokers
37
HealthyFaulty ProactiveReactive
Anomaly Detector: Reactive Mitigation
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Requires immediate attention of affected services
Poor user experience due to frequent service
interruptions
Cluster maintenance becomes costly
~
Server & network failures
Size of clusters
Volume of user traffic
Hardware degradation
38
Anomaly Detector: Metric Anomaly
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Checks for abnormal changes in broker
metrics – e.g. a recent spike in log flush time:
• Self-healing triggers a demote operation for slow
brokers
39
HealthyFaulty ProactiveReactive
Anomaly Detector: Metric Anomaly
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
Compares current and historical metrics to
detect slow brokers:
• The comparison in the default finder is based on the
percentile rank of the latest metric value
• Metrics of interest are configurable – e.g. local time of
produce / consume / follower fetch, log flush time
• Supports multiple active finders
40
HealthyFaulty ProactiveReactive
Anomaly Detector: Proactive Mitigation
Anomaly Detector
Anomaly Notifier
Goal
Violation
Metric
Anomaly
Broker
Failure
Finder(s)
In-place fix of slow / faulty brokers is non-trivial
• The root cause could be a hardware issue (e.g. a
misbehaving disk), a software glitch, or a traffic shift
• Hence, the mitigation strategies are agnostic of the
particular issue with the broker
41
“Three-toed-sloth” (CC BY 2.5): https://en.wikipedia.org/wiki/File:MC_Drei-Finger-Faultier.jpg
REST API
REST API
Supports sync and async endpoints including:
GUI & multi-cluster management
• Cluster / Partition Load
• Proposals
• Kafka Cluster State
• Cruise Control State
• User Tasks
• Review Board
GET
POST
• Add / Remove / Demote Broker
• Rebalance Cluster
• Fix Offline Replicas (JBOD)
• Stop Ongoing Execution
• Pause / Resume Sampling
• Admin – ongoing behavior changes
42
Managing the Manager – Monitoring Cruise Control
Reported JMX metrics include:
: Broker failure, goal violation, and metric anomaly rate
: Cluster model and sampling performance
: Stats on proposal generation
: Started, stopped, and ongoing executions in different
modes, and the status of balancing tasks
Executor
Anomaly
Detector
Monitor
Analyzer
43
Evaluation: Remove Brokers and Rebalance
21 All Topics Bytes-In Rate [per broker]Incomingdata(bytes/s)
Time (hours)
44
Evaluation: Remove Brokers and Rebalance
21 All Topics Bytes-Out Rate [per broker]Outgoingdata(bytes/s)
Time (hours)
45
Evaluation: Remove Brokers and Rebalance
21 Number of Partitions [per broker]Partitioncount
Time (hours)
46
Summary
A system that provides effortless
management of Kafka clusters
Admin Operations for Cluster Maintenance
Anomaly Detection with Self-healing
Integration with Other Systems – e.g. Apache Helix
Real-Time Monitoring of Kafka Clusters
47
More…
: Open source repository
(https://github.com/linkedin/cruise-control)
: Gitter room (https://gitter.im/kafka-cruise-control)
48
: UI (https://github.com/linkedin/cruise-control-ui)
Cruise Control: Effortless management of Kafka clusters

More Related Content

What's hot

Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionJoel Koshy
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...HostedbyConfluent
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...confluent
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsBrendan Gregg
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...HostedbyConfluent
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsKetan Gote
 
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, Confluent
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, ConfluentKafka’s New Control Plane: The Quorum Controller | Colin McCabe, Confluent
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, ConfluentHostedbyConfluent
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak PerformanceTodd Palino
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...HostedbyConfluent
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 

What's hot (20)

Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
 
kafka
kafkakafka
kafka
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, Confluent
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, ConfluentKafka’s New Control Plane: The Quorum Controller | Colin McCabe, Confluent
Kafka’s New Control Plane: The Quorum Controller | Colin McCabe, Confluent
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 

Similar to Cruise Control: Effortless management of Kafka clusters

Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performanceconfluent
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdfTarekHamdi8
 
Tokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfTokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfssuser2ae721
 
Top Ten Kafka® Configs
Top Ten Kafka® ConfigsTop Ten Kafka® Configs
Top Ten Kafka® Configsconfluent
 
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support PerspectiveApache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support PerspectiveHostedbyConfluent
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking VN
 
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022HostedbyConfluent
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...confluent
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelDocker, Inc.
 
Cloud Messaging Service: Technical Overview
Cloud Messaging Service: Technical OverviewCloud Messaging Service: Technical Overview
Cloud Messaging Service: Technical OverviewMessaging Meetup
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitinbloomreacheng
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectKaufman Ng
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsLightbend
 
Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorStéphane Maldini
 
Why is My Stream Processing Job Slow? with Xavier Leaute
Why is My Stream Processing Job Slow? with Xavier LeauteWhy is My Stream Processing Job Slow? with Xavier Leaute
Why is My Stream Processing Job Slow? with Xavier LeauteDatabricks
 

Similar to Cruise Control: Effortless management of Kafka clusters (20)

Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a Pro
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
Large scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear passLarge scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear pass
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
 
Tokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfTokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdf
 
Top Ten Kafka® Configs
Top Ten Kafka® ConfigsTop Ten Kafka® Configs
Top Ten Kafka® Configs
 
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support PerspectiveApache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object Model
 
Cloud Messaging Service: Technical Overview
Cloud Messaging Service: Technical OverviewCloud Messaging Service: Technical Overview
Cloud Messaging Service: Technical Overview
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
 
Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and Reactor
 
Why is My Stream Processing Job Slow? with Xavier Leaute
Why is My Stream Processing Job Slow? with Xavier LeauteWhy is My Stream Processing Job Slow? with Xavier Leaute
Why is My Stream Processing Job Slow? with Xavier Leaute
 

Recently uploaded

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 

Recently uploaded (20)

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 

Cruise Control: Effortless management of Kafka clusters

  • 1. Cruise Control: Effortless Management of Kafka Clusters Adem Efe Gencer Senior Software Engineer LinkedIn
  • 2. Kafka: A Distributed Stream Processing Platform : High throughput & low latency : Message persistence on partitioned data : Total ordering within each partition 2
  • 3. Key Concepts: Brokers, Topics, Partitions, and Replicas : Broker-0 1 2 11 21 : Broker-1 : Broker-2 : A Replica of Partition-1 of Blue Topic1 Kafka Cluster 3
  • 4. Key Concepts: Leaders and Followers : Broker-0 : The Leader Replica 1 2 11 21 : Broker-1 : Broker-2 1 : A Follower Replica1 4
  • 5. Key Concepts: Producers Producer-1 Producer-2 : Broker-0 1 2 11 21 : Broker-1 : Broker-2 5
  • 6. Key Concepts: Consumers Consumer-1 Consumer-2 : Broker-0 1 2 11 21 : Broker-1 : Broker-2 6
  • 7. Key Concepts: Failover via Leadership Transfer : Broker-0 2 : Broker-2 ✗ 1 211 1 7
  • 8. Key Concepts: Failover via Leadership Transfer : Broker-0 1 2 11 21 : Broker-2 ✗ 8
  • 9. Kafka Incurs Management Overhead : Large deployments – e.g. @ : 3K+ Brokers, 100K+ Topics, 5M Partitions, 5T Messages / day : Frequent hardware failures : Load skew among brokers : Kafka cluster expansion and reduction ✗ “Elephant” (CC0): https://pixabay.com/en/elephant-safari-animal-defence-1421167, "Seesaw at Evelle" by Rachel Coleman (CC BY-SA 2.0): https://www.flickr.com/photos/rmc28/4862153119, “Inflatable Balloons” (Public Domain): https://commons.wikimedia.org/wiki/File:InflatableBalloons.jpg 9
  • 10. Alleviating the Management Overhead 1 2 3 Admin Operations for Cluster Maintenance Real-Time Monitoring of Kafka Clusters Anomaly Detection with Self-Healing 10
  • 11. Admin Operations for Cluster Maintenance1 : Dynamically balance the cluster load : Add / remove brokers : Demote brokers – i.e. remove leadership of all replicas +- : Trigger preferred leader election : Fix offline replicas 11
  • 12. Admin Operations for Cluster Maintenance1 : Dynamically balance the cluster load : Add / remove brokers : Demote brokers – i.e. remove leadership of all replicas +- : Trigger preferred leader election : Fix offline replicas 12
  • 13. Dynamically Balance the Cluster Load : Never exceed the capacity of broker resources – e.g. disk, CPU, network bandwidth : Guarantee rack-aware distribution of replicas Must satisfy hard goals, including: : Enforce operational requirements – e.g. maximum replica count per broker 13
  • 14. Satisfy soft goals as much as possible – i.e. best effort : Balance replica and leader distribution : Balance potential outbound network load : Balance distribution of partitions from the same topic : Balance disk, CPU, inbound/outbound network traffic utilization of brokers ✗ ✗ Dynamically Balance the Cluster Load 14
  • 15. Anomaly Detection with Self-Healing2 : Goal violation – rebalance cluster : Broker failure – decommission broker(s) : Metric anomaly – demote broker(s) ✗ 15 : Disk failure – fix offline replica(s) ✗
  • 16. Real-Time Monitoring of Kafka Clusters3 : Check the health of brokers, disks, and user tasks : Examine the replica, leader, and load distribution : Identify under-replicated, under-min-ISR, and offline partitions 16
  • 17. Building Blocks of Management: Moving Replicas : Broker-0 1 2 1 2 1 : Broker-1 Replica Move 17
  • 18. Building Blocks of Management: Moving Replicas : Broker-0 1 2 1 2 : Broker-1 Replica Move 1 * Replica swap: Bidirectional reassignments of distinct partition replicas among brokers Broader impact, but expensive • Requires data transfer* 18
  • 19. Building Blocks of Management: Moving Leadership : Broker-0 12 1 2 1 : Broker-1 Leadership Move 19
  • 20. Building Blocks of Management: Moving Leadership : Broker-0 12 1 2 1 : Broker-1 Leadership Move Cheap, but has limited impact • Affects network bytes out and CPU 20
  • 21. A Multi-Objective Optimization Problem Achieve conflicting cluster management goals while minimizing the impact of required operations on user traffic 21
  • 22. ARCHITECTURE “Joy Oil Gas Station Blueprints” (Public Domain): https://commons.wikimedia.org/wiki/File:Joy_Oil_gas_station_blueprints.jpg
  • 23. Cruise Control Architecture REST API Monitor Executor Kafka Cluster Metrics Reporter Sample Store Metric Sampler Goal(s) Analyzer Throttled Proposal Execution Reported Metrics Backup and Recovery Metrics Reporter T. Load History T. Broker Failures Pluggable Component Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Internal Topic Capacity Resolver • Implements a public interface • Accepts custom user code • Created and used by Cruise Control and its metrics reporter 23
  • 24. Metrics Reporter Kafka Cluster Metrics Reporter Metrics Reporter T. Load History T.Produces selected Kafka cluster metrics to the configured metrics reporter topic with the configured frequency 24
  • 26. Monitor: Cluster Model ( ) … Monitoring windows disk cpu nw-in nw-out time latest utilization : Load – current and historical utilization of brokers and replicas : Topology – rack, host, and broker distribution : Placement – replica, leadership, and partition distribution 26
  • 27. Monitor: Metric Sampler Monitor Kafka Cluster Sample Store Metric Sampler Reported Metrics Metrics Reporter T. Capacity Resolver • Periodically (e.g. every 5 min) consumes the reported metrics to model the load on brokers and partitions 27
  • 28. Monitor: Sample Store Monitor Kafka Cluster Sample Store Metric Sampler Capacity Resolver • • Produces broker and partition models to load history topic, and uses the stored data to recover upon failure Backup and Recovery Load History T. 28
  • 29. Monitor: Capacity Resolver Monitor Sample Store Metric Sampler Capacity Resolver • • • Gathers the broker capacities from a pluggable resolver 29
  • 30. Analyzer Goal(s) Analyzer Generates proposals to achieve goals via a fast and near-optimal heuristic solution 30
  • 31. Analyzer: Goals Goal(s) Analyzer Generates proposals to achieve goals via a fast and near-optimal heuristic solution : Priorities – custom order of optimization : Strictness – hard (e.g. rack awareness) or soft (e.g. resource utilization balance) optimization demands : Modes – e.g. kafka-assigner (https://github.com/linkedin/kafka-tools) 31
  • 32. Analyzer: Proposals Goal(s) Analyzer Generates proposals to achieve goals via a fast and near-optimal heuristic solution + =Goal(s) Proposals – in order of priority: • Leadership move > Replica move > Replica swap 32
  • 33. Executor Executor Kafka Cluster Throttled Proposal Execution Proposal execution: • Dynamically controls the maximum number of concurrent leadership / replica reassignments • Ensures only one execution at a time • Enables graceful cancellation of ongoing executions • Integration with replication quotas (KIP-73) 33
  • 34. Anomaly Detector Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Identifies, notifies, and fixes (self-healing): • Violation of anomaly detection goals • Broker failures • Metric anomalies • Disk failures (JBOD) Violation of desired replication factor 34 : Faulty vs. Healthy Cluster : Reactive vs. Proactive Mitigation
  • 35. Anomaly Detector: Goal Violations and Self-Healing Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Checks for the violation of the anomaly detection goals • Identifies fixable and unfixable goal violations • Self-healing triggers a cluster rebalance operation • Avoids false positives due to broker failure, upgrade, restart, or release certification 35 HealthyFaulty ProactiveReactive
  • 36. Anomaly Detector: Broker Failures Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Kafka Cluster Broker Failures Concerned with whether brokers are responsive: • Ignores the internal state deterioration of brokers • Identifies fail-stop failures 36 HealthyFaulty ProactiveReactive
  • 37. Anomaly Detector: Broker Failures and Self-Healing Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Kafka Cluster Broker Failures Checks for broker failures: • Enables a grace period to lower false positives – e.g. due to upgrade, restart, or release certification • Self-healing triggers a remove operation for failed brokers 37 HealthyFaulty ProactiveReactive
  • 38. Anomaly Detector: Reactive Mitigation Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Requires immediate attention of affected services Poor user experience due to frequent service interruptions Cluster maintenance becomes costly ~ Server & network failures Size of clusters Volume of user traffic Hardware degradation 38
  • 39. Anomaly Detector: Metric Anomaly Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Checks for abnormal changes in broker metrics – e.g. a recent spike in log flush time: • Self-healing triggers a demote operation for slow brokers 39 HealthyFaulty ProactiveReactive
  • 40. Anomaly Detector: Metric Anomaly Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) Compares current and historical metrics to detect slow brokers: • The comparison in the default finder is based on the percentile rank of the latest metric value • Metrics of interest are configurable – e.g. local time of produce / consume / follower fetch, log flush time • Supports multiple active finders 40 HealthyFaulty ProactiveReactive
  • 41. Anomaly Detector: Proactive Mitigation Anomaly Detector Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s) In-place fix of slow / faulty brokers is non-trivial • The root cause could be a hardware issue (e.g. a misbehaving disk), a software glitch, or a traffic shift • Hence, the mitigation strategies are agnostic of the particular issue with the broker 41 “Three-toed-sloth” (CC BY 2.5): https://en.wikipedia.org/wiki/File:MC_Drei-Finger-Faultier.jpg
  • 42. REST API REST API Supports sync and async endpoints including: GUI & multi-cluster management • Cluster / Partition Load • Proposals • Kafka Cluster State • Cruise Control State • User Tasks • Review Board GET POST • Add / Remove / Demote Broker • Rebalance Cluster • Fix Offline Replicas (JBOD) • Stop Ongoing Execution • Pause / Resume Sampling • Admin – ongoing behavior changes 42
  • 43. Managing the Manager – Monitoring Cruise Control Reported JMX metrics include: : Broker failure, goal violation, and metric anomaly rate : Cluster model and sampling performance : Stats on proposal generation : Started, stopped, and ongoing executions in different modes, and the status of balancing tasks Executor Anomaly Detector Monitor Analyzer 43
  • 44. Evaluation: Remove Brokers and Rebalance 21 All Topics Bytes-In Rate [per broker]Incomingdata(bytes/s) Time (hours) 44
  • 45. Evaluation: Remove Brokers and Rebalance 21 All Topics Bytes-Out Rate [per broker]Outgoingdata(bytes/s) Time (hours) 45
  • 46. Evaluation: Remove Brokers and Rebalance 21 Number of Partitions [per broker]Partitioncount Time (hours) 46
  • 47. Summary A system that provides effortless management of Kafka clusters Admin Operations for Cluster Maintenance Anomaly Detection with Self-healing Integration with Other Systems – e.g. Apache Helix Real-Time Monitoring of Kafka Clusters 47
  • 48. More… : Open source repository (https://github.com/linkedin/cruise-control) : Gitter room (https://gitter.im/kafka-cruise-control) 48 : UI (https://github.com/linkedin/cruise-control-ui)