Tips and Tricks for Apache Kafka™ Ops
June 2020
Jen Snipes, Senior Customer Success Architect
https://www.linkedin.com/in/jennifersnipes/
Kat Grigg, Senior Customer Success Architect @katgrigg
Agenda
● What is Apache Kafka
● Kafka Operations
○ Configuring
○ Deploying
○ Maintaining
○ Monitoring
○ Debugging
What is Apache Kafka?
Open Source
Real time Event Streaming Platform
Scalable, Distributed Log
Fault tolerant storage
Moving Data
Producers
Consumers
Connect
Streams
Log Structured Data Flow
Kafka Operations
5 Categories of Ops
Configuring
Deploying
Maintaining
Monitoring
Debugging
Configuring Kafka
Data Retention
● Disk space available?
● SLA on consuming?
● Keyed data? Cardinality?
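The retention questions above tie together: the consumer SLA sets the minimum retention window, and the window multiplied by ingest rate and replication factor sets the disk you need. A rough sizing sketch follows; the function name, its parameters, and the headroom factor are illustrative assumptions, not figures from the talk.

```python
def required_disk_gb(ingest_mb_per_sec, retention_hours,
                     replication_factor=3, headroom=1.3):
    """Rough cluster-wide disk estimate for time-based retention.

    All inputs are assumptions to replace with measured values; the
    headroom factor is an arbitrary safety margin, not a Kafka setting.
    """
    seconds = retention_hours * 3600
    raw_mb = ingest_mb_per_sec * seconds          # one copy of the data
    replicated_mb = raw_mb * replication_factor   # every replica stores a full copy
    return replicated_mb * headroom / 1024        # MB -> GB


# Example: 10 MB/s ingest, 7 days of retention, replication factor 3
print(round(required_disk_gb(10, 7 * 24)))
```

For compacted (keyed) topics the size tends to follow key cardinality rather than time, so this estimate would not apply there.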
Security
● Start with plaintext
● SSL and SASL? Choose one first
● Console clients are great for testing
Optimization
● Throughput
○ increase partitions, batch size, linger.ms
○ use acks 1
○ compression
○ parallelize consumers
● Latency
○ no compression
○ balance number of partitions
○ increase fetcher threads
○ producer linger.ms=0, consumer fetch.min.bytes=1
● Durability
○ acks = all, RF = 3, minISR = 2
○ leverage retries
○ unclean leader election set to false
○ disable auto commit
● Availability
○ minISR 1
○ unclean leader election true
○ increase recovery threads
○ low consumer session.timeout
○ optimize number of partitions
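The four goals above pull in different directions, so it can help to write them down as separate profiles and pick one per workload. The sketch below maps the bullets onto real Kafka property names; the numeric values (linger, batch size, thread counts, timeout) are assumed starting points, not recommendations from the talk.

```python
# Illustrative starting points only. Property names are Kafka's; values that
# the slide does not state (e.g. linger.ms=20, batch.size) are assumptions.
PROFILES = {
    "throughput": {
        "producer": {"acks": "1", "linger.ms": "20",
                     "batch.size": "131072", "compression.type": "lz4"},
    },
    "latency": {
        "producer": {"linger.ms": "0", "compression.type": "none"},
        "consumer": {"fetch.min.bytes": "1"},
        "broker": {"num.replica.fetchers": "4"},
    },
    "durability": {
        "producer": {"acks": "all", "retries": "2147483647"},
        "consumer": {"enable.auto.commit": "false"},
        "broker": {"default.replication.factor": "3",
                   "min.insync.replicas": "2",
                   "unclean.leader.election.enable": "false"},
    },
    "availability": {
        "consumer": {"session.timeout.ms": "6000"},
        "broker": {"min.insync.replicas": "1",
                   "unclean.leader.election.enable": "true",
                   "num.recovery.threads.per.data.dir": "4"},
    },
}


def to_properties(profile, role):
    """Render one role's settings for a profile as .properties lines."""
    settings = PROFILES[profile].get(role, {})
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(to_properties("durability", "broker"))
```

Note that durability and availability set `min.insync.replicas` and `unclean.leader.election.enable` to opposite values, which is why one cluster-wide default rarely fits every topic.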
Deploying Kafka
Start Easy
● How do you run your database(s)?
● Do you know Docker well?
● Do you know Kubernetes well?
● If no, then don’t use them
● Cloud an option?
One does not simply start Brokers
● JMX access (metrics reporter)
● Environment Variables
● Logging capacity
● Don’t forget ZooKeeper
● File descriptors (ulimit)
● Disk deployment style
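For the "File descriptors (ulimit)" item in this checklist, a pre-flight check can be as small as the sketch below. It is Unix-only, and the 100000 target is an assumed starting point; size it from your own partition and connection counts.

```python
import resource


def fd_limit_ok(soft_limit, required):
    """True if the soft open-file limit covers `required` descriptors."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


# 100000 is an assumed target, not a value from the talk.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} ok={fd_limit_ok(soft, 100_000)}")
```

Run it as the same user, and from the same service manager, that starts the broker, since limits often differ between an interactive shell and a service.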
Maintaining Kafka
Updates and Upgrades
Rolling upgrade Tips
● Gwen Shapira’s Kafka Summit Talk 2019
● Set protocols and message formats
● Be explicit
● Go slow: roll brokers too fast and ISRs shrink
● Upgrade often, point releases in prod
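"Set protocols and message formats" and "be explicit" can be sketched as three rolling-restart passes, each with both versions pinned in the broker config. This follows the rolling upgrade pattern documented for ZooKeeper-era releases as I understand it; the version strings are placeholders, so check the upgrade notes for your actual release.

```python
def upgrade_phases(current, target):
    """Return the broker settings to pin at each rolling-restart pass.

    `current` and `target` are version strings such as "2.4"; the values
    used below are placeholders, not a recommendation.
    """
    return [
        # Pass 1: new binaries, but keep speaking the old protocol and format.
        {"inter.broker.protocol.version": current,
         "log.message.format.version": current},
        # Pass 2: once every broker runs the new code, bump the protocol.
        {"inter.broker.protocol.version": target,
         "log.message.format.version": current},
        # Pass 3: once clients are upgraded, bump the message format.
        {"inter.broker.protocol.version": target,
         "log.message.format.version": target},
    ]


for number, phase in enumerate(upgrade_phases("2.4", "2.5"), start=1):
    print(f"pass {number}: {phase}")
```

Each pass is a full rolling restart, one broker at a time, waiting for under-replicated partitions to return to zero before moving on.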
Rebalancing
● Lose/Add Broker
● Throttle early, adjust later
● Throttles don’t affect in sync replicas
● Not sure? Batch the reassign
● See Monitoring
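"Batch the reassign" can be sketched as splitting one large plan into several small JSON files and running them one at a time. The JSON layout is the one `kafka-reassign-partitions` reads via `--reassignment-json-file`; the topic name, broker ids, and batch size are hypothetical.

```python
import json


def batch_reassignments(partitions, batch_size):
    """Split a list of partition reassignments into smaller JSON plans."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    plans = []
    for start in range(0, len(partitions), batch_size):
        chunk = partitions[start:start + batch_size]
        plans.append(json.dumps({"version": 1, "partitions": chunk}))
    return plans


# Hypothetical topic and broker ids.
moves = [{"topic": "orders", "partition": p, "replicas": [1, 2, 4]}
         for p in range(5)]
for plan in batch_reassignments(moves, 2):
    print(plan)
```

Each plan would be executed with `--execute` and a `--throttle`, then checked with `--verify` before the next batch starts; verifying also clears the throttle once the moves are complete.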
Updating Configs
Brokers
● Lots of dynamic config options!
● Sync config across the cluster
Topics
● Overrides are your friend
● Topic level unclean leader election
Monitoring Kafka
What do you watch?
Kafka Thread Model
● Monitor all of these hops with JMX
● Transparent way to see activity
● Especially watch request queue size
● Don’t blindly bump thread count
Log Cleaner / Others
● Watch your log cleaner with JMX
● Bytes In/Out just makes sense
● Don’t need to dashboard everything
● Alert don’t automate (at least at first)
Broker Metrics
Top 5 Broker JMX metrics you should be watching
● UnderReplicatedPartitions
● RequestHandlerAvgIdlePercent
● NetworkProcessorAvgIdlePercent
● RequestQueueSize
● TotalTimeMs
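A minimal sketch of "alert, don't automate" for these five metrics: map each short name to its broker MBean and turn one scrape into a list of alert messages. The thresholds are assumptions to tune per cluster, and how the values are scraped (JMX exporter, agent, and so on) is left out.

```python
# MBean object names as documented for Apache Kafka brokers.
BROKER_MBEANS = {
    "UnderReplicatedPartitions":
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "RequestHandlerAvgIdlePercent":
        "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
    "NetworkProcessorAvgIdlePercent":
        "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent",
    "RequestQueueSize":
        "kafka.network:type=RequestChannel,name=RequestQueueSize",
    "TotalTimeMs":
        "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
}


def evaluate(samples, min_idle=0.3, max_queue=100):
    """Return alert strings for one scrape.

    `samples` maps the short metric name to its current value. The idle
    metrics are fractions between 0 and 1. Thresholds are assumed defaults.
    """
    alerts = []
    if samples.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append("under-replicated partitions present")
    for name in ("RequestHandlerAvgIdlePercent", "NetworkProcessorAvgIdlePercent"):
        if samples.get(name, 1.0) < min_idle:
            alerts.append(f"{name} below {min_idle}")
    if samples.get("RequestQueueSize", 0) > max_queue:
        alerts.append("request queue backing up")
    return alerts


print(evaluate({"UnderReplicatedPartitions": 3,
                "RequestHandlerAvgIdlePercent": 0.12,
                "NetworkProcessorAvgIdlePercent": 0.8,
                "RequestQueueSize": 5}))
```

`TotalTimeMs` is listed per request type and is usually tracked as a percentile trend rather than a fixed threshold, so it is mapped here but not alerted on.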
Clients / Consumers
● batch sizes
● failed-authentication-rate
● io-ratio
● io-wait-ratio
● io-wait-time-ns-avg
Debugging Kafka
When there’s trouble...
Logs about the Log
● Debugging? Head to the logs!
● Defaults separate files reasonably
● Splunk, ELK, etc.? Filter by appender
● Metadata files are important too
● controller.log -- written to by active controller
● state-change.log -- updates received from and sent to controller
● server.log -- basic broker operations (segment rolls, replica fetching, etc.)
● log-cleaner.log -- cleaner thread (related to compaction)
● kafka-authorizer.log -- secure connections only (set to DEBUG if you want successful connections)
● kafka-request.log -- super verbose, will talk about more
● /var/lib/recovery-point-offset-checkpoint – last offset flushed to disk
● /var/lib/cleaner-offset-checkpoint – offset up to which the cleaner has cleaned (compacted topics only)
● /var/lib/replication-offset-checkpoint – last committed offset
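The checkpoint metadata files are small text files, so they are easy to inspect during debugging. The parser below assumes the layout I have seen in practice (a version line, an entry-count line, then one `topic partition offset` line per entry); treat that layout as an assumption and confirm it against a real file before relying on it. The sample contents are made up.

```python
def parse_checkpoint(text):
    """Parse checkpoint file contents into (version, {(topic, partition): offset}).

    Assumed layout: version line, entry-count line, then one
    "topic partition offset" line per entry.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    version, expected = int(lines[0]), int(lines[1])
    entries = {}
    for line in lines[2:]:
        # Split from the right so the last two fields are always the numbers.
        topic, partition, offset = line.rsplit(" ", 2)
        entries[(topic, int(partition))] = int(offset)
    if len(entries) != expected:
        raise ValueError(f"expected {expected} entries, found {len(entries)}")
    return version, entries


sample = "0\n2\norders 0 1500\norders 1 1420\n"  # made-up contents
print(parse_checkpoint(sample))
```

Comparing the recovery-point and replication checkpoints for the same partition shows how far the flushed offset trails the committed one. Read these files only; never edit them on a running broker.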
Request logging
● kafka-request.log
● All information about all requests
● Verbose, not great for live debugging
● Use when no other options for root cause
● Rogue clients, single request slowdowns, etc.
Controller Issues
● Kafka Controller Deep Dive
● Kafka Needs No Keeper
● ISR updates aren’t happening
● Replication stopped for no reason
● Broker refusing to become leader for extended period
● rmr /controller
The Offsets Topic
● __consumer_offsets → Stores offsets and group metadata
● bin/kafka-run-class kafka.tools.DumpLogSegments --offsets-decoder
● offsets.retention.minutes default 1 day (7 days on 2.0)
● Growing huge? Check the log cleaner health
● High Memory Usage on Brokers? Offsets are cached
● High CPU Usage on Brokers? Make sure this topic isn’t huge
● Please use replication factor 3
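Before dumping `__consumer_offsets` segments it helps to know which partition holds a given consumer group. To my understanding the broker uses the non-negative Java `hashCode` of the group id modulo `offsets.topic.num.partitions` (default 50); the sketch below reproduces that arithmetic under that assumption, with hypothetical function names.

```python
def java_string_hashcode(s):
    """Java's String.hashCode() for a str, as a signed 32-bit int."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id, num_partitions=50):
    """Partition of __consumer_offsets assumed to hold this group.

    Assumes the broker maps Integer.MIN_VALUE to 0 and otherwise takes
    the absolute value of the hash before the modulo.
    """
    h = java_string_hashcode(group_id)
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % num_partitions


print(offsets_partition_for("a"))
```

With the partition number in hand, point `DumpLogSegments` with `--offsets-decoder` at the segment files in that partition's directory instead of scanning all of them.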
Restarting? Think about it...
● Restarts in distributed systems are tricky
● Have a reason
● Understand startup time
● Clean shutdowns, limited scope
References
• https://www.confluent.io/resources/kafka-the-definitive-guide/
• https://www.confluent.io/kafka-summit-san-francisco-2019/please-upgrade-apache-kafka-now
• https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/68957
• https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-needs-no-keeper
• https://events.confluent.io/meetups
Thank You!
