3. KAFKA@GS
What If? This presentation is boring
• Topology Strategy
• What If? Hosts / Network / DC Failures
• Deployment & Monitoring Strategy
• DevOps Tooling (Dashboards / Emails / Timeseries)
4. KAFKA@GS
DEPLOYMENT USAGE PATTERNS
• Most-used clusters serve ~1.5 TB a week to consumers
• However, message count is relatively low: on the order of millions per week; avg. several hundred per second
• At peak periods:
• ~1,500 messages produced/second
• ~2.5 MB produced/second
• ~12.5 MB consumed/second
5. KAFKA@GS
DEPLOYMENT GOALS
• No data loss, even in case of a DC outage
• No primary/back-up notion
• No “failover”
• Minimize outage scenarios
• Single Logical Cluster
7. KAFKA@GS
EXAMPLE Single Topic Setup (partition assignment)
• 1 topic
• 3 partitions (p1-p3)
• Replication factor of 4
• min.insync.replicas = 3
• Ensure replicas are spread evenly between DCs
• Cross-DC latency is low!
[Diagram: replicas of p1-p3 assigned across hosts H1/H3/H5 in Datacenter A and H2/H4/H6 in Datacenter B]
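The even-spread rule can be sketched as a small assignment function. This is an illustrative sketch only (not the GS tooling, whose assignment logic is not shown here); host and partition names follow the example topology.

```python
from itertools import cycle

def assign_replicas(partitions, dc_hosts, replication_factor=4):
    """Spread each partition's replicas evenly across datacenters.

    dc_hosts: dict mapping DC name -> list of hosts in that DC.
    Returns a dict partition -> list of hosts; with an even replication
    factor and two DCs, half of each partition's replicas land in each DC.
    """
    per_dc = replication_factor // len(dc_hosts)
    # Round-robin within each DC so load is balanced across hosts.
    cyclers = {dc: cycle(hosts) for dc, hosts in dc_hosts.items()}
    assignment = {}
    for p in partitions:
        replicas = []
        for dc in dc_hosts:
            replicas.extend(next(cyclers[dc]) for _ in range(per_dc))
        assignment[p] = replicas
    return assignment

topology = {
    "A": ["H1", "H3", "H5"],
    "B": ["H2", "H4", "H6"],
}
plan = assign_replicas(["p1", "p2", "p3"], topology)
for p, hosts in plan.items():
    print(p, hosts)
```

With 4 replicas and min.insync.replicas = 3, any single host loss still leaves 3 in-sync replicas, which is the point of the layout.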
8. KAFKA@GS
What If? A host fails
• 1-5 times / year
• No impact on producers/consumers (still able to satisfy 3 ISRs)
• No manual recovery beyond replacing the host
[Diagram: the two-DC topology (H1-H6) with a single failed host]
9. KAFKA@GS
What If? Two hosts fail
• ~1/year, depending on where the hosts are (e.g. a bad hypervisor)
• Processing for some topics will be halted
• Short-term: add replicas for the affected partitions on the remaining hosts
• ASAP: replace the bad hosts
• GS Dynamic Compute allows seamless VM replacement with no need to re-point DNS aliases or change Kafka config
[Diagram: the two-DC topology with two failed hosts]
10. KAFKA@GS
What If? Two hosts fail (continued)
[Diagram: the cluster after replicas for the affected partitions have been added to the remaining hosts]
11. KAFKA@GS
What If? Three hosts fail
• ~1 every few years
• Cluster processing halted, as the in-sync replica requirement cannot be satisfied
• Proceed with immediate host replacement
[Diagram: the two-DC topology with three failed hosts]
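The halt condition can be stated directly: a partition accepts writes only while enough of its replicas survive to meet min.insync.replicas. A minimal sketch (illustrative, not Kafka's internal logic; it assumes all surviving replicas are in sync):

```python
def writable_partitions(assignment, failed_hosts, min_insync_replicas=3):
    """Return the partitions that can still satisfy min.insync.replicas
    after the given hosts have failed."""
    failed = set(failed_hosts)
    return {
        p for p, replicas in assignment.items()
        if len([h for h in replicas if h not in failed]) >= min_insync_replicas
    }

# Replication factor 4 across H1-H6, two replicas per DC as in the diagrams.
assignment = {
    "p1": ["H1", "H3", "H2", "H4"],
    "p2": ["H5", "H1", "H6", "H2"],
    "p3": ["H3", "H5", "H4", "H6"],
}
print(writable_partitions(assignment, ["H1"]))              # one host down: all partitions still writable
print(writable_partitions(assignment, ["H1", "H3", "H5"]))  # three hosts down: nothing writable
```

This is why one host failure is a non-event, two can halt some topics, and three halt the cluster.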
12. KAFKA@GS
What If? Datacenter Fails / Network Partition
• A once-in-20-years event
• Short-term strategy: add additional machines in Datacenter A
• Largest impact on recovery time is how long it takes to provision new hosts
[Diagram: the two-DC topology with Datacenter B unreachable]
14. KAFKA@GS
DEPLOYMENT STRATEGY
What’s running on each VM
• Standard processes: Kafka Broker, Zookeeper, REST Service, Metrics Capture
• Optional extras: Kafka Connect, MirrorMaker (cross-region), Schema Registry
• Components are a mix of the Apache Kafka distribution, the Confluent distribution with patches, and GS-developed code
15. KAFKA@GS
DEPLOYMENT STRATEGY
• Cluster configuration is defined in code (Slang, but could be JSON etc.)
• Required files (broker .properties, ZK id files etc.) are generated from the config
• All processes are generated from the same config
• This is where optional extras, e.g. Kafka Connect, can be added
• The job-generation output is loaded into the GS process-management system [Procmon], which executes the jobs
• Several advantages:
• Config is incorporated into the GS SDLC: code reviews, VCS storage, audit etc.
• Regression tests can be written against job generation, easily catching unintentional changes
• Trivial to spin up new clusters, or modify existing ones, and deploy quickly in a controlled manner
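The generation step might look like the following Python sketch. The rendering function and config keys on the GS side are assumptions (the real implementation is Slang), but the emitted property names are standard Kafka broker settings:

```python
def broker_properties(cluster, broker):
    """Render one broker's .properties file from a cluster-config entry.
    The config shape here is illustrative; real configs carry many more settings."""
    props = {
        "broker.id": broker["id"],
        "broker.rack": broker["datacenter"],          # rack-awareness maps to the DC
        "zookeeper.connect": ",".join(cluster["zookeepers"]),
        "default.replication.factor": cluster["replication_factor"],
        "min.insync.replicas": cluster["min_insync_replicas"],
        "unclean.leader.election.enable": "false",    # supports the no-data-loss goal
    }
    return "\n".join(f"{k}={v}" for k, v in props.items())

cluster = {
    "zookeepers": ["H1:2181", "H2:2181", "H3:2181"],
    "replication_factor": 4,
    "min_insync_replicas": 3,
}
print(broker_properties(cluster, {"id": 1, "datacenter": "A"}))
```

Because the output is a pure function of the config, regression tests can assert on the rendered text and catch unintentional changes before deployment.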
16. KAFKA@GS
TOOLING OVERVIEW
• We have written a lot of code to help manage clusters
• Control and oversight of topic creation and ongoing management
• Bespoke healthchecks
• Monitoring website & APIs
18. KAFKA@GS
TOOLING TOPIC MANAGEMENT
• Wanted a controlled way for clients to add/configure their topics
• Changes should be easily reviewable, with history stored for audit
• Use a Slang configuration per cluster
• Integrated with code review & VCS
• Users can add topics and configure overrides
• Includes sanity checking
• An automated synchronization job takes the released change and updates the cluster
• Also used to mark topic ownership and alerting contacts
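The sanity-checking step could be sketched like this; all field names and rules here are hypothetical stand-ins for whatever the Slang config actually defines:

```python
def check_topic(topic, cluster_hosts=6):
    """Sanity-check one topic entry before the sync job applies it.
    Rules shown are examples, not the actual GS checks."""
    errors = []
    if not topic.get("owner"):
        errors.append("topic must name an owning team for alerting")
    rf = topic.get("replication_factor", 4)
    if rf > cluster_hosts:
        errors.append(f"replication factor {rf} exceeds cluster size {cluster_hosts}")
    if topic.get("min_insync_replicas", 3) > rf:
        errors.append("min.insync.replicas cannot exceed the replication factor")
    return errors

# Hypothetical per-cluster topic config with ownership metadata.
topics = {
    "trades": {"owner": "fx-team", "partitions": 3,
               "replication_factor": 4, "min_insync_replicas": 3,
               "alert_contacts": ["fx-oncall@example.com"]},
    "orphan": {"partitions": 1},
}
for name, cfg in topics.items():
    print(name, check_topic(cfg))
```

Running the checks at review time means a bad entry is rejected before the synchronization job ever touches the cluster.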
20. KAFKA@GS
TOOLING HEALTHCHECK
• Topic sizes are monitored frequently against thresholds defined in config
• Conceived to alert teams when they may be at risk of losing data due to retention truncation
• If partitions on a topic breach the threshold, owners are notified via GS alerting infrastructure
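The check itself reduces to comparing observed partition sizes against the configured threshold (a sketch with made-up sizes; the real notification path is the GS alerting infrastructure):

```python
def breached_partitions(partition_sizes, threshold_bytes):
    """Return partitions whose on-disk size exceeds the alert threshold,
    i.e. partitions at risk of losing unconsumed data to retention truncation."""
    return {p: size for p, size in partition_sizes.items() if size > threshold_bytes}

sizes = {
    "trades-0": 9_500_000_000,
    "trades-1": 2_000_000_000,
    "trades-2": 11_000_000_000,
}
alerts = breached_partitions(sizes, threshold_bytes=10_000_000_000)  # hypothetical 10 GB threshold
print(alerts)  # only trades-2 is over the threshold
```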
21. KAFKA@GS
TOOLING HEALTHCHECK
• Daily summary of cluster usage
• Combines data in the cluster with metadata defined in config
• Highlights unowned topics, topics near size thresholds etc.
22. KAFKA@GS
TOOLING CLUSTER DASHBOARD
• Website available with each cluster we deploy
• Provides cluster- and topic-level info and stats
• Top-level healthcheck
23. KAFKA@GS
TOOLING CLUSTER DASHBOARD
Endpoints include:
• View messages on a topic
• Topic config
• Consumer lag
• Leader & ISRs for a topic
• High watermark for a topic
• Broker & zookeeper configuration
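Consumer lag, for example, is just the gap between each partition's high watermark and the group's committed offset. A sketch (the offset values are made up; a real dashboard would fetch them from the cluster):

```python
def consumer_lag(high_watermarks, committed_offsets):
    """Per-partition lag: messages produced but not yet consumed by the group.
    A partition with no committed offset is treated as fully unconsumed."""
    return {
        p: high_watermarks[p] - committed_offsets.get(p, 0)
        for p in high_watermarks
    }

hw = {"trades-0": 1_200, "trades-1": 950, "trades-2": 400}
committed = {"trades-0": 1_200, "trades-1": 700}  # no commit yet on trades-2
print(consumer_lag(hw, committed))  # {'trades-0': 0, 'trades-1': 250, 'trades-2': 400}
```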
24. KAFKA@GS
TOOLING METRICS IN PULSE
• Metrics logged into GS Pulse
• Raw data accessible via a RESTful service
• Out-of-the-box UI (Grafana-based)
25. KAFKA@GS
SUMMARY
• Failure will occur; tooling is key
• Belt & suspenders for everything
• Kafka has many knobs, perhaps too many; hide some
• A year-plus burn-in period to gain trust
• Never a golden source (yet…)