3. KAFKA@GS
What If? This presentation is boring
• Topology Strategy
• What If? Hosts / Network / DC Failures
• Deployment & Monitoring Strategy
• DevOps Tooling (Dashboards / Emails / Timeseries)
4. KAFKA@GS
DEPLOYMENT USAGE PATTERNS
• Most-used clusters serve ~1.5 TB a week to consumers
• However, message count is relatively low: on the order of millions per week; avg. several hundred per second
• At peak periods:
• ~1,500 messages produced/second
• ~2.5 MB produced/second
• ~12.5 MB consumed/second
5. KAFKA@GS
DEPLOYMENT GOALS
• No data loss, even in case of a DC outage
• No primary/back-up notion
• No “failover”
• Minimize outage scenarios
• Single Logical Cluster
7. KAFKA@GS
EXAMPLE Single Topic Setup (partition assignment)
• 1 topic
• 3 partitions (p1-p3)
• Replication factor of 4
• min.insync.replicas = 3
• Ensure replicas are spread evenly between DCs
• Cross-DC latency is low!
[Diagram: replicas of p1-p3 assigned across hosts H1/H3/H5 in Datacenter A and H2/H4/H6 in Datacenter B]
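The even-spread rule can be sketched as a small assignment function. This is an illustrative sketch only (not the GS tooling, whose assignment logic is not shown here); host and partition names follow the example topology.

```python
from itertools import cycle

def assign_replicas(partitions, dc_hosts, replication_factor=4):
    """Spread each partition's replicas evenly across datacenters.

    dc_hosts: dict mapping DC name -> list of hosts in that DC.
    Returns a dict partition -> list of hosts; with an even replication
    factor and two DCs, half of each partition's replicas land in each DC.
    """
    per_dc = replication_factor // len(dc_hosts)
    # Round-robin within each DC so load is balanced across hosts.
    cyclers = {dc: cycle(hosts) for dc, hosts in dc_hosts.items()}
    assignment = {}
    for p in partitions:
        replicas = []
        for dc in dc_hosts:
            replicas.extend(next(cyclers[dc]) for _ in range(per_dc))
        assignment[p] = replicas
    return assignment

topology = {
    "A": ["H1", "H3", "H5"],
    "B": ["H2", "H4", "H6"],
}
plan = assign_replicas(["p1", "p2", "p3"], topology)
for p, hosts in plan.items():
    print(p, hosts)
```

With 4 replicas and min.insync.replicas = 3, any single host loss still leaves 3 in-sync replicas, which is the point of the layout.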
8. KAFKA@GS
What If? A host fails
• 1-5 times / year
• No impact on producers/consumers (still able to satisfy 3 ISRs)
• No manual recovery beyond replacing the host
[Diagram: the two-DC topology (H1-H6) with a single failed host]
9. KAFKA@GS
What If? Two hosts fail
• ~1/year, depending on where the hosts are (e.g. a bad hypervisor)
• Processing for some topics will be halted
• Short-term: add replicas for the affected partitions on the remaining hosts
• ASAP: replace the bad hosts
• GS Dynamic Compute allows seamless VM replacement with no need to re-point DNS aliases or change Kafka config
[Diagram: the two-DC topology with two failed hosts]
10. KAFKA@GS
What If? Two hosts fail (continued)
[Diagram: the cluster after replicas for the affected partitions have been added to the remaining hosts]
11. KAFKA@GS
What If? Three hosts fail
• ~1 every few years
• Cluster processing halted, as the in-sync replica requirement cannot be satisfied
• Proceed with immediate host replacement
[Diagram: the two-DC topology with three failed hosts]
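The halt condition can be stated directly: a partition accepts writes only while enough of its replicas survive to meet min.insync.replicas. A minimal sketch (illustrative, not Kafka's internal logic; it assumes all surviving replicas are in sync):

```python
def writable_partitions(assignment, failed_hosts, min_insync_replicas=3):
    """Return the partitions that can still satisfy min.insync.replicas
    after the given hosts have failed."""
    failed = set(failed_hosts)
    return {
        p for p, replicas in assignment.items()
        if len([h for h in replicas if h not in failed]) >= min_insync_replicas
    }

# Replication factor 4 across H1-H6, two replicas per DC as in the diagrams.
assignment = {
    "p1": ["H1", "H3", "H2", "H4"],
    "p2": ["H5", "H1", "H6", "H2"],
    "p3": ["H3", "H5", "H4", "H6"],
}
print(writable_partitions(assignment, ["H1"]))              # one host down: all partitions still writable
print(writable_partitions(assignment, ["H1", "H3", "H5"]))  # three hosts down: nothing writable
```

This is why one host failure is a non-event, two can halt some topics, and three halt the cluster.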
12. KAFKA@GS
What If? Datacenter Fails / Network Partition
• A once-in-20-years event
• Short-term strategy: add additional machines in Datacenter A
• Largest impact on recovery time is how long it takes to provision new hosts
[Diagram: the two-DC topology with Datacenter B unreachable]
14. KAFKA@GS
DEPLOYMENT STRATEGY
What’s running on each VM
• Standard processes: Kafka Broker, Zookeeper, REST Service, Metrics Capture
• Optional extras: Kafka Connect, MirrorMaker (cross-region), Schema Registry
• Components are a mix of the Apache Kafka distribution, the Confluent distribution with patches, and GS-developed code
15. KAFKA@GS
DEPLOYMENT STRATEGY
• Cluster configuration is defined in code (Slang, but could be JSON etc.)
• Required files (broker .properties, ZK id files etc.) are generated from the config
• All processes are generated from the same config
• This is where optional extras, e.g. Kafka Connect, can be added
• The job-generation output is loaded into the GS process-management system [Procmon], which executes the jobs
• Several advantages:
• Config is incorporated into the GS SDLC: code reviews, VCS storage, audit etc.
• Regression tests can be written against job generation, easily catching unintentional changes
• Trivial to spin up new clusters, or modify existing ones, and deploy quickly in a controlled manner
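The generation step might look like the following Python sketch. The rendering function and config keys on the GS side are assumptions (the real implementation is Slang), but the emitted property names are standard Kafka broker settings:

```python
def broker_properties(cluster, broker):
    """Render one broker's .properties file from a cluster-config entry.
    The config shape here is illustrative; real configs carry many more settings."""
    props = {
        "broker.id": broker["id"],
        "broker.rack": broker["datacenter"],          # rack-awareness maps to the DC
        "zookeeper.connect": ",".join(cluster["zookeepers"]),
        "default.replication.factor": cluster["replication_factor"],
        "min.insync.replicas": cluster["min_insync_replicas"],
        "unclean.leader.election.enable": "false",    # supports the no-data-loss goal
    }
    return "\n".join(f"{k}={v}" for k, v in props.items())

cluster = {
    "zookeepers": ["H1:2181", "H2:2181", "H3:2181"],
    "replication_factor": 4,
    "min_insync_replicas": 3,
}
print(broker_properties(cluster, {"id": 1, "datacenter": "A"}))
```

Because the output is a pure function of the config, regression tests can assert on the rendered text and catch unintentional changes before deployment.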
16. KAFKA@GS
TOOLING OVERVIEW
• We have written a lot of code to help manage clusters
• Control and oversight of topic creation and ongoing management
• Bespoke healthchecks
• Monitoring website & APIs
18. KAFKA@GS
TOOLING TOPIC MANAGEMENT
• Wanted a controlled way for clients to add/configure their topics
• Changes should be easily reviewable, with history stored for audit
• Use a Slang configuration per cluster
• Integrated with code review & VCS
• Users can add topics and configure overrides
• Includes sanity checking
• An automated synchronization job takes the released change and updates the cluster
• Also used to mark topic ownership and alerting contacts
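The sanity-checking step could be sketched like this; all field names and rules here are hypothetical stand-ins for whatever the Slang config actually defines:

```python
def check_topic(topic, cluster_hosts=6):
    """Sanity-check one topic entry before the sync job applies it.
    Rules shown are examples, not the actual GS checks."""
    errors = []
    if not topic.get("owner"):
        errors.append("topic must name an owning team for alerting")
    rf = topic.get("replication_factor", 4)
    if rf > cluster_hosts:
        errors.append(f"replication factor {rf} exceeds cluster size {cluster_hosts}")
    if topic.get("min_insync_replicas", 3) > rf:
        errors.append("min.insync.replicas cannot exceed the replication factor")
    return errors

# Hypothetical per-cluster topic config with ownership metadata.
topics = {
    "trades": {"owner": "fx-team", "partitions": 3,
               "replication_factor": 4, "min_insync_replicas": 3,
               "alert_contacts": ["fx-oncall@example.com"]},
    "orphan": {"partitions": 1},
}
for name, cfg in topics.items():
    print(name, check_topic(cfg))
```

Running the checks at review time means a bad entry is rejected before the synchronization job ever touches the cluster.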
20. KAFKA@GS
TOOLING HEALTHCHECK
• Topic sizes are monitored frequently against thresholds defined in config
• Conceived to alert teams when they may be at risk of losing data due to retention truncation
• If partitions on a topic breach the threshold, owners are notified via GS alerting infrastructure
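The check itself reduces to comparing observed partition sizes against the configured threshold (a sketch with made-up sizes; the real notification path is the GS alerting infrastructure):

```python
def breached_partitions(partition_sizes, threshold_bytes):
    """Return partitions whose on-disk size exceeds the alert threshold,
    i.e. partitions at risk of losing unconsumed data to retention truncation."""
    return {p: size for p, size in partition_sizes.items() if size > threshold_bytes}

sizes = {
    "trades-0": 9_500_000_000,
    "trades-1": 2_000_000_000,
    "trades-2": 11_000_000_000,
}
alerts = breached_partitions(sizes, threshold_bytes=10_000_000_000)  # hypothetical 10 GB threshold
print(alerts)  # only trades-2 is over the threshold
```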
21. KAFKA@GS
TOOLING HEALTHCHECK
• Daily summary of cluster usage
• Combines data in the cluster with metadata defined in config
• Highlights unowned topics, topics near size thresholds etc.
22. KAFKA@GS
TOOLING CLUSTER DASHBOARD
• Website available with each cluster we deploy
• Provides cluster- and topic-level info and stats
• Top-level healthcheck
23. KAFKA@GS
TOOLING CLUSTER DASHBOARD
Endpoints include:
• View messages on a topic
• Topic config
• Consumer lag
• Leader & ISRs for a topic
• High watermark for a topic
• Broker & zookeeper configuration
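Consumer lag, for example, is just the gap between each partition's high watermark and the group's committed offset. A sketch (the offset values are made up; a real dashboard would fetch them from the cluster):

```python
def consumer_lag(high_watermarks, committed_offsets):
    """Per-partition lag: messages produced but not yet consumed by the group.
    A partition with no committed offset is treated as fully unconsumed."""
    return {
        p: high_watermarks[p] - committed_offsets.get(p, 0)
        for p in high_watermarks
    }

hw = {"trades-0": 1_200, "trades-1": 950, "trades-2": 400}
committed = {"trades-0": 1_200, "trades-1": 700}  # no commit yet on trades-2
print(consumer_lag(hw, committed))  # {'trades-0': 0, 'trades-1': 250, 'trades-2': 400}
```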
24. KAFKA@GS
TOOLING METRICS IN PULSE
• Metrics logged into GS Pulse
• Raw data accessible via a RESTful service
• Out-of-the-box UI (Grafana-based)
25. KAFKA@GS
SUMMARY
• Failure will occur; tooling is key
• Belt & suspenders for everything
• Kafka has many knobs, perhaps too many; hide some
• A year-plus burn-in period to gain trust
• Never a golden source (yet…)