SlideShare une entreprise Scribd logo
1  sur  35
1
One Data Center is Not Enough
Scale and Availability of Apache Kafka in Multiple Data Centers
@gwenshap
2
3
Bad Things
• Kafka cluster failure
• Major storage / network outage
• Entire DC is demolished
• Floods and Earthquakes
4
Disaster Recovery Plan:
“When in trouble
or in doubt
run in circles,
scream and shout”
5
Disaster Recovery Plan:
When This Happens Do That
Kafka cluster failure Failover to a second cluster in same data center
Major storage / network Outage Failover to a second cluster in another “zone” in
same building
Entire data-center is demolished Single Kafka cluster running in multiple near-by
data-centers / buildings.
Flood and Earthquakes Failover to a second cluster in another region
6
There is no such thing
as a free lunch
Anyone who tells you differently
is selling something.
7
Reality:
The same event will not
appear in two DCs at the
exact same time.
8
Things to ask:
• What are the guarantees in an event of unplanned failover?
• What are the guarantees in an event of planned failover?
• Does the product actually guarantee what you think?
• What is the process for failing back?
• What is required to implement this solution?
• How does the solution impact my production performance?
9
Every solution needs to
balance these trade offs
Kafka takes DIY approach
10
The inherent complexity of multi data-center replication
There is a diversity of approaches
And diversity of problems
Kafka gives you the flexibility and tools to work
And we’ll give you an example and inspire you to build your own
List tradeoffs here
Here are things to watch out for:
How to do your homework
Tweet me J
11
Stretch Cluster
The easy way
• Take 3 nearby data centers.
• Single digit ms latency is good
• Install at least 1 Zookeeper in each
• Install at least one Kafka broker in each
• Configure each DC as a “rack”
• Configure acks=all, min.isr=2
• Enjoy
12
Diagram!
13
Pros
• Easy to set up
• Failover is “business as usual”
• Sync replication – only method to guarantee
no loss of data.
Cons
• Need 3 data centers nearby
• Cluster failure is still a disaster
• Higher latency, lower throughput compared
to “normal” cluster
• Traffic between DCs can be bottleneck
• Costly infrastructure
14
Want sync replication but
only two data centers?
15
Solution I hesistate because…
2 ZK nodes in each DC and “observer”
somewhere else.
Did anyone do this before?
3 ZK nodes in each DC and manually
reconfigure quorum for failover
• You may lose ZK updates during
failover
• Requires manual intervention2 separate ZK cluster + replication
Solutions I can’t recommend:
16
Most companies don’t do stretch.
Because:
• Only 2 data centers
• Data centers are far
• One cluster isn’t safe enough
• Not into “high latency”
17
So you want to run
2 Kafka clusters
And replicate
events
between them?
18
Basic async replication
19
Replication Lag
20
Demo #1
Monitoring Replication Lag
21
22
Active-Active or
Active-Passive?
• Active-Active is efficient
you use both DCs
• Active-Active is easier
because both clusters are
equivalent
• Active-Passive has lower
network traffic
• Active-Passive requires less
monitoring
23
Active-Active Setup
24
Disaster Strikes
25
Desired Post-Disaster State
26
Only one question left:
What does it consume next?
27
Kafka
consumers
normally use
offsets
28
In an ideal world…
29
Unfortunately, this is not that simple
1. There is no guarantee that offsets are identical in the two data centers.
Event with offset 26 in NYC can be offset 6 or offset 30 in ATL.
2. Replication of each topic and partition is independent. So..
1. Offset metadata may arrive ahead of events themselves
2. Offset metadata may arrive late
Nothing prevents you from replicating offsets topic and using it. Just be realistic
about the guarantees.
30
If accuracy is no big-deal…
1. If duplicates are cool – start from the beginning.
Use Cases:
• Writing to a DB
• Anything idempotent
• Sending emails or alerts to people inside the company
2. If lost events are cool – jump to the latest event.
Use Cases:
• Clickstream analytics
• Log analytics
• “Big data” and analytics use-cases
31
Personal Favorite – Time-based Failover
• Offsets are not identical, but…
3pm is 3pm (within clock drift)
• Relies on new features:
• Timestamps in events! 0.10.0.0
• Time-based indexes! 0.10.1.0
• Force consumer to timestamps tool! 0.11.0.0
32
How we do it?
1. Detect Kafka in NYC is down. Check the time of the incident.
• Even better:
Use an interceptor to track timestamps of events as they are consumed.
Now you know “last consumed time-stamp”
2. Run Consumer Groups tool in ATL and set the offsets for “following-orders”
consumer to time of incident (or “last consumed time”)
3. Start the ”following-orders” consumer in ATL
4. Have a beer. You just aced your annual failover drill.
33
bin/kafka-consumer-groups
--bootstrap-server localhost:29092
--reset-offsets
--topic NYC.orders
--group following-orders
--execute
--to-datetime 2017-08-22T06:00:33.236
34
Few practicalities
• Above all – practice
• Constantly monitor replication lag. High enough lag and everything is useless.
• Also monitor replicator for liveness, errors, etc.
• Chances are the line to the remote DC is both high latency and low throughput.
Prepare to do some work to tune the producers/consumers of the replicator.
• RTFM: http://docs.confluent.io/3.3.0/multi-dc/replicator-tuning.html
• Replicator plays nice with containers and auto-scale. Give it a try.
• Call your legal dept. You may be required to encrypt everything you replicate.
• Watch different versions of this talk. We discuss more architectures and more ops concerns.
35
Thank You!

Contenu connexe

Tendances

Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
confluent
 

Tendances (20)

Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Reliability Guarantees for Apache Kafka
Reliability Guarantees for Apache KafkaReliability Guarantees for Apache Kafka
Reliability Guarantees for Apache Kafka
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Kafka Summit SF 2017 - Running Kafka as a Service at Scale
Kafka Summit SF 2017 - Running Kafka as a Service at ScaleKafka Summit SF 2017 - Running Kafka as a Service at Scale
Kafka Summit SF 2017 - Running Kafka as a Service at Scale
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
 
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
 
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSPutting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
 
101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 

Similaire à Kafka Summit SF 2017 - One Data Center is Not Enough: Scaling Apache Kafka Across Multiple Data Centers

Building a Smarter Application Stack
Building a Smarter Application StackBuilding a Smarter Application Stack
Building a Smarter Application Stack
Docker, Inc.
 

Similaire à Kafka Summit SF 2017 - One Data Center is Not Enough: Scaling Apache Kafka Across Multiple Data Centers (20)

Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
The Highs and Lows of Stateful Containers
The Highs and Lows of Stateful ContainersThe Highs and Lows of Stateful Containers
The Highs and Lows of Stateful Containers
 
Building a smarter application stack - service discovery and wiring for Docker
Building a smarter application stack - service discovery and wiring for DockerBuilding a smarter application stack - service discovery and wiring for Docker
Building a smarter application stack - service discovery and wiring for Docker
 
Building a smarter application Stack by Tomas Doran from Yelp
Building a smarter application Stack by Tomas Doran from YelpBuilding a smarter application Stack by Tomas Doran from Yelp
Building a smarter application Stack by Tomas Doran from Yelp
 
Building a Smarter Application Stack
Building a Smarter Application StackBuilding a Smarter Application Stack
Building a Smarter Application Stack
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream Processing
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
How to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALABHow to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALAB
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
 
Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
 

Plus de confluent

Plus de confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Dernier

CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 

Dernier (20)

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 

Kafka Summit SF 2017 - One Data Center is Not Enough: Scaling Apache Kafka Across Multiple Data Centers

  • 1. 1 One Data Center is Not Enough Scale and Availability of Apache Kafka in Multiple Data Centers @gwenshap
  • 2. 2
  • 3. 3 Bad Things • Kafka cluster failure • Major storage / network outage • Entire DC is demolished • Floods and Earthquakes
  • 4. 4 Disaster Recovery Plan: “When in trouble or in doubt run in circles, scream and shout”
  • 5. 5 Disaster Recovery Plan: When This Happens Do That Kafka cluster failure Failover to a second cluster in same data center Major storage / network Outage Failover to a second cluster in another “zone” in same building Entire data-center is demolished Single Kafka cluster running in multiple near-by data-centers / buildings. Flood and Earthquakes Failover to a second cluster in another region
  • 6. 6 There is no such thing as a free lunch Anyone who tells you differently is selling something.
  • 7. 7 Reality: The same event will not appear in two DCs at the exact same time.
  • 8. 8 Things to ask: • What are the guarantees in an event of unplanned failover? • What are the guarantees in an event of planned failover? • Does the product actually guarantee what you think? • What is the process for failing back? • What is required to implement this solution? • How does the solution impact my production performance?
  • 9. 9 Every solution needs to balance these trade offs Kafka takes DIY approach
  • 10. 10 The inherent complexity of multi data-center replication There is a diversity of approaches And diversity of problems Kafka gives you the flexibility and tools to work And we’ll give you an example and inspire you to build your own List tradeoffs here Here are things to watch out for: How to do your homework Tweet me J
  • 11. 11 Stretch Cluster The easy way • Take 3 nearby data centers. • Single digit ms latency is good • Install at least 1 Zookeeper in each • Install at least one Kafka broker in each • Configure each DC as a “rack” • Configure acks=all, min.isr=2 • Enjoy
  • 13. 13 Pros • Easy to set up • Failover is “business as usual” • Sync replication – only method to guarantee no loss of data. Cons • Need 3 data centers nearby • Cluster failure is still a disaster • Higher latency, lower throughput compared to “normal” cluster • Traffic between DCs can be bottleneck • Costly infrastructure
  • 14. 14 Want sync replication but only two data centers?
  • 15. 15 Solution I hesistate because… 2 ZK nodes in each DC and “observer” somewhere else. Did anyone do this before? 3 ZK nodes in each DC and manually reconfigure quorum for failover • You may lose ZK updates during failover • Requires manual intervention2 separate ZK cluster + replication Solutions I can’t recommend:
  • 16. 16 Most companies don’t do stretch. Because: • Only 2 data centers • Data centers are far • One cluster isn’t safe enough • Not into “high latency”
  • 17. 17 So you want to run 2 Kafka clusters And replicate events between them?
  • 21. 21
  • 22. 22 Active-Active or Active-Passive? • Active-Active is efficient you use both DCs • Active-Active is easier because both clusters are equivalent • Active-Passive has lower network traffic • Active-Passive requires less monitoring
  • 26. 26 Only one question left: What does it consume next?
  • 28. 28 In an ideal world…
  • 29. 29 Unfortunately, this is not that simple 1. There is no guarantee that offsets are identical in the two data centers. Event with offset 26 in NYC can be offset 6 or offset 30 in ATL. 2. Replication of each topic and partition is independent. So.. 1. Offset metadata may arrive ahead of events themselves 2. Offset metadata may arrive late Nothing prevents you from replicating offsets topic and using it. Just be realistic about the guarantees.
  • 30. 30 If accuracy is no big-deal… 1. If duplicates are cool – start from the beginning. Use Cases: • Writing to a DB • Anything idempotent • Sending emails or alerts to people inside the company 2. If lost events are cool – jump to the latest event. Use Cases: • Clickstream analytics • Log analytics • “Big data” and analytics use-cases
  • 31. 31 Personal Favorite – Time-based Failover • Offsets are not identical, but… 3pm is 3pm (within clock drift) • Relies on new features: • Timestamps in events! 0.10.0.0 • Time-based indexes! 0.10.1.0 • Force consumer to timestamps tool! 0.11.0.0
  • 32. 32 How we do it? 1. Detect Kafka in NYC is down. Check the time of the incident. • Even better: Use an interceptor to track timestamps of events as they are consumed. Now you know “last consumed time-stamp” 2. Run Consumer Groups tool in ATL and set the offsets for “following-orders” consumer to time of incident (or “last consumed time”) 3. Start the ”following-orders” consumer in ATL 4. Have a beer. You just aced your annual failover drill.
  • 34. 34 Few practicalities • Above all – practice • Constantly monitor replication lag. High enough lag and everything is useless. • Also monitor replicator for liveness, errors, etc. • Chances are the line to the remote DC is both high latency and low throughput. Prepare to do some work to tune the producers/consumers of the replicator. • RTFM: http://docs.confluent.io/3.3.0/multi-dc/replicator-tuning.html • Replicator plays nice with containers and auto-scale. Give it a try. • Call your legal dept. You may be required to encrypt everything you replicate. • Watch different versions of this talk. We discuss more architectures and more ops concerns.