SlideShare a Scribd company logo
1 of 39
How Netflix Leverages Multiple Regions to Increase
Availability: Isthmus and Active-Active Case Study
Ruslan Meshenberg
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Failure
Assumptions
Everything Is Broken

Hardware Will Fail
•
•

•
•

Telcos
Scale

Enterprise IT
•
•

Rapid Change
Large Scale

•
•

Slowly Changing
Large Scale

Rapid Change
Small Scale

Web Scale
Startups

Slowly Changing
Small Scale

Everything Works

Software Will Fail
Speed
Incidents – Impact and Mitigation
Public relations
media impact
High customer
service calls
Affects AB
test results

PR
X Incidents

Y incidents mitigated by Active
Active, game day practicing

CS
XX Incidents
Metrics Impact – Feature Disable
XXX Incidents

YY incidents
mitigated by better
tools and practices
YYY incidents
mitigated by better
data tagging

No Impact – Fast Retry or Automated Failover
XXXX Incidents
Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
Does a Zone Fail?
• Rarely, but happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
Does a Region Fail?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely, a region-wide configuration issue

• Test with Chaos Kong
Everything Fails… Eventually
• The good news is you can do something about it
• Keep your services running by embracing
isolation and redundancy
Cloud Native
A New Engineering Challenge
Construct a highly agile and highly
available service from ephemeral and
assumed broken components
Isolation
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not
affect functionality / operations
Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across
Availability Zones and regions
History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process that
was inadvertently run against the production ELB
state data”
• “The ELB issue affected less than 7% of running
ELBs and prevented other ELBs from scaling.”
Isthmus – Normal Operation

US-East ELB

US-West-2 ELB

ELB
Zuul
Infrastructure*

Zone A

Tunnel

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Isthmus – Failover

US-East ELB

US-West-2 ELB

ELB
Zuul
Infrastructure*

Zone A

Tunnel

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Isthmus
Zuul – Overview

Elastic Load Balancing

Elastic Load Balancing

Elastic Load Balancing
Zuul – Details
Denominator – Abstracting the DNS Layer

Amazon
Route 53

DynECT
DNS

UltraDNS

Denominator

Regional Load Balancers

ELBs
Zone A
Cassandra
Replicas

Zone B
Cassandra
Replicas

Zone C
Cassandra
Replicas

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Isthmus – Only for Elastic Load Balancing Failures

• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
Active-Active – Full Regional Resiliency

Regional Load Balancers

Regional Load Balancers

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Active-Active – Failover

Regional Load Balancers

Regional Load Balancers

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Active-Active Architecture
Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
Highly Available NoSQL Storage

A highly scalable, available, and
durable deployment pattern based on
Apache Cassandra
Benchmarking Global Cassandra
Write intensive test of cross-region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total
192 TB of SSD in six locations up and running Cassandra in 20 minutes
Test
Load

Validation
Load

1 Million Reads
after 500 ms
CL.ONE with No
Data Loss

US-West-2 Region - Oregon

Test
Load

1 Million Writes
CL.ONE (Wait for
One Replica to ack)

US-East-1 Region - Virginia

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Interzone Traffic

Interregion Traffic
Up to 9Gbits/s, 83ms

18 TB backups
from S3
Propagating EVCache Invalidations
4. Calls Writer with Key, Write Time, TTL & Value after checking if this is the latest event for the key in the current batch.
Goes cross-region through ELB over HTTPS

Drainer
Writer

7. Return keys
that were successful

3. Read
from SQS
in batches

Drainer
Writer

EVCache Replication
Service

EVCache Replication
Service

6. Deletes the value for the
key
EVCache Replication
Metadata

5. Checks write time to ensure
this is the latest operation for
the key

US-WEST-2

EVCAC
EVCAC
EVCACHE
HE
HE

8. Delete keys from SQS that were
successful
1. Set data in
EVCACHE

EVCACH
EVCACH
EE
EVCACHE

EVCache Replication
Metadata

2. Write events to SQS and
EVCACHE_REGION_REPLICATION

SQS
EVCache
Client

App Server

US-EAST-1
Archaius – Region-isolated Configuration
Running Isthmus and Active-Active
Multiregional Monkeys
• Detect failure to deploy
• Differences in configuration
• Resource differences
Multiregional Monitoring and Alerting
• Per region metrics
• Global aggregation
• Anomaly detection
Failover and Fallback
• DNS (denominator) changes
• For fallback, ensure data consistency
• Some challenges
– Cold cache
– Autoscaling

• Automate, automate, automate
Validating the Whole Thing Works
Dev-Ops in N Regions
• Best practices: avoiding peak times for
deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery
Hyperscale Architecture

Amazon
Route 53

DynECT
DNS

UltraDNS

DNS
Automation

Regional Load Balancers

Regional Load Balancers

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas
Does It Work?
Building Blocks Available on Netflix Github site
Topic

Session #

When

What an Enterprise Can Learn from Netflix, a Cloud-native Company

ENT203

Thursday, Nov 14, 4:15 PM - 5:15 PM

Maximizing Audience Engagement in Media Delivery

MED303

Thursday, Nov 14, 4:15 PM - 5:15 PM

Scaling your Analytics with Amazon Elastic MapReduce

BDT301

Thursday, Nov 14, 4:15 PM - 5:15 PM

Automated Media Workflows in the Cloud

MED304

Thursday, Nov 14, 5:30 PM - 6:30 PM

Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce
for Monitoring at Gigascale

BDT302

Thursday, Nov 14, 5:30 PM - 6:30 PM

Encryption and Key Management in AWS

SEC304

Friday, Nov 15, 9:00 AM - 10:00 AM

Your Linux AMI: Optimization and Performance

CPN302

Friday, Nov 15, 11:30 AM - 12:30 PM
Takeaways
Embrace isolation and redundancy for availability
NetflixOSS helps everyone to become cloud native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
@rusmeshenberg @NetflixOSS
We are sincerely eager to hear
your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form
when you have a chance.

More Related Content

What's hot

The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQL
Datadog
 

What's hot (19)

CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflix
 
1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS
 
EVCache at Netflix
EVCache at NetflixEVCache at Netflix
EVCache at Netflix
 
ENT101 Embracing the Cloud - AWS re: Invent 2012
ENT101 Embracing the Cloud - AWS re: Invent 2012ENT101 Embracing the Cloud - AWS re: Invent 2012
ENT101 Embracing the Cloud - AWS re: Invent 2012
 
Building and running cloud native cassandra
Building and running cloud native cassandraBuilding and running cloud native cassandra
Building and running cloud native cassandra
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talks
 
Dynomite @ Redis Conference 2016
Dynomite @ Redis Conference 2016Dynomite @ Redis Conference 2016
Dynomite @ Redis Conference 2016
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
RedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech StackRedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech Stack
 
Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015
Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015
Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015
 
Fact-Based Monitoring - PuppetConf 2014
Fact-Based Monitoring - PuppetConf 2014Fact-Based Monitoring - PuppetConf 2014
Fact-Based Monitoring - PuppetConf 2014
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQL
 

Viewers also liked

Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
DataWorks Summit
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
Yahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
Yahoo Developer Network
 

Viewers also liked (20)

Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
 
Recsys 2014 Keynote: The Value of Better Recommendations - For Businesses, Co...
Recsys 2014 Keynote: The Value of Better Recommendations - For Businesses, Co...Recsys 2014 Keynote: The Value of Better Recommendations - For Businesses, Co...
Recsys 2014 Keynote: The Value of Better Recommendations - For Businesses, Co...
 
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
 
Java Microservices with Netflix OSS & Spring
Java Microservices with Netflix OSS & Spring Java Microservices with Netflix OSS & Spring
Java Microservices with Netflix OSS & Spring
 
Microservices with Netflix OSS and Spring Cloud - Dev Day Orange
Microservices with Netflix OSS and Spring Cloud -  Dev Day OrangeMicroservices with Netflix OSS and Spring Cloud -  Dev Day Orange
Microservices with Netflix OSS and Spring Cloud - Dev Day Orange
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
 
What an Enterprise Can Learn from Netflix, a Cloud-native Company (ENT203) | ...
What an Enterprise Can Learn from Netflix, a Cloud-native Company (ENT203) | ...What an Enterprise Can Learn from Netflix, a Cloud-native Company (ENT203) | ...
What an Enterprise Can Learn from Netflix, a Cloud-native Company (ENT203) | ...
 
Scalable Microservices at Netflix. Challenges and Tools of the Trade
Scalable Microservices at Netflix. Challenges and Tools of the TradeScalable Microservices at Netflix. Challenges and Tools of the Trade
Scalable Microservices at Netflix. Challenges and Tools of the Trade
 
Application Networks: Microservices and APIs at Netflix
Application Networks: Microservices and APIs at NetflixApplication Networks: Microservices and APIs at Netflix
Application Networks: Microservices and APIs at Netflix
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
Netflix oss season 1 episode 3
Netflix oss season 1 episode 3 Netflix oss season 1 episode 3
Netflix oss season 1 episode 3
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSSCloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Encryption and key management in AWS (SEC304) | AWS re:Invent 2013
Encryption and key management in AWS (SEC304) | AWS re:Invent 2013Encryption and key management in AWS (SEC304) | AWS re:Invent 2013
Encryption and key management in AWS (SEC304) | AWS re:Invent 2013
 

Similar to Arc305 how netflix leverages multiple regions to increase availability an isthmus and active-active case study

M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
Edward Capriolo
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
~Eric Principe
 

Similar to Arc305 how netflix leverages multiple regions to increase availability an isthmus and active-active case study (20)

Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
 
(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:...
(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:...(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:...
(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:...
 
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
 
Embracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixEmbracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at Netflix
 
#NetflixEverywhere Global Architecture
#NetflixEverywhere Global Architecture#NetflixEverywhere Global Architecture
#NetflixEverywhere Global Architecture
 
AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Reg...
AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Reg...AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Reg...
AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Reg...
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
AWS re:Invent 2013 Recap
AWS re:Invent 2013 RecapAWS re:Invent 2013 Recap
AWS re:Invent 2013 Recap
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Cassandra Summit 2014: Active-Active Cassandra Behind the Scenes
Cassandra Summit 2014: Active-Active Cassandra Behind the ScenesCassandra Summit 2014: Active-Active Cassandra Behind the Scenes
Cassandra Summit 2014: Active-Active Cassandra Behind the Scenes
 
(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWS(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWS
 
Cosmos db
Cosmos dbCosmos db
Cosmos db
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud Architecture
 
Applications in the Cloud
Applications in the CloudApplications in the Cloud
Applications in the Cloud
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
 
osi-oss-dbs.pptx
osi-oss-dbs.pptxosi-oss-dbs.pptx
osi-oss-dbs.pptx
 
Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20
 

More from Ruslan Meshenberg

Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
NetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmapNetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmap
Ruslan Meshenberg
 

More from Ruslan Meshenberg (10)

NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2
 
Netflix oss past-present-future
Netflix oss   past-present-futureNetflix oss   past-present-future
Netflix oss past-present-future
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Dev309 from asgard to zuul - netflix oss-final
Dev309  from asgard to zuul - netflix oss-finalDev309  from asgard to zuul - netflix oss-final
Dev309 from asgard to zuul - netflix oss-final
 
NetflixOSS season 2 episode 2 - Reactive / Async
NetflixOSS   season 2 episode 2 - Reactive / AsyncNetflixOSS   season 2 episode 2 - Reactive / Async
NetflixOSS season 2 episode 2 - Reactive / Async
 
OSS Think Tank - NetflixOSS - OSS as a Competitive Differentiator
OSS Think Tank - NetflixOSS - OSS as a Competitive DifferentiatorOSS Think Tank - NetflixOSS - OSS as a Competitive Differentiator
OSS Think Tank - NetflixOSS - OSS as a Competitive Differentiator
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
 
NetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmapNetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmap
 
re:Invent 2012 Optimizing Cassandra
re:Invent 2012 Optimizing Cassandrare:Invent 2012 Optimizing Cassandra
re:Invent 2012 Optimizing Cassandra
 
The Netflix Open Source Platform
The Netflix Open Source PlatformThe Netflix Open Source Platform
The Netflix Open Source Platform
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Arc305 how netflix leverages multiple regions to increase availability an isthmus and active-active case study

  • 1. How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study Ruslan Meshenberg November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 3. Assumptions Everything Is Broken Hardware Will Fail • • • • Telcos Scale Enterprise IT • • Rapid Change Large Scale • • Slowly Changing Large Scale Rapid Change Small Scale Web Scale Startups Slowly Changing Small Scale Everything Works Software Will Fail Speed
  • 4. Incidents – Impact and Mitigation Public relations media impact High customer service calls Affects AB test results PR X Incidents Y incidents mitigated by Active Active, game day practicing CS XX Incidents Metrics Impact – Feature Disable XXX Incidents YY incidents mitigated by better tools and practices YYY incidents mitigated by better data tagging No Impact – Fast Retry or Automated Failover XXXX Incidents
  • 5. Does an Instance Fail? • It can, plan for it • Bad code / configuration pushes • Latent issues • Hardware failure • Test with Chaos Monkey
  • 6. Does a Zone Fail? • Rarely, but happened before • Routing issues • DC-specific issues • App-specific issues within a zone • Test with Chaos Gorilla
  • 7. Does a Region Fail? • Full region – unlikely, very rare • Individual Services can fail region-wide • Most likely, a region-wide configuration issue • Test with Chaos Kong
  • 8. Everything Fails… Eventually • The good news is you can do something about it • Keep your services running by embracing isolation and redundancy
  • 9. Cloud Native A New Engineering Challenge Construct a highly agile and highly available service from ephemeral and assumed broken components
  • 10. Isolation • Changes in one region should not affect others • Regional outage should not affect others • Network partitioning between regions should not affect functionality / operations
  • 11. Redundancy • Make more than one (of pretty much everything) • Specifically, distribute services across Availability Zones and regions
  • 12. History: X-mas Eve 2012 • Netflix multi-hour outage • US-East1 regional Elastic Load Balancing issue • “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data” • “The ELB issue affected less than 7% of running ELBs and prevented other ELBs from scaling.”
  • 13. Isthmus – Normal Operation US-East ELB US-West-2 ELB ELB Zuul Infrastructure* Zone A Tunnel Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 14. Isthmus – Failover US-East ELB US-West-2 ELB ELB Zuul Infrastructure* Zone A Tunnel Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 16. Zuul – Overview Elastic Load Balancing Elastic Load Balancing Elastic Load Balancing
  • 18. Denominator – Abstracting the DNS Layer Amazon Route 53 DynECT DNS UltraDNS Denominator Regional Load Balancers ELBs Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 19. Isthmus – Only for Elastic Load Balancing Failures • Other services may fail region-wide • Not worthwhile to develop one-offs for each one
  • 20. Active-Active – Full Regional Resiliency Regional Load Balancers Regional Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 21. Active-Active – Failover Regional Load Balancers Regional Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 23. Separating the Data – Eventual Consistency • 2–4 region Cassandra clusters • Eventual consistency != hopeful consistency
  • 24. Highly Available NoSQL Storage A highly scalable, available, and durable deployment pattern based on Apache Cassandra
  • 25. Benchmarking Global Cassandra Write intensive test of cross-region replication capacity 16 x hi1.4xlarge SSD nodes per zone = 96 total 192 TB of SSD in six locations up and running Cassandra in 20 minutes Test Load Validation Load 1 Million Reads after 500 ms CL.ONE with No Data Loss US-West-2 Region - Oregon Test Load 1 Million Writes CL.ONE (Wait for One Replica to ack) US-East-1 Region - Virginia Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Interzone Traffic Interregion Traffic Up to 9Gbits/s, 83ms 18 TB backups from S3
  • 26. Propagating EVCache Invalidations 4. Calls Writer with Key, Write Time, TTL & Value after checking if this is the latest event for the key in the current batch. Goes cross-region through ELB over HTTPS Drainer Writer 7. Return keys that were successful 3. Read from SQS in batches Drainer Writer EVCache Replication Service EVCache Replication Service 6. Deletes the value for the key EVCache Replication Metadata 5. Checks write time to ensure this is the latest operation for the key US-WEST-2 EVCAC EVCAC EVCACHE HE HE 8. Delete keys from SQS that were successful 1. Set data in EVCACHE EVCACH EVCACH EE EVCACHE EVCache Replication Metadata 2. Write events to SQS and EVCACHE_REGION_REPLICATION SQS EVCache Client App Server US-EAST-1
  • 28. Running Isthmus and Active-Active
  • 29. Multiregional Monkeys • Detect failure to deploy • Differences in configuration • Resource differences
  • 30. Multiregional Monitoring and Alerting • Per region metrics • Global aggregation • Anomaly detection
  • 31. Failover and Fallback • DNS (denominator) changes • For fallback, ensure data consistency • Some challenges – Cold cache – Autoscaling • Automate, automate, automate
  • 32. Validating the Whole Thing Works
  • 33. Dev-Ops in N Regions • Best practices: avoiding peak times for deployment • Early problem detection / rollbacks • Automated canaries / continuous delivery
  • 34. Hyperscale Architecture Amazon Route 53 DynECT DNS UltraDNS DNS Automation Regional Load Balancers Regional Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 36. Building Blocks Available on Netflix Github site
  • 37. Topic Session # When What an Enterprise Can Learn from Netflix, a Cloud-native Company ENT203 Thursday, Nov 14, 4:15 PM - 5:15 PM Maximizing Audience Engagement in Media Delivery MED303 Thursday, Nov 14, 4:15 PM - 5:15 PM Scaling your Analytics with Amazon Elastic MapReduce BDT301 Thursday, Nov 14, 4:15 PM - 5:15 PM Automated Media Workflows in the Cloud MED304 Thursday, Nov 14, 5:30 PM - 6:30 PM Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale BDT302 Thursday, Nov 14, 5:30 PM - 6:30 PM Encryption and Key Management in AWS SEC304 Friday, Nov 15, 9:00 AM - 10:00 AM Your Linux AMI: Optimization and Performance CPN302 Friday, Nov 15, 11:30 AM - 12:30 PM
  • 38. Takeaways Embrace isolation and redundancy for availability NetflixOSS helps everyone to become cloud native http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix @rusmeshenberg @NetflixOSS
  • 39. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.

Editor's Notes

  1. We’ve built cross vendor DNS automationNetflixOSS Denominator - Global deployment in minutes, robust, agile, denormalized, NoSQL. Triple replicated across availability zones, remote replication across regions. Ability to survive failure of any one zone with no impact. Loss of a whole region and half the customers are redirected to the working region. Failure of a DNS vendor, switch config to a different vendor. Proven scale to well over 10,000 instances with code pushes and autoscaling varying by thousands a day.