(SPOT302) Availability: The New Kind of Innovator’s Dilemma

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Coburn Watson, Director of Performance and Reliability, Netflix
October 2015
SPOT302
Availability
The New Kind of Innovator’s Dilemma

@coburnw
• Cloud performance and reliability @ Netflix
• Reduce time-to-detect and time-to-resolve
• Optimize usage of AWS cloud
• Steer global user traffic and support failover
• Inject chaos into production environment
• Build innovative performance analysis tooling
• Drive operational best practice adoption

• 67M+ subscribers
• > 50 countries
• > 3 billion hours of video streamed monthly
• Massive cloud footprint
• Homegrown CDN
• Strong Originals slate

Atlas
https://netflix.github.io/

What to Expect from the Session
• Strategies
• Maximizing engineering velocity in the cloud
• Minimizing risks to availability

The cloud is a journey
…not a destination*
* Adapted from Ralph Waldo Emerson

The Netflix Cloud Journey
Timeline: 2008, 2010, 2011, 2013, 2015
Milestones (in slide order):
• Datacenter failure
• Serving off AWS US-EAST-1
• Three AZ deployments
• Serving off AWS EU-WEST-1
• Chaos Monkey unleashed
• Serving from AWS US-WEST-2
• Running Active-Active
• Chaos Kong unleashed
• Last application to the cloud
• Active-Active in three AWS regions

Shifting the Curve@Netflix
• Maintain or improve availability as engineering velocity increases

Maximize Engineering Velocity
"FA-18 Hornet breaking sound barrier (7 July 1999) - filtered" by Ensign John Gay, U.S. Navy

Infrastructure on Demand
• No procurement process
• “all you can eat” **
• Expose IaaS via Spinnaker
• No passwords, no keys
** please don’t eat all of it 

Accelerate Code Deployment
• Commit-to-cloud in minutes
• Across three AWS regions

Decouple Services
• µservice architecture (500+ @Netflix)
• One Auto Scaling group per service
• Independent push schedules (1 day to 4 weeks)
• Communicate via API
• Independent databases (280+ Cassandra clusters)
• Minimize aggregate rate of change
• Update code which needs updating…

Minimize Risks to Availability
“If everything seems under control, you're not going fast enough.”
― Mario Andretti

Maximize Infrastructure Stability
• Run on AWS
• Purchase 3-year EC2 Reserved Instances (for failover as well)
• Distribute Auto Scaling groups across 3 Availability Zones per region

Propagate Changes Safely into Production
• Rolling regional “red-black” pushes
• Build pipelines & automated canary analysis
• 30 second time-to-detect on critical metrics
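A rolling regional red-black push can be sketched roughly as follows. This is an illustrative model only, not Netflix's or Spinnaker's implementation: the `ServerGroup` class, the function names, and the version labels are all hypothetical. The idea it shows is that the new server group comes up alongside the old one, traffic moves only if the new group is healthy, the old group is disabled rather than destroyed so rollback stays cheap, and the rollout stops at the first region that fails.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical stand-in for one version of a service in one region."""
    name: str
    healthy: bool
    enabled: bool = False  # True when the group is taking traffic


def red_black_push(old, new):
    """Shift traffic to `new` if it is healthy; return the group left serving."""
    if not new.healthy:
        # Leave the old group in service; the new one never takes traffic.
        return old
    new.enabled = True
    # Disable rather than delete, so rollback is just re-enabling `old`.
    old.enabled = False
    return new


def rolling_regional_push(regions):
    """Push one region at a time; stop at the first region that fails."""
    serving = {}
    for region, (old, new) in regions.items():
        active = red_black_push(old, new)
        serving[region] = active.name
        if active is old:
            break  # do not propagate a bad build to the remaining regions
    return serving
```

Regions that were never reached simply do not appear in the returned mapping, which makes a halted rollout easy to spot.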

Automated Canary Analysis
• Rigorous quality and performance checks part of code push
• Canary score is the gate for push
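One simple way to picture a canary score acting as a push gate is sketched below. The scoring rule (the percentage of metrics on which the canary stays within a tolerance of the baseline), the 10% tolerance, the 95-point threshold, and the metric names are all assumptions made for illustration; the slides do not describe how the actual score is computed.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Return 0-100: the percentage of metrics where the canary is within
    `tolerance` (relative difference) of the baseline. Assumed scoring rule."""
    passed = 0
    for name, base_value in baseline.items():
        canary_value = canary[name]
        if base_value == 0:
            ok = canary_value == 0
        else:
            ok = abs(canary_value - base_value) / abs(base_value) <= tolerance
        passed += ok
    return 100.0 * passed / len(baseline)


def should_push(score, pass_threshold=95.0):
    """Gate the push on the canary score (threshold is hypothetical)."""
    return score >= pass_threshold
```

For example, a canary whose 99th-percentile latency is 50% above baseline fails one of four metrics, scores 75, and blocks the push.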

Cross-Service Resiliency
• Isolate misbehaving services
• Open “circuits” and provide fallback experiences
Normal (personalized) vs. Degraded (unpersonalized)
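The open-circuit-with-fallback idea can be sketched in a few lines. This is a deliberately minimal circuit breaker, not the Hystrix API: the class name, the consecutive-failure threshold, and the callables are hypothetical, and a production breaker would also include a timed half-open state so the circuit can close again.

```python
class CircuitBreaker:
    """Minimal sketch: opens after N consecutive failures, then serves fallback."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            # Circuit is open: skip the misbehaving service entirely.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0
        return result
```

With a personalization call as `primary` and a static, unpersonalized list as `fallback`, callers still get a usable response while the downstream service is isolated.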

Improve Time-To-Detect
• 30 second alerts vs. prior 8 minutes
• Utilize streaming analysis infrastructure at the edge tier
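A rough sketch of why streaming evaluation shortens time-to-detect: the alert condition is re-checked on every event over a short sliding window instead of waiting for a periodic batch. The class below is hypothetical and does not describe Netflix's actual edge-tier infrastructure; the 30-second window mirrors the slide, while the 5% error-rate limit is an assumed value.

```python
from collections import deque


class SlidingWindowAlert:
    """Re-evaluate an error-rate alert on every event (illustrative only)."""

    def __init__(self, window_seconds=30, max_error_rate=0.05):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def observe(self, timestamp, is_error):
        """Record one event; return True if the alert should fire now."""
        self.events.append((timestamp, is_error))
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        errors = sum(1 for _, failed in self.events if failed)
        return errors / len(self.events) > self.max_error_rate
```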

Dynamically Provision Capacity
• Reactively scale Auto Scaling groups
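Reactive scaling can be illustrated with a proportional rule: size the group so that per-instance load returns to a target, clamped to the group's bounds. The function below is a sketch under that assumption; the parameter names are hypothetical, and it does not reproduce any specific AWS Auto Scaling policy type.

```python
import math


def desired_capacity(current_instances, observed_load, target_load,
                     min_size, max_size):
    """Proportional scaling sketch: bring per-instance load back to target.

    `observed_load` and `target_load` are per-instance values in the same
    unit (for example, requests per second).
    """
    wanted = math.ceil(current_instances * observed_load / target_load)
    return max(min_size, min(max_size, wanted))
```

For instance, 10 instances each observing 900 requests per second against a target of 600 would be scaled to 15, provided that stays within the group's minimum and maximum size.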

Flexibility in Traffic Management
• Target three primary AWS regions
• Maintain capacity to allow regional evacuation
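The capacity requirement behind regional evacuation is easy to work through. Assuming, purely for illustration, that an evacuated region's traffic is split evenly among the surviving regions, three equally loaded regions each need headroom for 1.5 times their normal load. The functions and traffic figures below are hypothetical; real traffic steering depends on geography and latency rather than an even split.

```python
def evacuate(traffic, region):
    """Return per-region traffic after moving `region`'s load to the others."""
    survivors = [r for r in traffic if r != region]
    if not survivors:
        raise ValueError("cannot evacuate the only region")
    share = traffic[region] / len(survivors)
    return {r: traffic[r] + share for r in survivors}


def can_evacuate(traffic, capacity, region):
    """True if every surviving region can absorb its share of the load."""
    after = evacuate(traffic, region)
    return all(after[r] <= capacity[r] for r in after)
```

Running this check for each region in turn shows whether any single region can be evacuated without overloading the other two.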

Frequently Exercise “Chaos”
• Netflix runs regional failover exercises monthly
• Can you spot the chaos?

Frequently Exercise “Chaos”
• Validates
• Failover correctness
• Capacity
• Failover velocity
• Confidence in usage
(same time window as previous slide)

Continually Lower Operational Barriers
• “Production Ready” Program
• Identify operational best practices
• Develop tooling
• Consult with engineering teams
• Identify reliability “anti-patterns”…address
• Example key areas
• Auto Scaling, Hystrix tuning, alerting, automated canary analysis, Apache/Tomcat tuning

Regional Isolation
Push-induced failure

Automated Service Fallbacks
• Downstream service issue; fallbacks gracefully applied

…but what about efficiency?
…That’s a separate talk altogether

Wrapping it Up
• “To the cloud” – a journey
• Abstract complexity via platform
• Don’t be afraid to break things
• Break things intentionally and frequently
• Invest in reliability to support increased innovation
• Hire top talent

Related Sessions
Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B

Remember to complete
your evaluations!
