@garethbowles
I Don’t Test Often …
… But When I Do, I Test in Production
Building distributed systems is hard
Testing them exhaustively is even harder
Gareth Bowles
Netflix is an Internet television network with more than 57 million
members in 50 countries enjoying more
than one billion hours of TV shows and
movies per month.
We account for up to 34% of downstream
US internet traffic. Source: http://ir.netflix.com
[Architecture diagram] The Netflix API fronts services including the Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, and the A/B Test Engine.
2B requests per day into the Netflix API
12B outbound requests per day to API dependencies
A Complex Distributed System
Our Deployment Platform
What AWS Provides
• Machine Images (AMI)
• Instances (EC2)
• Elastic Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
Our system has become so complex that…
…no (single) body knows how everything works.
How AWS Can Go Wrong - 1
• Service goes down in one or more availability zones
• 6/29/12 - storm-related power outage caused loss of EC2 and RDS instances in Eastern US
• https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
How AWS Can Go Wrong - 2
• Loss of service in an entire region
• 12/24/12 - operator error caused loss of multiple ELBs in Eastern US
• http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
How AWS Can Go Wrong - 3
• Large number of instances get rebooted
• 9/25/14 to 9/30/14 - rolling reboot of 1000s of instances to patch a security bug
• http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
Our Goal is Availability
• Members can stream Netflix whenever
they want
• New users can explore and sign up
• New members can activate their service
and add devices
http://www.slideshare.net/reed2001/culture-1798664
Freedom and Responsibility
• Developers deploy when they want
• They also manage their own capacity and autoscaling
• And are on-call to fix anything that breaks at 3am!
How the heck do you test this stuff?
Failure is All Around Us
• Disks fail
• Power goes out - and your backup
generator fails
• Software bugs are introduced
• People make mistakes
Design to Avoid Failure
• Exception handling
• Redundancy
• Fallback or degraded experience (circuit
breakers)
• But is it enough ?
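The fallback idea above can be sketched with the circuit-breaker pattern (Netflix's production implementation is its Hystrix library; the class, thresholds, and recommendation example below are illustrative, not Netflix code):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive errors,
    calls go straight to the fallback until reset_after seconds pass."""

    def __init__(self, func, fallback, max_failures=3, reset_after=30.0):
        self.func = func
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return self.fallback(*args, **kwargs)  # circuit open: degrade
            self.opened_at = None                      # half-open: try again
            self.failures = 0
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()           # trip the breaker
            return self.fallback(*args, **kwargs)
        self.failures = 0
        return result

# Hypothetical example: fall back to most-watched titles when the
# recommendation service is down (as described in the speaker notes).
def get_recommendations(user):
    raise RuntimeError("recommendation service unavailable")

def most_watched(user):
    return ["Top 10 title"]

breaker = CircuitBreaker(get_recommendations, most_watched, max_failures=2)
```

The caller sees a degraded list instead of an error page, which is the "fallback or degraded experience" bullet in practice.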
It’s Not Enough
• How do we know we’ve succeeded ?
• Does the system work as designed ?
• Is it as resilient as we believe ?
• How do we avoid drifting into failure ?
More Testing !
• Unit
• Integration
• Stress
• Exhaustive testing to simulate all failure
modes
Exhaustive Testing ~ Impossible
• Massive, rapidly changing data sets
• Internet scale traffic
• Complex interaction and information flow
• Independently-controlled services
• All while innovating and building features
Another Way
• Cause failure deliberately to validate
resiliency
• Test design assumptions by stressing
them
• Don’t wait for random failure. Remove
its uncertainty by forcing it regularly
Introducing the Army
Chaos Monkey
Chaos Monkey
• The original Monkey (2009)
• Randomly terminates instances in a cluster
• Simulates failures inherent to running in the
cloud
• During business hours
• Default for production services
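The core loop is simple: pick a random instance per cluster and kill it. A minimal sketch, assuming a `clusters` mapping and a pluggable `terminate` callback (the real Chaos Monkey terminates EC2 instances through the AWS APIs and honours opt-outs; the function names here are hypothetical):

```python
import random

def pick_victim(instances, rng=random):
    """Chaos-Monkey-style selection: one random instance from a cluster."""
    return rng.choice(instances) if instances else None

def unleash(clusters, terminate, opted_out=()):
    """Terminate one randomly chosen instance per cluster, skipping
    clusters that opted out. `terminate` stands in for the cloud API."""
    killed = []
    for cluster, instances in clusters.items():
        if cluster in opted_out:
            continue
        victim = pick_victim(instances)
        if victim is not None:
            terminate(victim)
            killed.append(victim)
    return killed
```

Running this during business hours, by default, is what forces every service team to survive instance loss.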
What did we do once we were able to handle Chaos Monkey ?
Bring in bigger Monkeys !
Chaos Gorilla
Chaos Gorilla
• Simulate an Availability Zone becoming unavailable
• Validate multi-AZ redundancy
• Deploy to multiple AZs by default
• Run regularly (but not continually !)
Chaos Kong
Chaos Kong
• “One louder” than Chaos Gorilla
• Simulate an entire region outage
• Used to validate our “active-active” region strategy
• Traffic has to be switched to the new region
• Run once every few months
Latency Monkey
Latency Monkey
• Simulate degraded instances
• Ensure degradation doesn’t affect other services
• Multiple scenarios: network, CPU, I/O, memory
• Validate that your service can handle degradation
• Find effects on other services, then validate that they can handle it too
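The real Latency Monkey degrades whole instances (network, CPU, I/O, memory); as a toy illustration of the same idea at the application level, a wrapper can probabilistically slow down or fail a call so that callers exercise their timeouts and fallbacks (everything here, including the function names, is an illustrative sketch):

```python
import random
import time

def inject_degradation(p_delay=0.1, delay_s=1.0, p_error=0.0, rng=random):
    """Latency-Monkey-style wrapper: with probability p_delay, add
    delay_s of latency before the call; with probability p_error,
    fail the call outright so callers must handle degradation."""
    def wrap(func):
        def degraded(*args, **kwargs):
            if rng.random() < p_error:
                raise TimeoutError("injected failure")
            if rng.random() < p_delay:
                time.sleep(delay_s)
            return func(*args, **kwargs)
        return degraded
    return wrap

# Probabilities set to zero here so the call passes through untouched.
@inject_degradation(p_delay=0.0, p_error=0.0)
def fetch_metadata(title_id):
    return {"id": title_id, "title": "stub"}
```

Dialing `p_delay` and `p_error` up gradually lets you find which downstream services break first.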
Conformity Monkey
Conformity Monkey
• Apply a set of conformity rules to all instances
• Notify owners with a list of instances and problems
• Example rules
  • Standard security groups not applied
  • Instance age is too old
  • No health check URL
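A rules engine of this shape is just predicates over instance metadata. A minimal sketch, with hypothetical rules mirroring the slide's examples (the real Conformity Monkey reads instance data from AWS and emails the owners):

```python
def check_conformity(instances, rules):
    """Apply (rule_name, predicate) pairs to every instance; return a
    {instance_id: [violated rule names]} report to notify owners with."""
    report = {}
    for inst in instances:
        violated = [name for name, ok in rules if not ok(inst)]
        if violated:
            report[inst["id"]] = violated
    return report

# Illustrative rules; thresholds and field names are assumptions.
RULES = [
    ("standard security groups applied",
     lambda i: "standard" in i.get("security_groups", [])),
    ("instance age not too old",
     lambda i: i.get("age_days", 0) <= 30),
    ("health check URL configured",
     lambda i: bool(i.get("health_url"))),
]
```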
Failure Injection Testing (FIT)
• Latency Monkey adds delay / failure on server side of requests
• Impacts all calling apps - whether they want to participate or not
• FIT decorates requests with failure data
• Can limit failures to specific accounts or devices, then dial up
• http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
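"Decorating requests with failure data" means the failure travels with the request rather than being imposed on every caller. A sketch of the scoping idea, assuming requests are plain dicts (field names like `fit_failure` are illustrative, not the real FIT wire format):

```python
def decorate_request(request, failure, accounts=None, devices=None):
    """FIT-style decoration: attach failure-injection metadata to a
    request, but only when it falls in the test scope. Widening
    accounts/devices (or passing None for "all") dials the test up."""
    in_scope = ((accounts is None or request.get("account") in accounts) and
                (devices is None or request.get("device") in devices))
    if in_scope:
        return dict(request, fit_failure=failure)
    return request
```

Downstream services that see the `fit_failure` field simulate the named failure; everyone else handles the request normally.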
Try it out !
• Open sourced and available at https://github.com/Netflix/SimianArmy and https://github.com/Netflix/security_monkey
• Chaos, Conformity, Janitor and Security available now; more to come
• VMware as well as AWS
What’s Next ?
• New failure modes
• Run monkeys more frequently and aggressively
• Make chaos testing as well-understood as regular regression testing
A message from the owners
“Use Chaos Monkey to induce various kinds of failures in a controlled environment.”
AWS blog post following the mass instance reboot in Sep 2014: http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
Production Code Coverage
If you don’t run code coverage analysis in prod, you’re not doing it properly
What We Get
• Real time code usage patterns
• Focus testing by prioritizing frequently
executed paths with low test coverage
• Identify dead code that can be removed
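The "focus testing" bullet is essentially a ranking problem: paths that production executes heavily but tests barely cover come first. A toy sketch of that scoring, assuming you can join execution counts with coverage fractions (the function and data shapes are illustrative):

```python
def prioritize_paths(stats):
    """Rank code paths for test investment: heavy production execution
    combined with low test coverage ranks highest. stats maps
    path -> (prod_exec_count, test_coverage_fraction in [0, 1])."""
    def urgency(path):
        execs, coverage = stats[path]
        return execs * (1.0 - coverage)   # uncovered production executions
    return sorted(stats, key=urgency, reverse=True)
```

Paths with an urgency of zero and no production executions are the dead-code candidates from the last bullet.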
How We Do It
• Use Cobertura as it counts how many times each LOC is executed
• Easy to enable - Cobertura JARs included in our base AMI; set a flag to add them to Tomcat’s classpath
• Enable on a single instance
• Very low performance hit
Canary Deployments
Canaries
• Push changes to a small number of instances
• Use Asgard for red / black push
• Monitor closely to detect regressions
• Automated canary analysis
• Automatically cancel deployment if problems occur
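At its simplest, automated canary analysis compares each canary metric against the baseline fleet and flags anything that drifts too far. A minimal sketch (the metric names, tolerance, and rollback policy are assumptions, not Netflix's actual scoring algorithm):

```python
def canary_ok(baseline, canary, tolerance=0.2):
    """Automated-canary-analysis sketch: each canary metric must stay
    within `tolerance` (relative) of the baseline fleet's value.
    Returns (ok, offending_metrics); a failing canary would trigger
    automatic cancellation of the deployment."""
    offending = []
    for metric, base in baseline.items():
        value = canary.get(metric, 0.0)
        if base == 0:
            if value != 0:
                offending.append(metric)
        elif abs(value - base) / abs(base) > tolerance:
            offending.append(metric)
    return (not offending, offending)
```

In a red/black push, a passing canary is promoted to take full traffic; a failing one is torn down with the old version still serving.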
Closing Thoughts
• Don’t be scared to test in production !
• You’ll get tons of data that you couldn’t get from test …
• … and hopefully sleep better at night
Thanks, QA or the Highway !
Email: gbowles@{gmail,netflix}.com
Twitter: @garethbowles
LinkedIn: www.linkedin.com/in/garethbowles

Speaker notes

  2. Hi, everyone. Thanks for coming today. This is my first keynote, so I hope I can do it justice ! Just to set some expectations - I heard that at Matt’s keynote last year, he was giving out $20 bills. I’m afraid I don’t have any cash, but I do have plenty of snazzy Simian Army stickers up here. I’m going to talk about some big testing challenges that we face at Netflix. In particular, we have such a large and complex distributed system that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with a few different approaches, and I’m going to talk about three of those today: the Simian Army, which is a set of tools that induces failures in production; code coverage analysis on production servers; and using canaries to test new versions in production. I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the tools themselves.
  3. A little bit about me. I’ve been with Netflix for 4 1/2 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort. Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested. Let’s take a look at how it all works.
  4. Who here ISN’T familiar with Netflix ? Any customers ? Thanks very much ! Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected). So there’s a lot of engineering that goes on behind the scenes to make all that possible.
  5. Some data that might be new to some of you guys. Our membership is growing fast; about two thirds of our members are in the US, but we’re now in more than 50 other countries - all countries in North and South America, plus most of Europe. The amount of content our viewers are watching is growing, too. And we’re doing our best to break the internet.
  6. This is an overview of our current architecture. 2 billion requests flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more. Those generate 12 billion outbound requests to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when they watch it.
  7. I like to pause for audience reaction here. Anyone care to suggest what this diagram shows ? We call it the “Shock and Awe” diagram at Netflix ! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard, you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
  8. We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers - apart from Amazon itself, who uses AWS to power their e-commerce sites as well as their streaming service that competes with Netflix.
  9. Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ... AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity. And since every AWS service has an API, we can automate our deployments and throw away all those runbooks. Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
  10. So now that I’ve described how everything works at a really high level, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
  11. AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages. On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
  12. An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’ primary region for serving traffic. ELBs aren’t replicated between availability zones, they only apply to a given region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day when everyone gets given Netflix subscriptions, and the problem was fixed by the 25th.
  13. In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
  14. Given those kinds of problems, we need to work pretty hard to deal with them. Netflix wants our 50 million plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
  15. One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
  16. Here’s how the “freedom & responsibility” principle applies to our technical development and deployment. Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling. Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
  17. So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions? It's a very tough problem.
  18. Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
  19. We can do a lot to make our code handle failure gracefully. We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code. We can run multiple instances of our services to avoid single points of failure. And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
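The fallback behavior I just described can be sketched in a few lines. This is a minimal illustration of the idea, not Netflix's actual code (the real implementation is far more involved, with failure-rate tracking that opens and closes the circuit); the class and method names here are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

public class RecommendationService {
    // The degraded-experience fallback: a static most-watched list.
    public static final List<String> MOST_WATCHED =
        Arrays.asList("Popular Show A", "Popular Show B", "Popular Show C");

    // Stands in for the remote personalization call, which may be unavailable.
    static List<String> fetchPersonalized(boolean dependencyUp) {
        if (!dependencyUp) {
            throw new RuntimeException("personalization service unavailable");
        }
        return Arrays.asList("Tailored Pick 1", "Tailored Pick 2");
    }

    // If the dependency fails, degrade to most-watched rather than showing
    // the member a blank recommendations list.
    public static List<String> recommendations(boolean dependencyUp) {
        try {
            return fetchPersonalized(dependencyUp);
        } catch (RuntimeException e) {
            return MOST_WATCHED;
        }
    }
}
```

The key property is that the caller always gets a usable list back; whether the dependency is up only changes how personalized that list is.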
  20. But that only gets us so far. We want to make sure all our features work properly without waiting for customers to tell us they don’t. We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests. We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
  21. So, most of you guys are testers and probably have this reaction - let's do more testing! But can we effectively simulate such a large-scale distributed system - and what's more, can we predict every possible failure mode and encode it into our tests?
  22. Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with. It’s very hard to find realistic test data. It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there. Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
  23. So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
  24. So with all that context in place, let's take a look at the lovable monkeys who make up our Simian Army.
  25. Chaos Monkey was the one who started it all.
  26. Chaos Monkey has been around in some form for about 5 years. It's a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination. This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware. We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems, rather than getting a 3am page. If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off. We've got to a point where Chaos Monkey instance terminations go virtually unnoticed.
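The core decision Chaos Monkey makes each round - for each cluster, with a configured probability, pick one random instance to terminate - can be sketched like this. The class and method names are illustrative assumptions, not the open-source Chaos Monkey API.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

public class ChaosSketch {
    private final Random random;
    private final double terminationProbability;

    public ChaosSketch(double terminationProbability, long seed) {
        this.terminationProbability = terminationProbability;
        this.random = new Random(seed);
    }

    // Returns the instance id to terminate, or empty if the cluster is spared
    // this round, has opted out, or has no instances.
    public Optional<String> pickVictim(List<String> instanceIds, boolean optedOut) {
        if (optedOut || instanceIds.isEmpty()) {
            return Optional.empty();
        }
        if (random.nextDouble() >= terminationProbability) {
            return Optional.empty();  // spared this round
        }
        return Optional.of(instanceIds.get(random.nextInt(instanceIds.size())));
    }
}
```

Note the opt-out check: it models the talk's point that Chaos Monkey is on by default for new services unless a team explicitly turns it off.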
  27. We didn’t want to stop once we were happy that we could deal with individual instances dying.
  28. Gorillas are bigger than monkeys and can carry bigger weapons.
  29. Chaos Gorilla takes out an entire Availability Zone. AWS has multiple regions in different parts of the world, such as Eastern USA, Western USA, Asia Pacific and Western Europe. Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones. Running Chaos Gorilla ensures that our service is running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load. Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
  30. So what next? Since we'd already used the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
  31. Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out. This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones. And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region. To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions. If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region. Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
  32. So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don't cascade to other services.
  33. This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory. Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes. Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
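The simplest flavor of what Latency Monkey does - adding delay in front of a service call in a controlled way - can be sketched as a wrapper. This is a toy illustration under my own naming, not the real tool, which injects degradation at the infrastructure level rather than in application code.

```java
import java.util.function.Supplier;

public class LatencySketch {
    // Wrap a call so it sleeps for extraDelayMillis before delegating.
    // Dependent services calling through this wrapper will see the added
    // latency, exposing missing timeouts and fallbacks.
    public static <T> Supplier<T> withInjectedLatency(Supplier<T> call,
                                                      long extraDelayMillis) {
        return () -> {
            try {
                Thread.sleep(extraDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return call.get();
        };
    }
}
```

Dialing `extraDelayMillis` up gradually mirrors how degradation is introduced in a controlled way before verifying that dependent services cope.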
  34. One other key aspect of running so many services, all with dozens or hundreds of instances, is that it's very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This consistency reduces the complexity and surface area of our testing.
  35. We use Conformity Monkey to automate these checks. It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances. Here are a few examples of the things we check: Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools. Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service. Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
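Those three example rules can be sketched as simple predicate checks that collect human-readable violations per instance. The `Instance` fields and rule wording here are assumptions for illustration; the open-source Conformity Monkey has its own rule model.

```java
import java.util.ArrayList;
import java.util.List;

public class ConformitySketch {
    public static class Instance {
        final String id;
        final List<String> securityGroups;
        final long ageDays;
        final String healthCheckUrl;

        public Instance(String id, List<String> securityGroups,
                        long ageDays, String healthCheckUrl) {
            this.id = id;
            this.securityGroups = securityGroups;
            this.ageDays = ageDays;
            this.healthCheckUrl = healthCheckUrl;
        }
    }

    // Collect the rule violations for one instance; an empty list means
    // the instance conforms.
    public static List<String> violations(Instance i, String requiredGroup,
                                          long maxAgeDays) {
        List<String> v = new ArrayList<>();
        if (!i.securityGroups.contains(requiredGroup)) {
            v.add(i.id + ": missing required security group " + requiredGroup);
        }
        if (i.ageDays > maxAgeDays) {
            v.add(i.id + ": running for " + i.ageDays + " days (max "
                  + maxAgeDays + ") - may be on an out of date version");
        }
        if (i.healthCheckUrl == null || i.healthCheckUrl.isEmpty()) {
            v.add(i.id + ": no health check URL");
        }
        return v;
    }
}
```

In the real system, the violation lists for each rule are what get rolled up into the notification email to the service owners.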
  36. We recently came up with a system called FIT, for Failure Injection Testing. We still need to come up with a monkey for this one! Latency Monkey injects delays or failures on the server side and thus affects all calling services. If all those calling services don't have proper fallbacks and timeouts implemented, they can stop working and impact customers - which we obviously want to avoid. FIT allows failures to be simulated on the client side. We can add failure data to a specific set of API calls and propagate it through the system so that only the services we want to test are affected. We usually start with a specific test customer account or a particular client device, then dial up the failures to affect more and more production traffic if the initial results look good. Check the Netflix Tech Blog for more details.
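The scoping decision at the heart of FIT - inject failure only for targeted accounts or devices, then dial up to a percentage of all traffic - might look roughly like this. Every name and field here is a hypothetical stand-in; the real FIT propagates a failure context with each request rather than checking sets like this.

```java
import java.util.Set;

public class FitSketch {
    public static class RequestContext {
        final String customerId;
        final String deviceType;

        public RequestContext(String customerId, String deviceType) {
            this.customerId = customerId;
            this.deviceType = deviceType;
        }
    }

    // Inject failure if the request belongs to a test account or device,
    // or (when dialing up) falls inside the configured traffic percentage.
    // 'roll' is a uniform random draw in [0, 1) made by the caller.
    public static boolean shouldInjectFailure(RequestContext ctx,
                                              Set<String> testCustomers,
                                              Set<String> testDevices,
                                              double trafficPercentage,
                                              double roll) {
        if (testCustomers.contains(ctx.customerId)) return true;
        if (testDevices.contains(ctx.deviceType)) return true;
        return roll < trafficPercentage / 100.0;
    }
}
```

Starting with `trafficPercentage` at zero limits the blast radius to the explicit test accounts, which is the "initial results" phase described above.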
  37. You can try the Monkeys out for yourself by going to the Netflix GitHub page. There's an active community of users, and Netflix engineers regularly monitor the mailing list. Chaos, Conformity, Janitor and Security Monkeys are currently available, with more to come. For those of you not using AWS but with an in-house VMware setup - you can use the Monkeys too, thanks to some of our open source contributors.
  38. There’s a lot more we want to do with the Monkeys. We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone. We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild. We’re working on an effort to increase the frequency and reach of monkey runs. This is in response to some interesting data - our uptime degraded when we ran the monkeys less frequently. Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
  39. After the mass instance reboot I mentioned earlier, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed!
  40. Let’s move on to our second way of testing in production - code coverage analysis. Our view is that if you’re not doing it in prod, you’re missing out on a ton of useful data.
  41. In contrast to just showing what code paths are covered by tests, we get data on the paths which are actually used in production, plus how often each path is run. We compare the production code coverage data with the results from our test environment. This enables us to focus our testing, for example by identifying commonly used code paths with low test coverage. We can also find dead code that is never used in production, and remove that code and its tests to make maintenance easier.
  42. All of our services run on the JVM, so we needed a Java code coverage tool. We picked Cobertura because it counts how many times each line of code is executed, in contrast to most other tools which just give you a binary result of whether or not the line was executed. We put the Cobertura JAR files on our base machine image that all of the services build on, and set a flag at runtime to enable code coverage analysis. We’ll usually run code coverage on a single instance in a service cluster, and leave it running for a day or so to make sure that all the code paths are hit. We’ve found the performance hit to be very low - typically less than 5% degradation in performance.
  43. Our third way of testing in production is the use of canary deployments. The term came from coal mining, when miners wanted to detect dangerous levels of coal gas in the mine shaft. They would take down a canary, which was very sensitive to the gas; if the canary keeled over or died, it was time to get out of there before the miners did the same.
  44. Automated rollback happens near the beginning of a canary run if there is a drastic regression - we're still learning what counts as "bad enough", and it varies by team. If the regression is less severe, the team decides whether to put the new service into production or cancel the push. Each team can have a different level of tolerance for regressions. Teams that deploy very frequently will tend to have a lower tolerance than teams who deploy less often and maybe don't have their automated analysis fully developed.
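The three-way outcome just described - automatic rollback on a drastic regression, automatic promotion within tolerance, and a human decision in between - can be sketched as a comparison of canary and baseline error rates. The thresholds and names are illustrative; real canary analysis at Netflix weighs many metrics, not just one error rate.

```java
public class CanarySketch {
    public enum Decision { PROMOTE, MANUAL_REVIEW, ROLLBACK }

    // tolerableFactor: canary/baseline ratio below which we promote
    //                  automatically (each team tunes its own value).
    // drasticFactor:   ratio at or above which we roll back automatically.
    public static Decision evaluate(double baselineErrorRate,
                                    double canaryErrorRate,
                                    double tolerableFactor,
                                    double drasticFactor) {
        if (baselineErrorRate <= 0) {
            baselineErrorRate = 1e-9;  // avoid divide-by-zero on a clean baseline
        }
        double ratio = canaryErrorRate / baselineErrorRate;
        if (ratio >= drasticFactor) {
            return Decision.ROLLBACK;      // drastic regression: pull the canary
        }
        if (ratio <= tolerableFactor) {
            return Decision.PROMOTE;       // within the team's tolerance
        }
        return Decision.MANUAL_REVIEW;     // team decides: push or cancel
    }
}
```

A frequently-deploying team would set `tolerableFactor` close to 1, while a team still building out its automated analysis might leave a wider manual-review band.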
  45. And that’s the end of my talk; before we go to some questions, I’d like to give a big thanks to the organizers for putting on such a great conference, and to all of you for coming. I’ve been really impressed by what a great testing community you have here.