From Resilient to Antifragile
Chaos Engineering Primer
By @Sergiu_Bodiu
Solution Architect
DevOpsDays
Singapore

Conference
Singapore Spring
User Group

What is an ARCHITECT
https://www.thekua.com/atwork/2016/11/the-well-rounded-architect/ (@patkua)

Risk management
The new normal: from RESILIENT to ANTIFRAGILE

A new way to look at organizations
Fragile: At risk of total failure / financial ruin
Resilient: Takes damage, avoids total failure, recovers
Robust: Absorbs uncertainty, repels blows, avoids damage
Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.

Blueprint for living in a Black Swan world.
The Antifragile, and only the Antifragile, will make it.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

Software is a Single Point of Failure
Root cause analysis: component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and thus guarded with extra redundancies, while software remains a single point of failure.

Distributed Systems Complexity
Complexity is like addiction…
Case study: How complexity creeps in (@jasonfried)
https://m.signalvnoise.com/case-study-how-complexity-creeps-in-cba48023e6a1

Chaos Engineering
Discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
NETFLIX, http://principlesofchaos.org

Some outages in the Region
• SingTel fined a record $6m for Bukit Panjang exchange fire
• Telstra goes down again, people can't drink beer or catch Ubers
• Amazon Web Services outage causes Australian website chaos

Backups
"Backups always succeed. It's the restores that fail. Test your backups by practicing restores!"
Using Chaos Monkey
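
A restore drill can be as small as the sketch below. It is only an illustration under stated assumptions: the "backup" and "restore" steps are plain file copies, and the helper names (`sha256_of`, `restore_matches_source`) and the file `orders.db` are hypothetical. A real drill would call your actual backup tooling in their place.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def restore_matches_source(source, backup_dir, restore_dir):
    """Back up `source`, restore it elsewhere, and compare checksums."""
    backup = Path(backup_dir) / Path(source).name
    shutil.copy2(source, backup)      # stand-in for the real backup step
    restored = Path(restore_dir) / Path(source).name
    shutil.copy2(backup, restored)    # stand-in for the real restore step
    return sha256_of(source) == sha256_of(restored)


# Practice the restore in a throwaway directory (hypothetical data file).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "backup").mkdir()
    (root / "restore").mkdir()
    data = root / "orders.db"
    data.write_bytes(b"order-1\norder-2\n")
    restore_ok = restore_matches_source(data, root / "backup", root / "restore")
    print(restore_ok)
```

The point of the drill is the comparison at the end: a backup only counts once the restored copy has been checked against the source.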

Netflix Simian Army
Suite of tools for keeping your cloud operating in top form.
https://github.com/Netflix/SimianArmy

Chaos Monkey
1. Active during normal working hours
2. Break things in production
3. Design better software services
4. Embracing failure
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html

https://github.com/Netflix/security_monkey
Monitors AWS and GCP accounts for policy changes and alerts on insecure configurations.
Security Monkey can be extended with custom account types, custom watchers, custom auditors, and custom alerters.

Other Monkeys
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Doctor Monkey

PRINCIPLES > TOOLS
Why we do > What we do

Dejirafication
Alexey Krivitsky https://www.slideshare.net/krivitsky/dejirafication-clean-your-process

Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
Chaos Engineering Whitepaper 2016

Hypothesize
> sudo watch
• Start with steady state behavior.
• Monitor metrics that are visible.
• Capture an interaction between the users and the system.
TIP: Utilisation is virtually useless as a metric!
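
A steady-state hypothesis can be written down as a check on a user-facing metric. The sketch below is a minimal illustration, not a prescribed method: the function name `steady_state_holds`, the 5% tolerance, and the sample values (successful checkouts per second) are all assumptions made for the example.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.05):
    """Return True if the observed mean stays within `tolerance`
    (a fraction of the baseline mean) of the baseline mean."""
    if not baseline or not observed:
        raise ValueError("both samples must be non-empty")
    expected = mean(baseline)
    deviation = abs(mean(observed) - expected)
    return deviation <= tolerance * abs(expected)


# Hypothetical samples: successful checkouts per second.
baseline = [100, 102, 98, 101, 99]
during_experiment = [97, 99, 96, 98, 100]
degraded = [70, 65, 72, 68, 71]

print(steady_state_holds(baseline, during_experiment))  # True
print(steady_state_holds(baseline, degraded))           # False
```

The metric is a business-level rate rather than CPU or memory utilisation, in line with the tip above.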

Vary Events
> sudo halt
• Terminate virtual machine instances
• Inject latency into requests between services
• Fail requests between services
• Fail an internal microservice
• Make an entire region unavailable
TIP: Select only a subset of users
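
Two of the events above, injected latency and failed requests, can be sketched as a wrapper around a service call that only affects a subset of users. This is an illustrative sketch, not the API of Chaos Monkey or any other tool: `in_experiment`, `call_with_chaos`, `InjectedFailure`, and the hash-based bucketing are assumptions made for the example.

```python
import hashlib
import time


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def call_with_chaos(func, user_id, percent=5, delay_s=0.0, fail=False,
                    sleep=time.sleep):
    """Call `func`, injecting latency and/or a failure for selected users."""
    if in_experiment(user_id, percent):
        if delay_s > 0:
            sleep(delay_s)  # injected latency
        if fail:
            raise InjectedFailure(f"injected failure for {user_id}")
    return func()


# Users outside the experiment are never affected.
print(call_with_chaos(lambda: "ok", "alice", percent=0, fail=True))
```

Hashing the user id keeps the same users in the experiment on every request, which makes the measured impact easier to attribute.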

Experiment
• End to end TESTING (Expensive)
• Process is slow
• Configuration Drift from Production
• 92% of ERRORS could be prevented (Simple)
TIP: Customers don't behave like your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf

Automate
> sudo while (1)
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
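
Running experiments continuously, each at its own rate, can be sketched as a small scheduler. The `Experiment` class, `due_experiments`, the experiment names, and the intervals below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    interval_s: float                 # how often this experiment should run
    last_run: float = float("-inf")   # never run yet


def due_experiments(experiments, now):
    """Return the experiments whose interval has elapsed and mark them as run."""
    due = []
    for exp in experiments:
        if now - exp.last_run >= exp.interval_s:
            exp.last_run = now
            due.append(exp)
    return due


# Hypothetical rates: a cheap experiment hourly, an expensive one daily.
experiments = [
    Experiment("terminate-instance", interval_s=3600),
    Experiment("region-failover", interval_s=86400),
]

for now in (0, 1800, 3600):
    print(now, [exp.name for exp in due_experiments(experiments, now)])
```

In a long-running process, the loop at the end would be replaced by a `while True` loop that reads `time.monotonic()`, runs whatever is due, and sleeps between checks.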

Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
Chaos Engineering Whitepaper 2016

Reference Architecture for Cloud Native Platform
https://content.pivotal.io/white-papers/the-upside-down-economics-of-building-your-own-platform

Pivotal Cloud Foundry

Chaos Lemur demo
Chaos Lemur = Chaos Monkey + PCF

Locust demo
Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user tests with sessions and cookies
• Scales out to multiple worker nodes to increase load capacity
• Allows user paths to be distributed by percentage

Gatling is an open-source Scala load testing framework.
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
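
Distributing user paths by percentage boils down to weighted random selection. The sketch below uses only the Python standard library to show the idea. It is not Locust's or Gatling's API, and the path names and weights are hypothetical.

```python
import random
from collections import Counter


def pick_paths(weights, n, seed=None):
    """Draw `n` user paths, each chosen in proportion to its weight."""
    rng = random.Random(seed)
    paths = list(weights)
    return rng.choices(paths, weights=[weights[p] for p in paths], k=n)


# Hypothetical traffic mix: 70% browse, 20% search, 10% checkout.
weights = {"browse": 70, "search": 20, "checkout": 10}
sample = pick_paths(weights, 10_000, seed=42)
counts = Counter(sample)
print(counts.most_common())
```

A load testing framework applies the same principle when it assigns weights to the tasks a simulated user can perform.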

Lessons Learned
• Take a systematic approach to Chaos Testing.
• This is incredibly hard under pressure.
• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture at the last minute is extremely dangerous.
• Join the community.
• Build relationships with the Networking Team, Database Team, Third Party Partners, Vendors, etc.
• Make everything Asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence)
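
Retry and idempotence work together: a retried request must not apply its effect twice. The sketch below is an illustration under stated assumptions. `TransientError`, `retry`, and the `PaymentService` that loses its first two replies are hypothetical, and the backoff parameters are arbitrary.

```python
import time
import uuid


class TransientError(Exception):
    """A failure that is worth retrying (timeout, lost reply, ...)."""


def retry(operation, attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay_s * 2 ** attempt)


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self.processed = {}
        self.calls = 0

    def charge(self, key, amount):
        self.calls += 1
        if key not in self.processed:   # apply the effect only once per key
            self.processed[key] = amount
        if self.calls < 3:              # the work happened, the reply was lost
            raise TransientError("reply lost")
        return self.processed[key]


service = PaymentService()
key = str(uuid.uuid4())  # one key per logical request, reused on every retry
result = retry(lambda: service.charge(key, 25), base_delay_s=0.01)
print(result, service.calls, len(service.processed))  # 25 3 1
```

The caller retried twice, yet the charge was recorded once, because the key identifies the logical request rather than the attempt.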

The importance of reliability
Don't trust claims systems make about themselves & their dependencies.
Verify by breaking.

Clean your process
Culture > Principles > Tools
> Post Mortem
> sudo halt
Incident Start
Impact

Testing Pyramid
https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/

Further Reading
https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service
https://www.infoq.com/articles/chaos-engineering
@Ops_Engineering https://www.youtube.com/watch?v=CZ3wIuvmHeM
@caseyrosenthal https://www.youtube.com/watch?v=Q4nniyAarbs
Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures
Thank You
@sergiu_bodiu
Questions

Principles
https://12factor.net: Any developer building applications which run as a service. Ops engineers who deploy or manage such applications.
http://www.10factor.ci: Anyone working in software that writes tests or maintains continuous integration pipelines.
