SlideShare une entreprise Scribd logo
1  sur  53
Télécharger pour lire hors ligne
Why chaos must be practiced
^^ not my talk title!
Closing the Loop on Chaos
with Observability
^^ actual title
"Chaos" is a fancy marketing term for running
tests later in the software development lifecycle.
This is fine!
But we used to call it something else:
I blame this guy:
Testing in production has gotten a bad rap.
how they think we are
how we really are
Our idea of what the software development
lifecycle even looks like is overdue for an upgrade
in the era of distributed systems.
Deploying code is not a binary switch.
Deploying code is a process of increasing your
confidence in your code.
Development Production
deploy
Observability
Development Production
Observability
Development Production
why now?
“Complexity is increasing” - Science
Architectural complexity
Parse, 2015LAMP stack, 2005
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
We are all distributed systems
engineers now
the unknowns outstrip the knowns
why does this matter more and more?
Distributed systems are particularly hostile to being
cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of
almost-impossible failure scenarios that make staging
environments particularly worthless.
this is a black hole for engineering time
Operational literacy
Is not a nice-to-have
Without observability, you don't have "chaos
engineering". You just have chaos.
So what is observability?
Observability is NOT the same as monitoring.
@grepory, Monitorama 2016
“Monitoring is dead.”
“Monitoring systems have not changed significantly in 20 years and has fallen behind the way we
build software. Our software is now large distributed systems made up of many non-uniform
interacting components while the core functionality of monitoring systems has stagnated.”
@grepory, 2016
This is a outdated model for complex systems.
Observability
“In control theory, observability is a measure of how well internal
states of a system can be inferred from knowledge of its external
outputs. The observability and controllability of a system are
mathematical duals." — wikipedia
… translate??!?
Observability for software engineers:
Can you understand what’s happening inside your
systems, just by asking questions from the outside? Can
you debug your code and its behavior using its output?
Can you answer new questions without shipping new code?
You have an observable system
when your team can quickly and reliably track
down any new problem with no prior knowledge.
For software engineers, this means being able to
reason about your code, identify and fix bugs, and
understand user experiences and behaviors ...
via your instrumentation.
Monitoring
Represents the world from the perspective of a third party, and
describes the health of the system and/or its components in aggregate.
Observability
Describes the world from the first-person perspective of the software,
executing each request. Software explaining itself from the inside out.
We don’t *know* what the questions are, all
we have are unreliable symptoms or reports.
Complexity is exploding everywhere,
but our tools are designed for
a predictable world.
As soon as we know the question, we usually
know the answer too.
Welcome to distributed systems.
it’s probably fine.
(it might be fine?)
Many catastrophic states exist at any given time.
Your system is never entirely ‘up’
Distributed systems have an infinitely long list of
almost-impossible failure scenarios that make staging
environments particularly worthless.
this is a black hole for engineering time
You do it.
You have to do it.
Do it well.
Let’s try some examples!
Can you quickly and reliably track down problems like these?
The app tier capacity is exceeded. Maybe we
rolled out a build with a perf regression, or
maybe some app instances are down.
DB queries are slower than normal. Maybe
we deployed a bad new query, or there is lock
contention.
Errors or latency are high. We will look at
several dashboards that reflect common root
causes, and one of them will show us why.
“Photos are loading slowly for some people. Why?”
Monitoring
(LAMP stack)
monitor these things !
“Photos are loading slowly for some people. Why?”
(microservices)
Any microservices running on c2.4xlarge
instances and PIOPS storage in us-east-1b has a
1/20 chance of running on degraded hardware,
and will take 20x longer to complete for requests
that hit the disk with a blocking call. This
disproportionately impacts people looking at
older archives due to our fanout model.
Canadian users who are using the French
language pack on the iPad running iOS 9, are
hitting a firmware condition which makes it fail
saving to local cache … which is why it FEELS
like photos are loading slowly
Our newest SDK makes db queries
sequentially if the developer has enabled an
optional feature flag. Working as intended;
the reporters all had debug mode enabled.
But flag should be renamed for clarity sake.
wtf do i ‘monitor’ for?!
Monitoring?!?
Problems Symptoms
"I have twenty microservices and a sharded
db and three other data stores across three
regions, and everything seems to be getting a
little bit slower over the past two weeks but
nothing has changed that we know of, and
oddly, latency is usually back to the historical
norm on Tuesdays.
“All twenty app micro services have 10% of
available nodes enter a simultaneous crash
loop cycle, about five times a day, at
unpredictable intervals. They have nothing in
common afaik and it doesn’t seem to impact
the stateful services. It clears up before we
can debug it, every time.”
“Our users can compose their own queries that
we execute server-side, and we don’t surface it
to them when they are accidentally doing full
table scans or even multiple full table scans, so
they blame us.”
Observability
(microservices)
Still More Symptoms
“Several users in Romania and Eastern
Europe are complaining that all push
notifications have been down for them … for
days.”
“Disney is complaining that once in a while,
but not always, they don’t see the photo they
expected to see — they see someone else’s
photo! When they refresh, it’s fixed. Actually,
we’ve had a few other people report this too,
we just didn’t believe them.”
“Sometimes a bot takes off, or an app is
featured on the iTunes store, and it takes us a
long long time to track down which app or user
is generating disproportionate pressure on
shared components of our system (esp
databases). It’s different every time.”
Observability
“We run a platform, and it’s hard to
programmatically distinguish between
problems that users are inflicting themselves
and problems in our own code, since they all
manifest as the same errors or timeouts."
(microservices)
These are all unknown-unknowns
that may have never happened before, or ever happen again
(They are also the overwhelming majority of what you have
to care about for the rest of your life.)
Commandments
well-instrumented
high cardinality
high dimensionality
event-oriented perspective
structured data
software ownership
sampled
tested in prod.
Instrumentation
Get inside the software’s head
Explain it back to yourself/a naive user
The right level of abstraction is key
Wrap all network calls, etc
Open up all black boxes to inspection
Events, not metrics
Can you explain why literally any request has failed?
Cardinality
Context
Structured data
Tracing+events
First-person perspective
UUIDs
db raw queries
normalized queries
comments
firstname, lastname
PID/PPID
app ID
device ID
HTTP header type
build ID
IP:port
shopping cart ID
userid
... etc
Some of these …
might be …
useful …
YA THINK??!
High cardinality will save your ass.
High cardinality
You must be able to break down by 1/millions and
THEN by anything/everything else
High cardinality is not a nice-to-have
‘Platform problems’ are now everybody’s problems
Jumps to an answer, instead of starting with a question
You don’t know what you don’t know.
Very limited utility.
Dashboards
Artifacts of past failures.
Real time. Exploratory.
Aggregation is a one-way trip
Destroying raw events eliminates your ability to ask new questions.
Forever.
Aggregates are the kiss of death
Raw Requests
You can’t hunt needles if your tools don’t handle extreme outliers, aggregation
by arbitrary values in a high-cardinality dimension, super-wide rich context…
Black swans are the norm
you must care about max/min, 99%, 99.9th, 99.99th, 99.999th …
Software engineers spend too much time looking at code in elaborately
falsified environments, and not enough time observing it in the real world.
Tighten feedback loops. Give developers the
observability tooling they need to become fluent in
production and to debug their own systems.
We aren’t “writing code”.
We are “building systems”.
Watch it run in production.
Accept no substitute.
Get used to observing your systems when they AREN’T on fire.
Engineers
Real data
Real users
Real traffic
Real scale
Real concurrency
Real network
Real deploys
Real unpredictabilities.
well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in prod.
Observability for the Future™
• It
Charity Majors
@mipsytipsy

Contenu connexe

Tendances

DevOps Transformations
DevOps TransformationsDevOps Transformations
DevOps TransformationsErnest Mueller
 
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...Skelton Thatcher Consulting Ltd
 
Thinking Architecturally with Nate Schutta
Thinking Architecturally with Nate SchuttaThinking Architecturally with Nate Schutta
Thinking Architecturally with Nate SchuttaVMware Tanzu
 
Software Architecture Anti-Patterns
Software Architecture Anti-PatternsSoftware Architecture Anti-Patterns
Software Architecture Anti-PatternsEduards Sizovs
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks
 
Adaptable Designs for Agile Software Development
Adaptable Designs for Agile  Software DevelopmentAdaptable Designs for Agile  Software Development
Adaptable Designs for Agile Software DevelopmentHayim Makabee
 
Automated softwaretestingmagazine april2013
Automated softwaretestingmagazine april2013Automated softwaretestingmagazine april2013
Automated softwaretestingmagazine april2013drewz lin
 
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...Rosenfeld Media
 
Resource Adaptive Systems
Resource Adaptive SystemsResource Adaptive Systems
Resource Adaptive SystemsTom Mueck
 
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCE
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCEDEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCE
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCEDrupalCamp Kyiv
 
Cloud-Based, Automated Mobile App Testing for the Enterprise
Cloud-Based, Automated Mobile App Testing for the EnterpriseCloud-Based, Automated Mobile App Testing for the Enterprise
Cloud-Based, Automated Mobile App Testing for the EnterpriseTechWell
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringAditi Abhang
 
What are Model-Based Reviews
What are Model-Based ReviewsWhat are Model-Based Reviews
What are Model-Based ReviewsSarahCraig7
 
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...Daniel Bryant
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackDevOpsDays Tel Aviv
 
DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck VictorOps
 
OWASP San Antonio Meeting 10/2/20
OWASP San Antonio Meeting 10/2/20OWASP San Antonio Meeting 10/2/20
OWASP San Antonio Meeting 10/2/20Denim Group
 
DevOps: A Practical Guide
DevOps: A Practical GuideDevOps: A Practical Guide
DevOps: A Practical GuideVictorOps
 
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...Skelton Thatcher Consulting Ltd
 

Tendances (20)

DevOps Transformations
DevOps TransformationsDevOps Transformations
DevOps Transformations
 
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...
Continuous Delivery antipatterns from the wild - Matthew Skelton - IPEXPO Man...
 
Thinking Architecturally with Nate Schutta
Thinking Architecturally with Nate SchuttaThinking Architecturally with Nate Schutta
Thinking Architecturally with Nate Schutta
 
Software Architecture Anti-Patterns
Software Architecture Anti-PatternsSoftware Architecture Anti-Patterns
Software Architecture Anti-Patterns
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
 
Adaptable Designs for Agile Software Development
Adaptable Designs for Agile  Software DevelopmentAdaptable Designs for Agile  Software Development
Adaptable Designs for Agile Software Development
 
Automated softwaretestingmagazine april2013
Automated softwaretestingmagazine april2013Automated softwaretestingmagazine april2013
Automated softwaretestingmagazine april2013
 
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...
Lean Engineering: Engineering for Learning & Experimentation in the Enterpris...
 
Resource Adaptive Systems
Resource Adaptive SystemsResource Adaptive Systems
Resource Adaptive Systems
 
Presentation2
Presentation2Presentation2
Presentation2
 
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCE
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCEDEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCE
DEVOPS & THE DEATH AND REBIRTH OF CHILDHOOD INNOCENCE
 
Cloud-Based, Automated Mobile App Testing for the Enterprise
Cloud-Based, Automated Mobile App Testing for the EnterpriseCloud-Based, Automated Mobile App Testing for the Enterprise
Cloud-Based, Automated Mobile App Testing for the Enterprise
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software Engineering
 
What are Model-Based Reviews
What are Model-Based ReviewsWhat are Model-Based Reviews
What are Model-Based Reviews
 
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...
LMSUG 2015 "The Business Behind Microservices: Organisational, Architectural ...
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
 
DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck
 
OWASP San Antonio Meeting 10/2/20
OWASP San Antonio Meeting 10/2/20OWASP San Antonio Meeting 10/2/20
OWASP San Antonio Meeting 10/2/20
 
DevOps: A Practical Guide
DevOps: A Practical GuideDevOps: A Practical Guide
DevOps: A Practical Guide
 
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...
Continuous Delivery Tools Collaboration Conways Law - QCon London - Matthew S...
 

Similaire à Chaos Engineering Without Observability ... Is Just Chaos

RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedis Labs
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️Ori Pekelman
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
From DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysFrom DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysOri Pekelman
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)ClubHack
 
Showing How Security Has (And Hasn't) Improved, After Ten Years Of Trying
Showing How Security Has (And Hasn't) Improved, After Ten Years Of TryingShowing How Security Has (And Hasn't) Improved, After Ten Years Of Trying
Showing How Security Has (And Hasn't) Improved, After Ten Years Of TryingDan Kaminsky
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018Christophe Rochefolle
 
More fun using Kautilya
More fun using KautilyaMore fun using Kautilya
More fun using KautilyaNikhil Mittal
 
From SLO to GOTY
From SLO to GOTYFrom SLO to GOTY
From SLO to GOTYScyllaDB
 
How to fix bug or defects in software
How to fix bug or defects in software How to fix bug or defects in software
How to fix bug or defects in software Rajasekar Subramanian
 
A People's History of Microservices
A People's History of MicroservicesA People's History of Microservices
A People's History of MicroservicesCamille Fournier
 
Teensy Programming for Everyone
Teensy Programming for EveryoneTeensy Programming for Everyone
Teensy Programming for EveryoneNikhil Mittal
 
BSidesJXN 2017 - Improving Vulnerability Management
BSidesJXN 2017 - Improving Vulnerability ManagementBSidesJXN 2017 - Improving Vulnerability Management
BSidesJXN 2017 - Improving Vulnerability ManagementAndrew McNicol
 
Daniel Lance - What "You've Got Mail" Taught Me About Cyber Security
Daniel Lance - What "You've Got Mail" Taught Me About Cyber SecurityDaniel Lance - What "You've Got Mail" Taught Me About Cyber Security
Daniel Lance - What "You've Got Mail" Taught Me About Cyber SecurityEnergySec
 
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docx
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docxCTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docx
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docxannettsparrow
 
Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019Squadcast Inc
 

Similaire à Chaos Engineering Without Observability ... Is Just Chaos (20)

RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
From DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysFrom DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed Apidays
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)
 
Showing How Security Has (And Hasn't) Improved, After Ten Years Of Trying
Showing How Security Has (And Hasn't) Improved, After Ten Years Of TryingShowing How Security Has (And Hasn't) Improved, After Ten Years Of Trying
Showing How Security Has (And Hasn't) Improved, After Ten Years Of Trying
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
More fun using Kautilya
More fun using KautilyaMore fun using Kautilya
More fun using Kautilya
 
From SLO to GOTY
From SLO to GOTYFrom SLO to GOTY
From SLO to GOTY
 
How to fix bug or defects in software
How to fix bug or defects in software How to fix bug or defects in software
How to fix bug or defects in software
 
A People's History of Microservices
A People's History of MicroservicesA People's History of Microservices
A People's History of Microservices
 
Teensy Programming for Everyone
Teensy Programming for EveryoneTeensy Programming for Everyone
Teensy Programming for Everyone
 
BSidesJXN 2017 - Improving Vulnerability Management
BSidesJXN 2017 - Improving Vulnerability ManagementBSidesJXN 2017 - Improving Vulnerability Management
BSidesJXN 2017 - Improving Vulnerability Management
 
Daniel Lance - What "You've Got Mail" Taught Me About Cyber Security
Daniel Lance - What "You've Got Mail" Taught Me About Cyber SecurityDaniel Lance - What "You've Got Mail" Taught Me About Cyber Security
Daniel Lance - What "You've Got Mail" Taught Me About Cyber Security
 
Broken by design (Danny Fullerton)
Broken by design (Danny Fullerton)Broken by design (Danny Fullerton)
Broken by design (Danny Fullerton)
 
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docx
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docxCTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docx
CTTS CASE STUDY - Milestone 2 Problem AnalysisPage 2-7MILEST.docx
 
Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019
 

Dernier

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Chaos Engineering Without Observability ... Is Just Chaos

  • 1. Why chaos must be practiced ^^ not my talk title!
  • 2. Closing the Loop on Chaos with Observability ^^ actual title
  • 3. "Chaos" is a fancy marketing term for running tests later in the software development lifecycle. This is fine! But we used to call it something else:
  • 4.
  • 5. I blame this guy: Testing in production has gotten a bad rap.
  • 6.
  • 7. how they think we are how we really are
  • 8. Our idea of what the software development lifecycle even looks like is overdue for an upgrade in the era of distributed systems.
  • 9. Deploying code is not a binary switch. Deploying code is a process of increasing your confidence in your code.
  • 16. monitoring => observability known unknowns => unknown unknowns LAMP stack => distributed systems
  • 17. We are all distributed systems engineers now the unknowns outstrip the knowns why does this matter more and more?
  • 18. Distributed systems are particularly hostile to being cloned or imitated (or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
  • 19. Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. this is a black hole for engineering time
  • 20. Operational literacy Is not a nice-to-have
  • 21. Without observability, you don't have "chaos engineering". You just have chaos. So what is observability?
  • 22. Observability is NOT the same as monitoring.
  • 23. @grepory, Monitorama 2016 “Monitoring is dead.” “Monitoring systems have not changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
  • 24. @grepory, 2016 This is a outdated model for complex systems.
  • 25. Observability “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." — wikipedia … translate??!?
  • 26. Observability for software engineers: Can you understand what’s happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output? Can you answer new questions without shipping new code?
  • 27. You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
  • 28. Monitoring Represents the world from the perspective of a third party, and describes the health of the system and/or its components in aggregate. Observability Describes the world from the first-person perspective of the software, executing each request. Software explaining itself from the inside out.
  • 29. We don’t *know* what the questions are, all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.
  • 30. Welcome to distributed systems. it’s probably fine. (it might be fine?)
  • 31. Many catastrophic states exist at any given time. Your system is never entirely ‘up’
  • 32. Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. this is a black hole for engineering time
  • 33. You do it. You have to do it. Do it well.
  • 34. Let’s try some examples! Can you quickly and reliably track down problems like these?
  • 35. The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” Monitoring (LAMP stack) monitor these things !
  • 36. “Photos are loading slowly for some people. Why?” (microservices) Any microservices running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model. Canadian users who are using the French language pack on the iPad running iOS 9, are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But flag should be renamed for clarity sake. wtf do i ‘monitor’ for?! Monitoring?!?
  • 37. Problems Symptoms "I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays. “All twenty app micro services have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.” “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.” Observability (microservices)
  • 38. Still More Symptoms “Several users in Romania and Eastern Europe are complaining that all push notifications have been down for them … for days.” “Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.” Observability “We run a platform, and it’s hard to programmatically distinguish between problems that users are inflicting themselves and problems in our own code, since they all manifest as the same errors or timeouts." (microservices)
  • 39. These are all unknown-unknowns that may have never happened before, or ever happen again (They are also the overwhelming majority of what you have to care about for the rest of your life.)
  • 40. Commandments well-instrumented high cardinality high dimensionality event-oriented perspective structured data software ownership sampled tested in prod.
  • 41. Instrumentation Get inside the software’s head Explain it back to yourself/a naive user The right level of abstraction is key Wrap all network calls, etc Open up all black boxes to inspection
  • 42. Events, not metrics Can you explain why literally any request has failed? Cardinality Context Structured data Tracing+events First-person perspective
  • 43. UUIDs db raw queries normalized queries comments firstname, lastname PID/PPID app ID device ID HTTP header type build ID IP:port shopping cart ID userid ... etc Some of these … might be … useful … YA THINK??! High cardinality will save your ass. High cardinality
  • 44. You must be able to break down by 1/millions and THEN by anything/everything else High cardinality is not a nice-to-have ‘Platform problems’ are now everybody’s problems
  • 45. Jumps to an answer, instead of starting with a question You don’t know what you don’t know. Very limited utility. Dashboards Artifacts of past failures. Real time. Exploratory.
  • 46. Aggregation is a one-way trip Destroying raw events eliminates your ability to ask new questions. Forever. Aggregates are the kiss of death
  • 47. Raw Requests You can’t hunt needles if your tools don’t handle extreme outliers, aggregation by arbitrary values in a high-cardinality dimension, super-wide rich context… Black swans are the norm you must care about max/min, 99%, 99.9th, 99.99th, 99.999th …
  • 48. Software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world. Tighten feedback loops. Give developers the observability tooling they need to become fluent in production and to debug their own systems. We aren’t “writing code”. We are “building systems”.
  • 49. Watch it run in production. Accept no substitute. Get used to observing your systems when they AREN’T on fire. Engineers
  • 50. Real data Real users Real traffic Real scale Real concurrency Real network Real deploys Real unpredictabilities.
  • 51.