This document provides an overview of a session on security chaos engineering. The session will cover combating complexity in software, chaos engineering, resilience engineering and security, security chaos engineering, open source chaos tools, and a product demo from Verica.
The presenters from Verica will be Casey Rosenthal, CEO and founder, and Aaron Rinehart, CTO and founder. Casey Rosenthal helped create the discipline of chaos engineering at Netflix and built their chaos automation platform. Aaron Rinehart has experience leading security engineering strategies and pioneered the area of security chaos engineering.
Chaos engineering involves experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions. It is used to combat the increasing complexity
5. Verica Team
@aaronrinehart @verica_io #chaosengineering
Casey Rosenthal, CEO, Founder
● Built and managed High Performance
teams (including Chaos Engineering
Team) at Netflix
● Known for creating the discipline of Chaos
Engineering
● Built the Chaos Automation Platform
(ChAP) at Netflix, the most sophisticated
implementation of advanced chaos
experimentation
● Written multiple books on Chaos
Engineering (O’Reilly)
6. Verica Team
Aaron Rinehart, CTO, Founder
● Former Chief Security Architect
@UnitedHealth responsible for security
engineering strategy
● Led the DevOps and Open Source
Transformation at UnitedHealth Group
● Former (DOD, NASA, DHS, CollegeBoard)
● Frequent speaker and author on Chaos
Engineering & Security
● Pioneer behind Security Chaos Engineering
● Led ChaoSlingr team at UnitedHealth
27. After a few months….
Hard Coded Passwords
Identity Conflicts
Lead Software Engineer finds a new job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
Cloud Provider API Outage
WAF Outage -> Disabled
Scalability Issues
Network is Unreliable
Autoscaling Keeps Breaking
Large Customer Outage
Delayed Features
DNS Resolution Errors
Expired Certificate
Regulatory Audit
Rolling Sev1 Outage on Portal
Code Freeze
28. Years?….
Hard Coded Passwords
Identity Conflicts
Lead Software Engineer finds a new job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
300 Microservices Δ-> 4000 Microservices
Cloud Provider API Outage
WAF Outage -> Disabled
Firewall Outage -> Disabled
Scalability Issues
Network is Unreliable
Autoscaling Keeps Breaking
Large Customer Outage
Delayed Features
DNS Resolution Errors
Expired Certificate
Regulatory Audit
Rolling Sev1 Outages on Portal
Code Freeze
Merger with competitor
Misconfigured FW Rule Outage
Database Outage
Portal Retry Storm Outage
Orphaned Documentation
Corporate Reorg
Budget Freeze
Outsource overseas development
Exposed Secrets on GitHub
Migration to New CSP
Upgrade to Java SE 12
53. Chaos Engineering
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s ability to withstand turbulent
conditions”
57. “[Chaos Engineering is] empirical
rather than formal. We don’t use
models to understand what the
system should do. We run
experiments to learn what it does.”
- Michael T. Nygard
60. Properties of a Chaos Experiment
● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
Game Days allow you to perform
experiments with maximum visibility
and coverage from component
owners, support teams and product.
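As a rough illustration of how the six properties above might map onto code, here is a minimal Python sketch. The `ChaosExperiment` class, its field names, and the simulated replica counter are all hypothetical and are not taken from the talk or from any chaos tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """Hypothetical container for the six properties listed on the slide."""
    hypothesis: str                        # formulate hypothesis
    steady_state: Callable[[], bool]       # define steady state: True while healthy
    inject: Callable[[], None]             # outline methodology: how the condition is introduced
    rollback: Callable[[], None]           # readily abortable: undo the injection
    blast_radius: List[str]                # identify blast radius: targets that may be touched
    log: List[str] = field(default_factory=list)  # observability: record of what happened

    def run(self) -> bool:
        # Never start an experiment on a system that is already unhealthy.
        if not self.steady_state():
            self.log.append("aborted: system not in steady state before injection")
            return False
        self.log.append(f"injecting into {self.blast_radius}")
        try:
            self.inject()
            held = self.steady_state()
        finally:
            # Roll back even if the injection itself raised an exception.
            self.rollback()
            self.log.append("rolled back")
        self.log.append("hypothesis held" if held else "hypothesis disproved")
        return held


# Simulated system: three replicas, steady state means at least two are healthy.
state = {"healthy_replicas": 3}
experiment = ChaosExperiment(
    hypothesis="Service stays available if one replica is lost",
    steady_state=lambda: state["healthy_replicas"] >= 2,
    inject=lambda: state.update(healthy_replicas=state["healthy_replicas"] - 1),
    rollback=lambda: state.update(healthy_replicas=3),
    blast_radius=["replica-1"],
)
result = experiment.run()
```

The rollback sits in a `finally` block so that the experiment stays abortable even when the injection step fails partway through.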
61. Developing a
Learning Culture
around Failure
● Safety as part of security
● Building safety margin
into systems
● Replace blame culture with
learning culture
● Telemetry, experimentation,
and instrumentation
62. Chaos Engineering Maturity
Despite what has been popularized on online
tech blogs, you do not start off performing Chaos
Engineering on live production systems. There is
a maturity ramp to getting there.
● Validate Chaos Tools in
Lower Environment
● Develop Competency &
Confidence in Tooling
● Dry-run experiments
Warning: Still be careful in Non-Prod environments; you will be surprised by the
hazards that lie in Non-Prod. (Kafka Story)
63. Chaos Monkey Story
● During Business Hours
● Born out of Netflix Cloud
Transformation
● Put well defined problems
in front of engineers.
● Terminate VMs on
Random VPC Instances
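The two mechanical ideas on this slide, terminating a random instance and doing so only during business hours, can be sketched as follows. This is an assumption-laden illustration, not Netflix's Chaos Monkey: the function names, the 09:00 to 17:00 weekday window, and the instance IDs are all made up, and the sketch only selects a victim rather than calling any cloud API.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-to-17 window is an assumed default, not a value from the talk.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: List[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Choose one instance at random, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


fleet = ["i-aaa", "i-bbb", "i-ccc"]           # hypothetical instance IDs
rng = random.Random(42)                        # seeded so runs are repeatable
weekday_pick = pick_victim(fleet, datetime(2024, 1, 10, 11, 0), rng)   # a Wednesday
weekend_pick = pick_victim(fleet, datetime(2024, 1, 13, 11, 0), rng)   # a Saturday
```

Restricting the selection to business hours means engineers are at their desks when the well-defined problem lands in front of them.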
64. Chaos Engineering Pro-Tips
● Don’t perform an experiment
when you expect it to fail
● Auto Remediation of
Experiments will end in a
fiery Hell!
● Transparency is a Must
● Webcast & Record
GameDays
● The process of creating the
experiment and sharing the
learnings is the
highest-value of Chaos
Engineering
● Chaos Engineering Goal:
Sharing Team Mental Models
is of High Importance
Reference: Nora Jones 8 Traps of Chaos Engineering
65. Chaos Pitfalls: Auto-Remediation
“…an operator will only be able to generate successful new
strategies for unusual situations if he has an adequate
knowledge of the process.”
“Long term knowledge develops only through use and
feedback about its effectiveness.”
— Lisanne Bainbridge, The Ironies of Automation (1983)
Bring context or chase down
vulnerabilities for the service
owner instead of automating
fixes as this leads to a Fiery
Hell!
Reference: Nora Jones 8 Traps of Chaos Engineering
66. Chaos Pitfalls: Breaking Things on Purpose
“I'm pretty sure
I won’t have a job
very long if I
break things on
purpose all day.”
-Casey Rosenthal
The purpose of Chaos Engineering is NOT
to “Break Things on Purpose”.
If anything we are trying to “Fix them on
Purpose”!
Reference: Nora Jones 8 Traps of Chaos Engineering
67. Chaos Engineering Operational Models
● Organization-Wide Chaos Engineering
Team
● Provide a Chaos Engineering Solution for
Teams to Consume
● Central Team runs periodic Chaos
Experiments as a Service
● Provide SREs with Chaos Toolsets
“At Netflix Chaos Engineering
was always meant to be a
tools practice for SREs”
- Casey Rosenthal
68. GameDay Exercises
● 2-4 hrs in Length
● Diverse Cross Functional Group of
Engineers
● Focused on Increasing Resilience
● Used for Manual Chaos
Engineering
● Great Introduction to Chaos
Engineering
Recommendations
● Use GameDays for New Chaos
Experiments
● Use GameDays for Initial
Experiment Deployment on New
Targets
● Use GameDays for Proving New
Chaos Engineering Tools
● Get Everyone in the Same Location
69. Experiment Lifecycle
Step 1: Perform a GameDay Exercise
Plan, schedule, and run a GameDay exercise for new experiments.
Step 2: Validate Experiment Hypothesis
Goal: Validate that the experiment ran successfully and that the results are credible.
Step 3: Remediate Findings & Repeat Experiment
If the hypothesis failed for the experiment, develop and remediate a list of findings. Once remediated, repeat the experiment.
Step 4: Once Successful: Automate Experiment
Once the experiment has been proved to run successfully, validating your hypothesis, you can automate it to run periodically.
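The lifecycle on this slide is essentially a loop: run, remediate findings, repeat, and automate only once a run comes back clean. A minimal Python sketch follows; the function name, the `max_rounds` cap, and the simulated findings are assumptions added for illustration.

```python
from typing import Callable, List


def experiment_lifecycle(run_experiment: Callable[[], List[str]],
                         remediate: Callable[[List[str]], None],
                         max_rounds: int = 5) -> str:
    """Repeat the experiment until it yields no findings, then mark it for automation.

    max_rounds is an assumed safety cap so the loop always terminates.
    """
    for round_number in range(1, max_rounds + 1):
        findings = run_experiment()
        if not findings:
            return f"automate (hypothesis validated in round {round_number})"
        remediate(findings)
    return "keep running manually (findings remain)"


# Simulated GameDay: two open issues, and each remediation pass fixes one of them.
open_issues = ["firewall rule out of sync", "missing alert"]


def run_experiment() -> List[str]:
    return list(open_issues)


def remediate(findings: List[str]) -> None:
    open_issues.remove(findings[0])


outcome = experiment_lifecycle(run_experiment, remediate)
```

In this simulation the first two rounds each surface findings, and the third round is clean, so the experiment becomes a candidate for automation.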
70. GameDays: The Basics
● Plan & Organize GameDay Exercise
● Develop & Evaluate Chaos Experiment
● Execute Live GameDay Operations
● Conduct Pre-Incident Review
● Automate & Evangelize Results & Take Action
72. Security Chaos Engineering is...
“The discipline of instrumentation, identification,
and remediation of failure within security controls
through proactive experimentation to build
confidence in the system's ability to defend
against malicious conditions in production.”
77. Security Chaos Engineering: Is NOT
● Red Teaming
● Penetration Testing
● Adversary Based
● Focused on Attacks
● The process of creating the
experiment and sharing the
learnings is the
highest-value of Chaos
Engineering
● Chaos Engineering Goal:
Sharing Team Mental Models
is of High Importance
81. Applications
● Validate Runbooks
● Measure Team Skills
● Determine Control
Effectiveness
● Learn new insights
● Transfer knowledge
● Build a learning culture
“Security Chaos Engineering
provides a new doorway for
security value but what I like
about it most is that it keeps
the incident response team
sharp.”
- CISO, Fortune 5 Healthcare Company
90. Incidents Drive Design Changes
Incidents and breaches
are hugely significant in
creating change post
facto. They tend to shape
the designs and
architecture of tomorrow's
solutions. - @allspaw
92. Stop looking for better
answers and start asking
better questions.
- John Allspaw
93. What is the system actually doing?
Has it done this before?
Why is it behaving that way?
What is it supposed to do next?
How did it get into this state?
99. Improve Value of
Security Log Data
● How valuable is your log
data?
● When do we ever assess
this?
● We don't know our logs
are shit until we
absolutely need them
● Proactively determine
quality of log data
around experiments
100. Verify Detection of Disabled Log Services
“Security log pipelines are
essential to the success of an
information security program.”
- Prima Virani, Pinterest Security
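One way to picture this experiment: disable a log service and check whether anything notices within a time budget. The sketch below simulates that with an in-memory stand-in. `FakeLogService`, `detected_within`, and the timeouts are hypothetical names and values chosen for illustration; a real experiment would query the actual alerting pipeline instead.

```python
import time
from typing import Callable


def detected_within(alert_fired: Callable[[], bool], timeout_s: float,
                    poll_s: float = 0.01) -> bool:
    """Poll alert_fired() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if alert_fired():
            return True
        time.sleep(poll_s)
    return alert_fired()


class FakeLogService:
    """Stand-in for a real logging service plus its monitoring; purely illustrative."""

    def __init__(self, monitored: bool) -> None:
        self.enabled = True
        self.monitored = monitored

    def disable(self) -> None:
        self.enabled = False

    def alert_fired(self) -> bool:
        # An alert exists only if someone is watching and the service is down.
        return self.monitored and not self.enabled


monitored = FakeLogService(monitored=True)
monitored.disable()
unmonitored = FakeLogService(monitored=False)
unmonitored.disable()

caught = detected_within(monitored.alert_fired, timeout_s=0.1)
missed = detected_within(unmonitored.alert_fired, timeout_s=0.05)
```

The unmonitored case is the interesting one: the log service is off and nothing reports it, which is the gap this kind of experiment is meant to surface before an incident does.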
103. More Experiment Examples
● Internet exposed
Kubernetes API
● Unauthorized Bad
Container Repo
● Unencrypted S3 Bucket
● Disable MFA
● Bad AWS Automated Block
Rule
● Software Secret Clear
Text Disclosure
● Permission collision in
Shared IAM Role Policy
● Disabled Service Event
Logging
● Introduce Latency on
Security Controls
● API Gateway Shutdown
106. How does Security Chaos Engineering
differ from Red Teaming, Purple
Teaming or Pen Testing?
Security
Crayons
107. Differences in Scope, Focus, and Method
● Distributed Systems Focus
● Goal: Experimentation
● Human Factors focused
● Small Isolated Scope
● Focus on Cascading Events
● Performed by Mixed Engineering Teams
in GameDays
● During business hours
112. Chaos ROI
● Metrics & Measurements
● Business Outcome-Based KPIs
before Engineering Metrics
● Do not make the Case for the
Outage that Never Happened
120. ChaoSlingr Product Features
• ChatOps Integration
• Configuration-as-Code
• Example Code & Open Framework
• Serverless App in AWS
• 100% Native AWS
• Configurable Operational Mode &
Frequency
• Opt-In | Opt-Out Model
121. Hypothesis: If someone accidentally or
maliciously introduced a misconfigured
port then we would immediately detect,
block, and alert on the event.
Diagram labels: Misconfigured Port Injection, Firewall?, Config Mgmt?, Log data?, Alert SOC?, IR Triage, Wait...
122. Result: Hypothesis disproved. The firewall did not detect
or block the change on all instances. The standard Port
AAA security policy was out of sync on the Portal Team
instances. The port change did not trigger an alert, and
log data indicated a successful change audit.
However, we unexpectedly learned that the configuration
management tool caught the change and alerted the SOC.
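The port-injection example compares what the hypothesis predicted with what was actually observed, and also notes a control that helped unexpectedly. A small Python sketch of that comparison follows. The `evaluate` function and the dictionary keys are hypothetical; the boolean values are a simplified encoding of the result described on the slide, where `False` for the firewall means "not on all instances".

```python
from typing import Dict, List, Tuple


def evaluate(expected: Dict[str, bool],
             observed: Dict[str, bool]) -> Tuple[bool, List[str], List[str]]:
    """Return (hypothesis_held, failed_expectations, unexpected_signals)."""
    # Expectations that were not met by what we observed.
    failed = sorted(k for k, want in expected.items() if observed.get(k) != want)
    # Positive signals we observed but never predicted.
    surprises = sorted(k for k, seen in observed.items() if k not in expected and seen)
    return (not failed, failed, surprises)


# Hypothesis: a misconfigured port is immediately detected, blocked, and alerted on.
expected = {
    "firewall_detected": True,
    "firewall_blocked": True,
    "firewall_alerted": True,
}

# Simplified encoding of the reported result (False = not on all instances).
observed = {
    "firewall_detected": False,
    "firewall_blocked": False,
    "firewall_alerted": False,
    "config_mgmt_alerted_soc": True,   # the unexpected finding
}

held, failed, surprises = evaluate(expected, observed)
```

Keeping unexpected signals separate from failed expectations mirrors the slide: the hypothesis was disproved, yet the experiment still revealed a control that worked where nobody had predicted it would.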