DevOps Roadtrip Minneapolis

JASON HAND |
DevOps Evangelist
• Holds over 15 years of experience as a
developer, system administrator, and
support specialist
• Fully immersed in the world of agile
development and the DevOps
movement with Colorado tech
startups
#DevOpsRoadTrip


A little about VictorOps…
VictorOps is the real-time incident
management platform that combines the
power of people and data to embolden
DevOps pros to handle incidents as they
occur.

“How Organizations Process Information”
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/

Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road

Ordered
• Obvious: cause and effect obvious from experience (Sense – Categorize – Respond)
• Complicated: cause and effect requires analysis (Sense – Analyze – Respond)
Unordered
• Complex: cause and effect only apparent in hindsight (Probe – Sense – Respond)
• Chaotic: cause and effect cannot be related (Act – Sense – Respond)

The systems we engineer, maintain, and improve are
Complicated (known unknowns)
.. or ..
Complex (unknown unknowns)

What are the..
Contributing
Factors?

Identifying a “root cause” helps us to …
Put it back
how it was

What we really want is to..
Continuously
Improve

Incident Management Maturity
Chart labels: Time To Repair (TTR), Continuous Improvement Efforts
Levels: Reactive (chaotic), Tactical (obvious), Integrated (complicated), Strategic (complex)

Reactive
(chaotic)
✓No automation
✓No operational stack awareness
✓Poor collaboration between teams (Dev & Ops)
✓Documentation not available
✓No standardized communication
✓High focus on consistent continuous learning

Tactical
(obvious)
✓Uses a NOC
✓Some monitoring & alerting instrumentation
✓Collaboration in crisis
✓"Mission critical" processes are available
✓Understood crisis communication protocols
✓Remediation data available to IT Operations

Integrated
(complicated)
✓Team rotations, paging policies, role hunting
✓Continuous improvement of key health indicators
✓Technical collaboration across all incidents
✓Docs up to date and easily accessible
✓Consistent real-time communication practices

Strategic
(complex)
✓Automated docs and remediation
✓Actionable Alerts with full context
✓High collaboration among all teams
✓Documentation part of remediation
✓Targeted, proactive crisis comms
✓High focus on continuous learning

“Six Trends Shape DevOps Adoption, Q1 2015”
Forrester report
• The Foundation For Success Is In Place . . . Mostly
• Fear Of Failure Will Hamper Advancement
• Monitoring And Analytics Strategies Must Make A Big Leap Forward
• The Focus On Customer Experience Is Not Second Nature . . . Yet
• Change And Release Processes Are Not Delivering Business Needs
• You Must Prioritize And Focus Sourcing Strategies

Automation
Awareness
Collaboration
Documentation
User Empathy
Learning

Failure not seen as opportunity to learn
Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report

Awareness
http://blog.vmware.com

Single Source Of Truth Lacking In Many
Orgs – 95% only most of the time or less
Source: April 15, 2015 “Six Trends That Will Shape DevOps Adoption”, Forrester report

Teams siloed throughout life cycle

User Empathy
https://open.buffer.com/wp-content/uploads/2015/12/empathy3.jpg

IT teams aren’t measured on customer
experience goals.

Automation
http://thelifedesignproject.com/wp-content/uploads/2009/09/373881476_217d24ef6d.jpg

Delays in notifications lead to customers
finding the problem first

Documentation
http://blog.vmware.com

Reduce MTTR
State of DevOps Report (2015)
– by Puppet Labs

Automation
Awareness
Collaboration
Documentation
User Empathy
Learning

Bridget Kromhout | Pivotal - Cloud Foundry
Principal Technologist
• Bridget Kromhout is a Principal Technologist for Cloud Foundry at
Pivotal.
• After years as an operations engineer (most recently at DramaFever),
she traded in oncall for more travel.
• A frequent speaker at tech conferences, she helps organize tech
meetups at home in Minneapolis, serves on the program committee for
Velocity, and acts as a global core organizer for devopsdays.
• She podcasts at Arrested DevOps, occasionally blogs at
bridgetkromhout.com, and is active in a Twitterverse near you.

@bridgetkromhout
Bridget Kromhout
lives: Minneapolis, Minnesota
works: Pivotal
podcasts: Arrested DevOps
organizes: devopsdays

Traded oncall… for more travel (similar effect on sleep)

“…measuring value, throughput,
and performance…
revenue rather than cost”
The Art of Monitoring (2016)
James Turnbull
artofmonitoring.com

Image credit: James Ernest

The Art of Monitoring (2016)
James Turnbull
Monitoring containers
artofmonitoring.com

“Almost every task run
under Borg contains a
built-in HTTP server that
publishes information
about the health of the
task and thousands of
performance metrics”
Large-scale cluster management at Google with Borg - Verma et al. 2015
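The pattern described in the Borg quote can be sketched with Python's standard library. This is only an illustration of "a built-in HTTP server that publishes health and metrics", not how Borg itself does it; the `/healthz` path, the `METRICS` counter, and the helper names are assumptions made for the example.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
# Hypothetical in-process counters; a real task would update these as it works.
METRICS = {"requests_served": 0}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves health and metrics as JSON on /healthz (the path is an assumption)."""

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        body = json.dumps({
            "status": "ok",
            "uptime_seconds": round(time.time() - START_TIME, 3),
            "metrics": METRICS,
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_health_server(port=0):
    """Start the server in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_health(server):
    """Query the running server once and return the decoded JSON."""
    port = server.server_address[1]
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz") as resp:
        return json.loads(resp.read())
```

A monitoring system would poll `fetch_health` style endpoints on every task rather than waiting for a human to notice a problem.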

The Art of Monitoring (2016) — James Turnbull
Monitoring Maturity Model
artofmonitoring.com

Image credit: Wikipedia
“Any organization that designs a system…
will produce a design
whose structure is a copy of
the organization's
communication
structure.”
Mel Conway

silos are for grain

three Friday mornings in Minneapolis
removed restored

Andy Domeier | SPS Commerce
Director System Operations
• Andy has been in Technology Operations leadership with SPS
Commerce for the past 11 years.
• Andy spends many mental cycles collaborating to find
effective patterns for monitoring and operating complex
changing systems.
• Andy also spends time solving for priority organization and
alignment and the organization of knowledge.

HOW EFFECTIVE IS YOUR INCIDENT RESPONSE?
Andy Domeier
@ajdomie

agenda
Styles of Incident Response
Healthy Incident Response
Tips & Tricks
© SPS COMMERCE

STYLE #1 - DENIAL
That’s not possible!
No Wai!

STYLE #2 - CONFUSED
Ummmm
Hmmmm
(crickets)
How is this possible?

STYLE #3 - LAZY
It’s the Database
It’s the Network
Just Restart It

STYLE #4 - ANGRY
Why did you do that?
What did you change?
#%$! &#!^ #$@

STYLE #5 - FIREDRILL
OMG WTF
FML
“Buckshot”

LET’S GET REAL

HOW DO WE KNOW THERE IS A FIRE?
• Good way – Alarm
• Bad way – Humans

WHO PUTS THE FIRE OUT?
• If you catch it right away?
• If it’s out of control?

INCIDENT RESPONSE TEAM

WHAT’S YOUR SYSTEM?
• #monoliths
– Familiar, All or None, Less Agility
• #microservices
– Complex, semi-isolated, Agile

WHERE’S YOUR DATA?
• Monitoring Tools
– Base IT
– Logging
– APM
– Metrics

RESPOND IN ISOLATION

RESPOND AS A TEAM
• “Hey Danielle, it looks like the site is acting up, and the only outlier I have found so far
is a CPU spike on the DB. Can you help me investigate this a bit more?”

#CHATOPS
• Share Screens & Visualize Data
• Display Alerts w/ Integrations
• Automatic History Retention
• Enables Collaboration for All
• And my Favorite…

TIPS FOR HEALTHY INCIDENT RESPONSE
• Make health data as transparent and central as possible
– Helps the Team “Know where the fire is”
• Share data in chat
– Use the metrics from your tools
• “Be Transparent”
• Team Response Nurtures Team Follow Up
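The "share data in chat" tip could look roughly like the sketch below. It assumes a chat tool that accepts a JSON webhook with a `text` field; that body shape, the webhook URL, and both function names are assumptions for illustration, so check the documentation of whichever chat tool is in use.

```python
import json
import urllib.request


def format_metric_message(metric, value, unit, source):
    """Build a one-line, human-readable chat message for a metric reading."""
    return f"[{source}] {metric} = {value}{unit}"


def build_webhook_request(webhook_url, text):
    """Build (but do not send) a JSON POST for a chat webhook.

    The {"text": ...} body is an assumption, not a specific vendor's API.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Keeping message formatting separate from delivery makes it easy to post the same metric line from several monitoring tools.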

TIPS FOR HEALTHY INCIDENT RESPONSE
• Always tie things back to the customer
– Simple but often overlooked
– Opportunity to link the team to the business

Ben Overmyer | Star Tribune
Digital Manager, Operations
• Ben is the Digital Manager of Operations at the Minneapolis Star
Tribune.
• He has over a decade of experience as a back end software engineer,
two years of experience as a dedicated operations engineer, and great
enthusiasm for the DevOps culture.
• Besides the Star Tribune, he’s worked for an eclectic mix of
organizations, including the USGS, a game company in New Zealand,
and a beauty products marketing company.
• When not hacking on servers, apps, or people, he acts as art director
and author for a tabletop gaming company.

EVOLVING INCIDENT
MANAGEMENT
STAR TRIBUNE DEVOPS

IN THE BEGINNING
▸ Forwarded phone line
▸ An on-call list maintained in a wiki
▸ Every week, manually change to the next person on the list
▸ …and overrides or substitutions?
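The weekly hand-off and the override problem above can be expressed as a small calculation instead of a wiki edit. This is a minimal sketch, not the Star Tribune's actual process; the roster format, the fixed starting Monday, and the `on_call_for` name are assumptions.

```python
from datetime import date


def on_call_for(day, roster, overrides=None):
    """Return who is on call on `day`.

    roster: ordered list of names, rotated once per week (weeks start on Monday).
    overrides: optional {date: name} substitutions that win over the rotation.
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    # Count whole weeks since a fixed Monday so nobody has to edit the list by hand.
    epoch = date(2016, 1, 4)  # an arbitrary Monday, chosen for illustration
    weeks_elapsed = (day - epoch).days // 7
    return roster[weeks_elapsed % len(roster)]
```

For example, with a roster of three people, the person on call changes every Monday and wraps around after three weeks, while a single-day override replaces only that day.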

EARLY MONITORING
▸ Zabbix monitoring set up for a handful of causes
▸ Zabbix alerts sent via email to a distribution list
▸ Sometimes no one would see these alerts until hours or, in
rare cases, days later

THE PAIN POINTS
▸ Manual maintenance of the calling tree data
▸ Manual rotation of the support phone line forwarding
▸ Poor documentation of incident life cycles
▸ No sense of incident frequency beyond “this was a bad
couple weeks”
▸ If the on-call person didn’t respond, there was no
escalation process other than calling the head of Digital

ADOPTING VICTOROPS
▸ Automated rotations
▸ Multiple teams
▸ Automatic escalation processes
▸ Easy schedule overrides and changes
▸ APIs for programmatic incident interaction
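An automatic escalation process boils down to a list of "if nobody has acknowledged after N minutes, also page X" steps. The sketch below is a generic illustration and not VictorOps' implementation; the policy format and the `who_to_page` name are assumptions.

```python
def who_to_page(minutes_unacknowledged, policy):
    """Return everyone who should have been paged by now, in escalation order.

    policy: list of (delay_minutes, contact) steps. A delay of 0 means
    "page immediately"; later steps apply only while the alert stays
    unacknowledged.
    """
    return [contact for delay, contact in sorted(policy)
            if minutes_unacknowledged >= delay]
```

With a policy such as `[(0, "on-call"), (15, "secondary"), (30, "head of Digital")]`, calling the head of Digital becomes the last step of a defined process instead of the only fallback.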

THE NATURE OF ALERTS
▸ OK, we can set up programmatic alerts. Now what?
▸ Integrating Zabbix, New Relic, and CloudWatch
▸ Discovering alert floods
▸ Move to alerting on symptoms, not causes
▸ …but still monitoring causes
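One common way to tame an alert flood is to collapse repeated alerts for the same entity into a single incident within a time window. The sketch below illustrates that idea only; the tuple format, the five-minute default window, and the `collapse_flood` name are assumptions, not something described in the talk.

```python
def collapse_flood(alerts, window_seconds=300):
    """Collapse repeated alerts for the same entity into one incident per window.

    alerts: iterable of (timestamp_seconds, entity_id, message) tuples.
    Returns a list of dicts holding the first message and a repeat count.
    """
    incidents = []
    open_by_entity = {}
    for ts, entity, message in sorted(alerts):
        current = open_by_entity.get(entity)
        if current is not None and ts - current["first_seen"] <= window_seconds:
            current["count"] += 1  # same flood: count it, do not page again
        else:
            current = {"entity_id": entity, "first_seen": ts,
                       "message": message, "count": 1}
            open_by_entity[entity] = current
            incidents.append(current)
    return incidents
```

The cause-level alerts are still recorded in the repeat count, which fits the idea of paging on symptoms while continuing to monitor causes.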

THE SPIDEY-SENSE FACTOR
▸ Humans are good at catching certain kinds of problems
▸ “This doesn’t feel right” and gaps in monitoring
▸ The evolution of the Sev incident system

THE STATUS SITE: MANUAL ALERTING FOR NON-TECH USERS
▸ Want to let certain non-tech users report Sev incidents
▸ Initially just a password-protected form
▸ Uses the VictorOps alert ingestion API for triggering alerts
▸ Uses the VictorOps public API for fetching information
▸ Each Sev alert is created with its own entity_id
▸ Lets admin users share status updates
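A form handler for the status site might build its alert roughly as follows. This is a hedged sketch: the field names (`message_type`, `entity_id`, `entity_display_name`, `state_message`, `state_start_time`) are believed to match the VictorOps REST endpoint integration but should be verified against its current documentation, the endpoint URL is deliberately left as a caller-supplied parameter, and the `sev<N>-<uuid>` naming scheme and function names are hypothetical.

```python
import json
import time
import urllib.request
import uuid


def build_sev_alert(severity, summary, reporter):
    """Build an alert body with its own entity_id, one per reported Sev incident."""
    return {
        "message_type": "CRITICAL",
        # Hypothetical naming scheme: a fresh id so each report is its own incident.
        "entity_id": f"sev{severity}-{uuid.uuid4()}",
        "entity_display_name": f"Sev {severity}: {summary}",
        "state_message": f"Reported via status site by {reporter}",
        "state_start_time": int(time.time()),
    }


def build_alert_request(endpoint_url, alert):
    """Build (but do not send) the JSON POST; endpoint_url comes from your integration settings."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(alert).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Giving every report a unique `entity_id` keeps two unrelated Sev reports from being merged into one incident.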

MONTHLY INCIDENT REPORTING
▸ Monthly reports include a list of all Sev incidents, when
they started, when they ended, what the alert text was, and
what the resolution was
▸ Combine automated and chat messages in VictorOps with
data gathered from other sources
▸ Present this data as automatically as possible in the Status
Site
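The monthly report described above is essentially a filter and a format step over incident records. The sketch below assumes each incident is a dict with `started`, `ended`, `alert_text`, and `resolution` keys; those key names, the output layout, and the sample record are invented for illustration.

```python
from datetime import datetime

# Made-up sample record, only to show the expected shape of the input.
EXAMPLE_INCIDENTS = [
    {
        "started": datetime(2016, 5, 3, 9, 0),
        "ended": datetime(2016, 5, 3, 9, 45),
        "alert_text": "Sev 1: site down",
        "resolution": "Restarted cache",
    },
]


def monthly_report(incidents, year, month):
    """Return one report line per Sev incident that started in the given month."""
    lines = []
    for inc in sorted(incidents, key=lambda i: i["started"]):
        if (inc["started"].year, inc["started"].month) != (year, month):
            continue
        minutes = int((inc["ended"] - inc["started"]).total_seconds() // 60)
        lines.append(
            f'{inc["started"]:%Y-%m-%d %H:%M} to {inc["ended"]:%Y-%m-%d %H:%M} '
            f'({minutes} min) | {inc["alert_text"]} | {inc["resolution"]}'
        )
    return lines
```

Because the duration is computed from the start and end times, the same records can also feed a time-to-repair figure for the month.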

NEXT STEPS
▸ Integration of summarized data collected from Datadog/
CloudWatch/etc. into incident reporting
▸ Reports for users that shouldn’t have access to VictorOps
▸ Integration of the Status Site into Slack

▸ @bovermyer
▸ benovermyer.com

Breakout Sessions
◻ ChatOps - Jason Hand
◻ Leveraging Data to Establish a Healthy Culture - Andy Domeier
◻ Monitoring and Microservices – Bridget Kromhout
◻ Blameless Culture – Heather Mickman
◻ Devs vs. Ops On-Call, How and Why to Get started – Ben Overmyer

Heather Mickman | Target
Senior Director of Platform Engineering
• Heather Mickman is the Senior Director of Platform Engineering at Target and a
DevOps enthusiast.
• Heather has 20+ years of IT experience in various roles and industries including
retail, transportation, and high tech manufacturing.
• She is currently working on building the platforms used by software engineers
at Target including a multi-provider cloud platform, API Gateway, telemetry
tooling, data stores, and messaging.
• She has a passion for technology, building high performing teams, driving a
culture of innovation, and having fun along the way. Heather lives in
Minneapolis with her 2 sons and mini dachshund.

Incident Management Maturity
Chart labels: Time To Repair (TTR), Continuous Improvement Efforts

Reactive (0 – 4)
(chaotic)
✓ No automation
✓ No operational stack awareness
✓ Poor collaboration between teams (Dev & Ops)
✓ Documentation not available
✓ No standardized communication
✓ High focus on consistent continuous learning

Tactical (5 – 9)
(obvious)
✓ Uses a NOC
✓ Some monitoring & alerting instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are available
✓ Understood crisis communication protocols
✓ Remediation data available to IT Operations

Integrated (10 – 14)
(complicated)
✓ Team rotations, paging policies, role hunting
✓ Continuous improvement of key health indicators
✓ Technical collaboration across all incidents
✓ Docs up to date and easily accessible
✓ Consistent real-time communication practices

Strategic (15 – 18)
(complex)
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
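The score ranges on this slide map directly onto a lookup. The sketch below encodes only the four ranges shown (0 to 4, 5 to 9, 10 to 14, 15 to 18); how the score itself is obtained is not stated in the slides, so the function takes it as given. The `maturity_level` name is an assumption.

```python
# Score ranges as shown on the slide: (low, high, level).
MATURITY_LEVELS = [
    (0, 4, "Reactive (chaotic)"),
    (5, 9, "Tactical (obvious)"),
    (10, 14, "Integrated (complicated)"),
    (15, 18, "Strategic (complex)"),
]


def maturity_level(score):
    """Map an incident management maturity score (0 to 18) to its level."""
    for low, high, name in MATURITY_LEVELS:
        if low <= score <= high:
            return name
    raise ValueError(f"score must be between 0 and 18, got {score}")
```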

DENVER - SEATTLE - SAN FRANCISCO - MINNEAPOLIS - NEW YORK CITY
