We’ve hit the road and rounded up local industry leaders Heather Mickman, Bridget Kromhout, Andy Domeier, and Ben Overmyer for this incredible half-day event. These speakers, from Target, Pivotal, Minneapolis Star Tribune, and SPS Commerce, shared real-life DevOps implementation stories and suggestions to help you on your DevOps journey.
DevOps Roadtrip Minneapolis
2. JASON HAND |
DevOps Evangelist
• Holds over 15 years of experience as a
developer, system administrator, and
support specialist
• Fully immersed in the world of agile
development and the DevOps
movement with Colorado tech
startups
#DevOpsRoadTrip
5. A little about VictorOps…
VictorOps is the real-time incident
management platform that combines the
power of people and data to embolden
DevOps pros to handle incidents as they
occur.
19. “How Organizations Process Information”
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/
22. Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road
36. Incident Management Maturity
Chart labels: TimeToRepair (TTR); Continuous Improvement Efforts
Reactive (chaotic)
✓ No automation
✓ No operational stack awareness
✓ Poor collaboration between teams (Dev & Ops)
✓ Documentation not available
✓ No standardized communication
✓ High focus on consistent continuous learning
Tactical (obvious)
✓ Uses a NOC
✓ Some monitoring & alerting instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are available
✓ Understood crisis communication protocols
✓ Remediation data available to IT Operations
Integrated (complicated)
✓ Team rotations, paging policies, role hunting
✓ Continuous improvement of key health indicators
✓ Technical collaboration across all incidents
✓ Docs up to date and easily accessible
✓ Consistent real-time communication practices
Strategic (complex)
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
37. Reactive
(chaotic)
✓No automation
✓No operational stack awareness
✓Poor collaboration between teams (Dev & Ops)
✓Documentation not available
✓No standardized communication
✓High focus on consistent continuous learning
38. Tactical
(obvious)
✓Uses a NOC
✓Some monitoring & alerting instrumentation
✓Collaboration in crisis
✓"Mission critical" processes are available
✓Understood crisis communication protocols
✓Remediation data available to IT Operations
39. Integrated
(complicated)
✓Team rotations, paging policies, role hunting
✓Continuous improvement of key health indicators
✓Technical collaboration across all incidents
✓Docs up to date and easily accessible
✓Consistent real-time communication practices
40. Strategic
(complex)
✓Automated docs and remediation
✓Actionable Alerts with full context
✓High collaboration among all teams
✓Documentation part of remediation
✓Targeted, proactive crisis comms
✓High focus on continuous learning
41. “Six Trends Shape DevOps Adoption, Q1 2015”
Forrester report
• The Foundation For Success Is In Place . . . Mostly
• Fear Of Failure Will Hamper Advancement
• Monitoring And Analytics Strategies Must Make A Big Leap Forward
• The Focus On Customer Experience Is Not Second Nature . . . Yet
• Change And Release Processes Are Not Delivering Business Needs
• You Must Prioritize And Focus Sourcing Strategies
64. Bridget Kromhout | Pivotal - Cloud Foundry
Principal Technologist
• Bridget Kromhout is a Principal Technologist for Cloud Foundry at
Pivotal.
• After years as an operations engineer (most recently at DramaFever),
she traded in oncall for more travel.
• A frequent speaker at tech conferences, she helps organize tech
meetups at home in Minneapolis, serves on the program committee for
Velocity, and acts as a global core organizer for devopsdays.
• She podcasts at Arrested DevOps, occasionally blogs at
bridgetkromhout.com, and is active in a Twitterverse near you.
72. @bridgetkromhout
“Almost every task run
under Borg contains a
built-in HTTP server that
publishes information
about the health of the
task and thousands of
performance metrics”
Large-scale cluster management at Google with Borg - Verma et al. 2015
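The idea in the quote, a task that carries its own small HTTP server for health and metrics, can be sketched with Python's standard library. This is only an illustration of the pattern, not how Borg tasks are implemented; the `/status` path, the `requests_served` metric, and the function names are assumptions made up for the example.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()
REQUESTS_SERVED = 0  # hypothetical example metric


def snapshot():
    """Return the health and metrics document this task publishes."""
    return {
        "healthy": True,
        "uptime_seconds": round(time.monotonic() - START, 3),
        "requests_served": REQUESTS_SERVED,
    }


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_SERVED
        if self.path != "/status":  # hypothetical path
            self.send_error(404)
            return
        REQUESTS_SERVED += 1
        body = json.dumps(snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_status_server(port=0):
    """Serve /status from a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_status_server()` next to the task's real work lets a monitoring system poll `/status` instead of guessing health from the outside.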
73. @bridgetkromhout
The Art of Monitoring (2016) — James Turnbull
Monitoring Maturity Model
artofmonitoring.com
74. @bridgetkromhout Image credit: Wikipedia
“Any organization that designs a system…
will produce a design
whose structure is a copy of
the organization's
communication
structure.”
Mel Conway
79. Andy Domeier | SPS Commerce
Director System Operations
• Andy has been in Technology Operations leadership with SPS
Commerce for the past 11 years.
• Andy spends many mental cycles collaborating to find
effective patterns for monitoring and operating complex,
changing systems.
• Andy also spends time solving for priority organization and
alignment and the organization of knowledge.
109. Ben Overmyer | Star Tribune
Digital Manager, Operations
• Ben is the Digital Manager of Operations at the Minneapolis Star
Tribune.
• He has over a decade of experience as a back end software engineer,
two years of experience as a dedicated operations engineer, and great
enthusiasm for the DevOps culture.
• Besides the Star Tribune, he’s worked for an eclectic mix of
organizations, including the USGS, a game company in New Zealand,
and a beauty products marketing company.
• When not hacking on servers, apps, or people, he acts as art director
and author for a tabletop gaming company.
111. IN THE BEGINNING
▸ Forwarded phone line
▸ An on-call list maintained in a wiki
▸ Every week, manually change to the next person on the list
▸ …and overrides or substitutions?
112. EARLY MONITORING
▸ Zabbix monitoring set up for a handful of causes
▸ Zabbix alerts sent via email to a distribution list
▸ Sometimes no one would see these alerts until hours or, in
rare cases, days later
113. THE PAIN POINTS
▸ Manual maintenance of the calling tree data
▸ Manual rotation of the support phone line forwarding
▸ Poor documentation of incident life cycles
▸ No sense of incident frequency beyond “this was a bad
couple weeks”
▸ If the on-call person didn’t respond, there was no
escalation process other than calling the head of Digital
115. ADOPTING VICTOROPS
▸ Automated rotations
▸ Multiple teams
▸ Automatic escalation processes
▸ Easy schedule overrides and changes
▸ APIs for programmatic incident interaction
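To make "automated rotations", "schedule overrides", and "automatic escalation" concrete, here is a minimal sketch of the logic that replaces the weekly wiki edit. It is not VictorOps code; the function names, the weekly handoff, and the example people are assumptions for illustration.

```python
from datetime import date


def on_call(rotation, start, day, overrides=None):
    """Return who is on call on `day` for a weekly rotation that began on `start`.

    `overrides` maps a specific date to a substitute and wins over the rotation.
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def escalation_chain(rotation, start, day, overrides=None):
    """Primary responder first, then everyone else in rotation order."""
    primary = on_call(rotation, start, day, overrides)
    return [primary] + [person for person in rotation if person != primary]


# Hypothetical team and dates.
team = ["alice", "bob", "carol"]
first_monday = date(2016, 1, 4)
swap = {date(2016, 1, 12): "carol"}
print(on_call(team, first_monday, date(2016, 1, 12), swap))
print(escalation_chain(team, first_monday, date(2016, 1, 12), swap))
```

Because the schedule is computed rather than edited by hand, a missed page can walk down the chain instead of ending at a single phone number.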
116. THE NATURE OF ALERTS
▸ OK, we can set up programmatic alerts. Now what?
▸ Integrating Zabbix, New Relic, and CloudWatch
▸ Discovering alert floods
▸ Move to alerting on symptoms, not causes
▸ …but still monitoring causes
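One way to picture "alert on symptoms, not causes, but still monitor causes" is a small triage step in front of the pager. The sketch below is an assumption about how such a filter could look, not the Star Tribune's setup: the `Alert` fields, the `kind` labels, and the check names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str  # e.g. "zabbix", "newrelic", "cloudwatch"
    check: str   # hypothetical check name
    kind: str    # "symptom" (user-visible) or "cause" (underlying condition)


def triage(alerts):
    """Split alerts into ones that page a human and ones kept as context."""
    pages, context = [], []
    seen = set()
    for alert in alerts:
        key = (alert.source, alert.check)
        if key in seen:
            continue  # collapse repeats so a flood becomes one entry
        seen.add(key)
        if alert.kind == "symptom":
            pages.append(alert)
        else:
            context.append(alert)
    return pages, context
```

Cause alerts are still recorded, so whoever is paged for the symptom has them available as context.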
118. THE SPIDEY-SENSE FACTOR
▸ Humans are good at catching certain kinds of problems
▸ “This doesn’t feel right” and gaps in monitoring
▸ The evolution of the Sev incident system
119. THE STATUS SITE: MANUAL ALERTING FOR NON-TECH USERS
▸ Want to let certain non-tech users report Sev incidents
▸ Initially just a password-protected form
▸ Uses the VictorOps alert ingestion API for triggering alerts
▸ Uses the VictorOps public API for fetching information
▸ Each Sev alert is created with its own entity_id
▸ Lets admin users share status updates
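A status-site form handler along these lines might build one alert per report, each with its own entity_id. The sketch below only constructs the request and does not send it. The JSON field names (`message_type`, `entity_id`, `entity_display_name`, `state_message`) reflect my understanding of the VictorOps REST endpoint integration and should be checked against the current documentation; the endpoint URL, the entity_id format, and the helper names are hypothetical.

```python
import json
import urllib.request
import uuid


def build_sev_alert(severity, summary, reporter):
    """Build an alert payload with a unique entity_id for one Sev report."""
    entity_id = f"sev{severity}-{uuid.uuid4().hex[:8]}"  # hypothetical format
    return {
        "message_type": "CRITICAL",
        "entity_id": entity_id,
        "entity_display_name": f"Sev {severity}: {summary}",
        "state_message": f"Reported via status site by {reporter}",
    }


def build_request(endpoint_url, payload):
    """Wrap the payload in a JSON POST request (not sent here)."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; the real one comes from the integration settings.
payload = build_sev_alert(2, "Article pages loading slowly", "newsroom")
request = build_request("https://example.invalid/alert", payload)
```

Giving every report a fresh entity_id keeps separate Sev incidents from being merged into one.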
120. MONTHLY INCIDENT REPORTING
▸ Monthly reports include a list of all Sev incidents, when
they started, when they ended, what the alert text was, and
what the resolution was
▸ Combine automated and chat messages in VictorOps with
data gathered from other sources
▸ Present this data as automatically as possible in the Status
Site
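The report fields listed above map naturally onto a small record type. This is a sketch under assumptions, with a hypothetical `Incident` class and line format; how the data is actually pulled from VictorOps and the other sources is not shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    sev: int
    started: datetime
    ended: datetime
    alert_text: str
    resolution: str


def monthly_report(incidents, year, month):
    """List every Sev incident that started in the given month, oldest first."""
    rows = [i for i in incidents
            if i.started.year == year and i.started.month == month]
    rows.sort(key=lambda i: i.started)
    lines = [f"Sev incidents for {year}-{month:02d}: {len(rows)}"]
    for i in rows:
        minutes = int((i.ended - i.started).total_seconds() // 60)
        lines.append(
            f"Sev {i.sev} | {i.started:%Y-%m-%d %H:%M} to "
            f"{i.ended:%Y-%m-%d %H:%M} ({minutes} min) | "
            f"{i.alert_text} | {i.resolution}"
        )
    return "\n".join(lines)
```

Including the duration on each line also answers the earlier pain point about incident frequency with numbers rather than impressions.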
122. NEXT STEPS
▸ Integration of summarized data collected from Datadog/
CloudWatch/etc. into incident reporting
▸ Reports for users that shouldn’t have access to VictorOps
▸ Integration of the Status Site into Slack
126. Breakout Sessions
◻ ChatOps - Jason Hand
◻ Leveraging Data to Establish a Healthy Culture - Andy Domeier
◻ Monitoring and Microservices – Bridget Kromhout
◻ Blameless Culture – Heather Mickman
◻ Devs vs. Ops On-Call, How and Why to Get started – Ben Overmyer
130. Heather Mickman | Target
Senior Director of Platform Engineering
• Heather Mickman is the Senior Director of Platform Engineering at Target and a
DevOps enthusiast.
• Heather has 20+ years of IT experience in various roles and industries including
retail, transportation, and high tech manufacturing.
• She is currently working on building the platforms used by software engineers
at Target including a multi-provider cloud platform, API Gateway, telemetry
tooling, data stores, and messaging.
• She has a passion for technology, building high performing teams, driving a
culture of innovation, and having fun along the way. Heather lives in
Minneapolis with her 2 sons and mini dachshund.
153. Unordered | Ordered
Unordered
Complex: Cause Effect Only Apparent in Hindsight (Probe – Sense – Respond)
Chaotic: Cause & Effect Cannot Be Related (Act – Sense – Respond)
Ordered
Complicated: Cause Effect Requires Analysis (Sense – Analyze – Respond)
Obvious: Cause Effect Obvious From Experience (Sense – Categorize – Respond)
155. The systems we engineer, maintain, and improve are
Complicated
.. or ..
Known unknowns
156. The systems we engineer, maintain, and improve are
Complex
Unknown unknowns
182. Incident Management Maturity
Chart labels: TimeToRepair (TTR); Continuous Improvement Efforts
Reactive (0 – 4) (chaotic)
✓ No automation
✓ No operational stack awareness
✓ Poor collaboration between teams (Dev & Ops)
✓ Documentation not available
✓ No standardized communication
✓ High focus on consistent continuous learning
Tactical (5 – 9) (obvious)
✓ Uses a NOC
✓ Some monitoring & alerting instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are available
✓ Understood crisis communication protocols
✓ Remediation data available to IT Operations
Integrated (10 – 14) (complicated)
✓ Team rotations, paging policies, role hunting
✓ Continuous improvement of key health indicators
✓ Technical collaboration across all incidents
✓ Docs up to date and easily accessible
✓ Consistent real-time communication practices
Strategic (15 – 18) (complex)
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
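The score bands on this slide can be read as a simple lookup. The sketch below assumes an integer score from 0 to 18 has already been worked out; the slide does not say how the score is calculated, and the function name is made up for the example.

```python
# Score bands as shown on the slide; range() upper bounds are exclusive.
LEVELS = [
    (range(0, 5), "Reactive"),
    (range(5, 10), "Tactical"),
    (range(10, 15), "Integrated"),
    (range(15, 19), "Strategic"),
]


def maturity_level(score):
    """Map an incident management maturity score (0 to 18) to its level."""
    for band, name in LEVELS:
        if score in band:
            return name
    raise ValueError(f"score must be an integer from 0 to 18, got {score!r}")


for example_score in (3, 7, 12, 18):
    print(example_score, maturity_level(example_score))
```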