2. Mr. White has fifteen years of experience designing and managing the
deployment of Systems Monitoring and Event Management software. Prior
to joining IBM, Mr. White held various positions, including leading the
Monitoring and Event Management organization of a Fortune 100 company
and developing solutions as a consultant for a wide variety of organizations,
including the Mexican Secretaría de Hacienda y Crédito Público, Telmex,
Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance, and the US
Navy Facilities and Engineering Command.
Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation
4. Ground rules for this session…
• If you can’t tell if I am trying to be funny…
– GO AHEAD AND LAUGH!
• Feel free to text, tweet, yammer, or whatever
to share with the rest of the attendees
• If you have a question, no need to wait until
the end. Just interrupt me. Seriously… I
don’t mind.
5. I am here today to share some of what I have learned about
6. We (IT) sell promises…
The value of these promises depends on the
customer’s perception that we are willing and
capable of making good on the promise when
the time comes. This perception is affected by
the interactions they have with us.
8. Anatomy of an Outage
[Diagram: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, WAS, Database, zOS, CICS, MQ, DB2]
• 5:45-ish pm: CICS ABENDS start flooding the console but not high enough to ticket
• 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
• 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
• 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
• 10:29pm: Support teams investigate MQ, rule it out, and ultimately decide to reset CICS to resolve the issue
11. Why is problem solving hard?
Non-transparency (lack of clarity of the situation)
• commencement opacity
• continuation opacity
Polytely (multiple goals)
• inexpressiveness
• opposition
• transience
Complexity (large numbers of items, interrelations, and decisions)
• enumerability
• connectivity (hierarchy relation, communication relation, allocation relation)
• heterogeneity
Dynamics (time considerations)
• temporal constraints
• temporal sensitivity
• phase effects
• dynamic unpredictability
12. Boyd’s Loop
[Diagram labels: Observation; Outside Information; Unfolding Circumstances; Unfolding Interaction With Environment; Implicit Guidance & Control; Cultural Norms; Cognitive Abilities; Knowledge Life Cycle; Prior Wisdom; New Information; Decision (Hypothesis); Action (Test); Feed Forward; Feedback]
• Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the
feedback and other phenomena coming into our sensing or observing window.
• Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing
process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
Observe Orient Decide Act
13. Where the Breakdown Occurs
Observe, Orient, Decide, Act
Situational Awareness
Level 1: Perception of Elements in Current Situation
Level 2: Comprehension of Current Situation
Level 3: Projection of Future Status
Decision
Performance of Actions
Current State
Feedback
Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
• Long Term Memory
• Automaticity
• Cognitive Processes
Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness
in dynamic systems. Human Factors 37(1), 32–64.
14. Incident Life Cycle
Down Time
Detection Time
Response Time
Repair Time
Recovery Time
Outage
Detection
Diagnosis
Repair
Recover
Restore
Observe
Orient
Decide
Act
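As a rough sketch of how the intervals on this slide could be measured, the Python below derives each one from incident timestamps. The field names, the mapping of each interval onto a pair of milestones, and the sample times are all assumptions made for illustration; the deck does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Milestone timestamps for one incident (hypothetical field names)."""
    outage: datetime      # service actually degraded
    detected: datetime    # someone or something noticed
    diagnosed: datetime   # cause identified
    repaired: datetime    # fix applied
    restored: datetime    # service back to normal

    def intervals(self) -> dict[str, timedelta]:
        # Assumed mapping of the slide's interval labels onto milestones.
        return {
            "detection_time": self.detected - self.outage,
            "response_time": self.diagnosed - self.detected,
            "repair_time": self.repaired - self.diagnosed,
            "recovery_time": self.restored - self.repaired,
            "down_time": self.restored - self.outage,
        }


# Made-up timestamps, for illustration only.
incident = IncidentTimeline(
    outage=datetime(2024, 1, 1, 17, 45),
    detected=datetime(2024, 1, 1, 18, 14),
    diagnosed=datetime(2024, 1, 1, 22, 29),
    repaired=datetime(2024, 1, 1, 22, 45),
    restored=datetime(2024, 1, 1, 23, 0),
)
parts = incident.intervals()
for name, value in parts.items():
    print(f"{name}: {value}")
```

Under this mapping the four component intervals always add up to the down time, which makes it easy to see which phase dominated an outage.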
16. Point of
Observation
Past Behavior
• The observation period
used to feed the
forecasting models
Future Behavior
• The performance
period the model is
trying to predict
Predictive Modeling Timeline
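One way to picture the timeline is as a split of a metric history around the point of observation. The sketch below is a minimal illustration; the function name, the `lookback` and `horizon` parameters, and the hourly samples are hypothetical and not taken from the deck.

```python
from datetime import datetime, timedelta


def split_at_observation_point(samples, point, lookback, horizon):
    """Split (timestamp, value) samples around a point of observation.

    Returns (past, future): `past` is the observation period used to
    feed a model, `future` is the performance period it tries to predict.
    """
    past = [s for s in samples if point - lookback <= s[0] < point]
    future = [s for s in samples if point <= s[0] < point + horizon]
    return past, future


# Hypothetical hourly metric samples, for illustration only.
start = datetime(2024, 1, 1)
samples = [(start + timedelta(hours=h), float(h)) for h in range(48)]

point = start + timedelta(hours=24)
past, future = split_at_observation_point(
    samples, point, lookback=timedelta(hours=12), horizon=timedelta(hours=6)
)
print(len(past), len(future))
```

Keeping the two windows strictly on either side of the observation point prevents future behavior from leaking into the data that feeds the model.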
17. Predictive models
harness the information
lost in past data so you
can discretely identify
situations and react to
them quickly.
18. What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
§ Is the patient feeling unstable angina?
§ Is there fluid in the patient’s lungs?
§ Is the patient’s systolic blood pressure below 100?
The Goldman Algorithm
[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours, Traditional Techniques vs. Goldman Algorithm, scale 0 to 100]
By paying attention to what really matters, Dr.
Goldman improved the “false negatives” by 20
percentage points and eliminated the “false
positives” altogether.
19. The Goldman Algorithm
ECG Evidence of Acute Ischemia?
ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads
(New or Unknown Age) or
T-Wave Inversion in ≥ 2 Contiguous Leads (New or
Unknown Age) or
Left Bundle-Branch Block (New or Unknown Age)
Observation
Unit
Inpatient
Telemetry Unit
High Risk
Low Risk
Very Low Risk
Moderate Risk
Yes
No
Coronary
Care Unit
No
ECG Evidence of Acute Myocardial Infarction (MI)?
ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous
Leads (New or Unknown Age)
or
Pathologic Q Waves in ≥ 2 Contiguous Leads (New
or Unknown Age)
Yes
Patient suspected of Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
0 Factors
2 or 3 Factors
1 Factor
0 or 1 Factors
2 or 3 Factors
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
22. Our Thought Process
*** not very reliable
Cognition
Limbic Center
(hippocampus and amygdala)
Cortex
(hippocampus and amygdala)
Conscious Choice
(via motor centers)
Most primitive, seat of unconscious
Long-term memory
Conscious, meaning, choice
Perception
(via the senses)***
Pre-Frontal Cortex
(hippocampus and amygdala)
Stimulus
23. Short Term Memory
Your Brain
Working Memory
Understanding
Judgement
Relationship
Short-term memory is
where the real work of
sense-making takes place
Short-term memory
has a limited
amount of space
(The estimate is 7 ± 2)
30. 1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems.
Human Factors 37(1), 32–64.
Our systems are capable of
producing a huge amount of data,
both on the status of their own
components and on the status of
the environment. The problem
with today’s systems is not a lack
of information, but finding what is
needed when it is needed.
31. Our success in any endeavor depends directly on
our ability to solve problems
What do we need to do that?
33. Common Problem Types
§ Design Problems
§ Creative Problems
§ Daily Problems
§ People Problems
Rule-Based
Approach
Event Based
Approach
34. The Problem with the
Rules-Based Approach
• Solutions are driven by accepted conventions
• Best practices are coveted and are adopted without
understanding how and why they were developed
• There must always be a right answer
• No logical analysis is required
• People are frequently seen as the “root cause”
• The outcomes are enforced using “re-dos” and
punitive actions (or the looming threat of these things)
35. Event-Based Problem Solving
• Appreciative Understanding
• Know What We Are Solving
• Create A Common Reality
• Solutions Based on Causes
37. The Pre-Mortem Process
Define the Problem
Chart the Causal Relationships and Add Evidence
Identify Solutions
Implement the Solutions
39. Problem Definition
• What:
• When:
Date/Time:
Relative: what was happening at the time of this event?
• Where:
Specific:
Relative: logical dependencies?
• Significance:
availability:
environment:
costs:
revenue
maintenance?
other miscellaneous costs
frequency:
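The template above could be captured as a small record so that unanswered prompts are easy to spot. This is only a sketch: the class name, field names, and the `unanswered` helper are hypothetical, and the sample values reuse the SQL Server example that appears elsewhere in the deck.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemDefinition:
    """One field per prompt on the slide; an empty string means unanswered."""
    what: str = ""
    when_datetime: str = ""
    when_relative: str = ""
    where_specific: str = ""
    where_relative: str = ""
    significance_availability: str = ""
    significance_environment: str = ""
    significance_costs: str = ""
    significance_frequency: str = ""

    def unanswered(self) -> list[str]:
        """Names of the prompts that still need an answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


definition = ProblemDefinition(
    what="SQL Server stopped processing queries",
    where_specific="T: drive",
)
print(definition.unanswered())
```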
40. Gut Check…
You should be able to answer all of the following:
• Why are we working on this?
• How much time should we spend?
• What people do we need?
• How much money should we spend?
41. The What Statement
• It is used as “The Primary Effect (PE)”
– It is a statement of what we want to prevent from
happening again
• There may be more than one
– If they are unrelated, perform separate RCAs
– If they are related and you can’t decide which to
use, pick the one that is nearest to the present
time
• Noun/verb statement
43. The T: Drive reached 0 Bytes free
The database stopped processing queries
The application server was timing out
Users were getting 500 errors on the website
Customers called the helpdesk to complain
Add more hard drive space
Have you seen something like this before?
What do we really know?
44. It’s never that simple
Customers Complaining
Web Server returning 500 errors
The application server was timing out
SQL Server was not processing queries
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Only one application server exists
More Information Needed
Trying to do business on the website
Desired Condition
[Diagram connectors: seven -AND- junctions]
46. Rules for Causal Relationships
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
② You can keep identifying causes – there is no limit
47. Two Important Questions
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
Ask “Why?”
Ask “What”
48. Rules for Causal Relationships
③ An Effect is often the result of multiple causes
SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
[Diagram connectors: three -AND- junctions]
49. Rules for Causal Relationships
④ Causes need to be both necessary and sufficient
SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control
[Diagram connectors: three -AND- junctions]
51. RCA Diagram
[Diagram: the same cause-and-effect chart as slide 44]
52. Add Evidence
[Diagram: the same cause-and-effect chart as slide 44, annotated with evidence]
Statistical Data
Situational Observation
53. Examples of Evidence
• Personal experience or observation
• Statistical data (Monitoring Metrics)
• Examples, particular events, or situations that
illustrate
• Analogies (comparisons with similar situations)
• Informed opinion (the opinions of experts and
authorities)
• Historical documentation
• Experimental evidence
54. Ideas for Finding Causes
Causes
Management
Organization
Process
Knowledge
Technology
People
Information
Applications
Infrastructure
Capital
56. Failure Modes Analysis
SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file
[Diagram connectors: five -AND- and two -OR- junctions]
57. Picking Monitors
[Diagram: the same failure modes chart as slide 56]
Monitor the intersections at the “OR’s”
At least one point along each branch after the “OR”
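The guidance above can be sketched with a small AND/OR tree. The tree below is a simplified, assumed arrangement of a few boxes from the failure modes chart (the real chart's wiring is not recoverable from the slide text), and the `Node`, `occurs`, and `monitor_points` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    gate: str = "LEAF"            # "AND", "OR", or "LEAF"
    children: list["Node"] = field(default_factory=list)


def occurs(node: Node, observed: set[str]) -> bool:
    """True if the node's condition holds given the observed leaf conditions."""
    if node.gate == "LEAF":
        return node.label in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def monitor_points(node: Node) -> list[str]:
    """Each OR intersection plus the first point on every branch below it."""
    points = []
    if node.gate == "OR":
        points.append(node.label)
        points.extend(child.label for child in node.children)
    for child in node.children:
        points.extend(monitor_points(child))
    return list(dict.fromkeys(points))  # drop duplicates, keep order


# Assumed, simplified wiring of boxes named on the slide.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
        Node("C: Drive at 0 Bytes free"),
    ]),
])

print(monitor_points(tree))
print(occurs(tree, {"T: Drive at 0 Bytes free", "Space allocations are fixed"}))
```

Watching one point per branch under an OR tells you which branch is in play, which is the information an operator needs to rule the other branches out.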
58. FMEA Matrix
(Impact Calculation)
Severity
Negligible (1-2): no loss in functionality, mostly cosmetic
Marginal (3-4): temporary interruptions, or the degradation lasts for a brief period of time
Critical (5-6): the problem will not resolve itself, but a workaround exists that allows the problem to be bypassed
Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost, but the system is usable to some extent
Catastrophic (9-10): the system is completely unusable
Occurrence
Improbable (1-2): less than 1 time per year
Remote (3-4): 1 time per year
Occasional (5-6): 1 time per month
Probable (7-8): 1 time per day
Chronic (9-10): 1 or more times per day
Detection
Very high (1-2): during the design phase
High (3-4): during peer review or unit testing
Moderate (5-6): during system testing or acceptance testing
Remote (7-8): during or immediately after production deployment
Very Remote (9-10): only after heavy usage by users
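The slide lists three 1 to 10 scales but does not show how they are combined. Conventional FMEA multiplies the severity, occurrence, and detection ratings into a risk priority number (RPN); the sketch below assumes that convention applies here. The failure mode names come from the deck, while the class, its fields, and the scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int     # 1 (negligible) .. 10 (catastrophic)
    occurrence: int   # 1 (improbable) .. 10 (chronic)
    detection: int    # 1 (caught in design) .. 10 (caught only after heavy usage)

    def rpn(self) -> int:
        """Risk priority number, assumed to be severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score {score} is outside the 1-10 scale")
        return self.severity * self.occurrence * self.detection


# Invented scores, for illustration only.
modes = [
    FailureMode("C: Drive at 0 Bytes free", severity=7, occurrence=3, detection=5),
    FailureMode("T: Drive at 0 Bytes free", severity=9, occurrence=4, detection=7),
]
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)
for mode in ranked:
    print(mode.name, mode.rpn())
```

Sorting by the product puts the failure modes that are severe, frequent, and hard to catch early at the top of the list.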
59. FMEA Matrix
(Evidence)
These are the events that help us to RULE IN a
failure mode as a possible cause
These are the events that help us RULE OUT the
failure mode as not relevant
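A minimal sketch of how rule-in and rule-out evidence might be applied during an incident follows. The `triage` function, the event names, and the policy that an observed rule-out event takes precedence over a rule-in event are all assumptions; only the failure mode names come from the deck.

```python
def triage(failure_modes, observed_events):
    """Sort failure modes into ruled-out, ruled-in, and still-open buckets.

    `failure_modes` maps a name to {"indications": set, "contraindications": set}.
    An observed contraindication rules the mode out; otherwise an observed
    indication rules it in; otherwise it still needs to be investigated.
    """
    result = {"ruled_out": [], "ruled_in": [], "open": []}
    for name, evidence in failure_modes.items():
        if evidence["contraindications"] & observed_events:
            result["ruled_out"].append(name)
        elif evidence["indications"] & observed_events:
            result["ruled_in"].append(name)
        else:
            result["open"].append(name)
    return result


# Hypothetical event names, for illustration only.
failure_modes = {
    "T: Drive at 0 Bytes free": {
        "indications": {"t_drive_full_alert"},
        "contraindications": {"t_drive_free_space_ok"},
    },
    "Available RAM at 0 Bytes Free": {
        "indications": {"low_memory_alert"},
        "contraindications": {"memory_free_ok"},
    },
    "C: Drive at 0 Bytes free": {
        "indications": {"c_drive_full_alert"},
        "contraindications": {"c_drive_free_space_ok"},
    },
}
observed = {"t_drive_full_alert", "memory_free_ok"}
status = triage(failure_modes, observed)
print(status)
```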
60. Application-Technology Matrix
Maps services, applications and technologies
enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new
hardware is acquired
• Whether a service has sufficient monitoring
coverage based on its application components
• This approach allows for anticipating changes to a
customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
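As an illustration of how such a matrix could answer the coverage question, the sketch below scores a service by its weakest technology. That weakest-link rule, the service names, and the per-technology scores are assumptions for the example; only the 0 to 2 score labels come from the slide.

```python
SCORE_LABELS = {
    0: "No Strategy",
    1: "Limited Monitoring",
    2: "Fully Integrated Strategy",
}

# Hypothetical monitoring scores per technology.
technology_scores = {"WAS": 2, "DB2": 1, "MQ": 0}

# Hypothetical mapping of services to the technologies they depend on.
service_components = {
    "online-banking": ["WAS", "DB2", "MQ"],
    "reporting": ["WAS", "DB2"],
}


def coverage(service: str) -> int:
    """Score a service by its least-monitored technology (assumed rule).

    A technology missing from the matrix counts as 0, "No Strategy".
    """
    return min(technology_scores.get(tech, 0) for tech in service_components[service])


for service in service_components:
    score = coverage(service)
    print(f"{service}: {score} ({SCORE_LABELS[score]})")
```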
62. During Service Support
• Command Centers and Support Teams
– Use the failure modes to rule out causes
– Each failure mode will have a documented process to
follow to mitigate the impact once the likely failure
mode is identified
• Incident Managers
– Start bridge calls and provide an accounting of all the
potential failure modes, which have been successfully
ruled out, and which need to be investigated
– Coordinate the investigation assignments and
consolidate the investigation results
63. Facilitating Production Assurance
• CritSits
– Start the CritSit meeting and provide an accounting of all the
potential failure modes, which have been successfully ruled out,
and which need to be investigated
– Initiate investigations / experiments by assigning potential failure
modes to the incident response teams
• Problem Management
– Document the causal elements as new failure modes
– Disseminate new failure modes to Architecture, the Monitoring
Team, and the Command Center/Service Desk
• Reporting
– Produce a monthly newsletter to application owners with the
list of failure modes they should discuss with their architects
– Incorporate failure modes into “Fault Line” analysis
64. During the Design Process
• Architects
– Certify that designs do not contain the known failure
modes or document that the failure mode does not
present an unacceptable risk
– Document the requirements for Solution Architects to
follow to ensure the mitigation strategies are
implemented
• Developers
– Certify that designs do not contain the known failure
modes or document that the failure mode does not
present an unacceptable risk
– Certify the designs implement the mitigation strategies
65. Improving Enterprise
Processes and Tools
• Systems Management and Monitoring
– Develop new monitoring requirements using the
documented indications and contraindications
• Event Management
– Develop new correlations tying indications and
contraindications to failure modes to assist in ruling out or
ruling in those “in play” more efficiently
• Configuration Management
– Develop new discovery patterns using the documented
indications and contraindications
– Develop automations to detect the presence of failure
mode conditions and generate an event to the Event
Management System
67. Running a Good Pre-Mortem
SUCCESSFUL RCA
• Defer Judgment
• Encourage Wild Ideas
• Build on Ideas
• Stay Focused
• One Person at a Time
• Be Visual
• Go for Quantity
68. Here is Why It Works
RCA Process
Re-Establishes Personal Relationships
Social Networks
Cooling-Off Period
De-Escalating Gestures
Confidence-Building Measures
Trust
Building
Respect
69. Don’t try to create everything at once.
Knowledge is something that is
created over time.
Iterative Development