Outage Insurance: Everything
You Need to Know
Mr. White has fifteen years of experience designing and managing the
deployment of Systems Monitoring and Event Management software. Prior
to joining IBM, Mr. White held various positions including the leader of the
Monitoring and Event Management organization of a Fortune 100 company
and developing solutions as a consultant for a wide variety of organizations,
including the Mexican Secretaría de Hacienda y Crédito Público, Telmex,
Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US
Navy Facilities and Engineering Command.
Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation
http://weheartit.com/entry/12433848
Ground rules for this
session…
•  If you can’t tell if I am trying to be funny…
–  GO AHEAD AND LAUGH!
•  Feel free to text, tweet, yammer, or whatever
to share with the rest of the attendees
•  If you have a question, no need to wait until
the end. Just interrupt me. Seriously… I
don’t mind.
I am here today to share some of what I have learned about
We (IT) sell promises…
The value of these promises depends on the
customer’s perception that we are willing and
capable of making good on the promise when
the time comes. This perception is affected by
the interactions they have with us.
http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/
Objective #1: Users Love Our IT Systems…
Anatomy of an Outage
[Diagram: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, WAS, Database, zOS (CICS, MQ, DB2), with numbered callouts 1 to 5]
•  5:45-ish pm: CICS ABENDs start flooding the console but not high enough to ticket
•  6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
•  6:04 pm: Synthetic transactions fail; at 6:14 the Ops Center confirms the issue and creates a P0 Incident
•  6:54 pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
•  10:29 pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/
Bad Experience!!!
http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg
Why did this happen?!
Why is problem solving hard?
Non-transparency (lack of clarity of the situation)
• commencement opacity
• continuation opacity
Polytely (multiple goals)
• inexpressiveness
• opposition
• transience
Complexity (large numbers of items, interrelations, and decisions)
• enumerability
• connectivity (hierarchy relation, communication relation, allocation relation)
• heterogeneity
Dynamics (time considerations)
• temporal constraints
• temporal sensitivity
• phase effects
• dynamic unpredictability
Boyd’s Loop
Observation
Outside
Information
Implicit Guidance & Control
Unfolding Interaction With Environment
Feedback
Feedback
Unfolding
Circumstances
 Cultural
Norms
Cognitive
Abilities
Knowledge 
Life Cycle
Prior
Wisdom
New 
Information
Feed
Forward
 Decision
(Hypothesis)
Feed
Forward
 Action
(Test)
Feed
Forward
•  Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the
feedback and other phenomena coming into our sensing or observing window.
•  Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing
process of projection, empathy, correlation, and rejection.

From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
Observe Orient Decide Act
Where the Breakdown
Occurs
Observe, Orient, Decide, Act
Situational Awareness
• Level 1: Perception of Elements in Current Situation
• Level 2: Comprehension of Current Situation
• Level 3: Projection of Future Status
Current State, Situational Awareness, Decision, Performance of Actions, Feedback
Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation
Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
Cognitive Processes
• Long Term Memory
• Automaticity
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness
in dynamic systems. Human Factors 37(1), 32–64.
Incident Life Cycle
Down Time
Detection Time
 Response Time
 Repair Time
 Recovery Time
Outage
Detection
Diagnosis
Repair
Recover
Restore
Observe
 Orient
 Decide
 Act
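The way the phases above add up to total down time can be sketched as follows. This is a minimal illustration that reuses the timestamps from the "Anatomy of an Outage" slide; the calendar date, the milestone names, and the choice of which timestamp closes which phase are assumptions made for the example, not something the slides specify.

```python
from datetime import datetime, timedelta


def phase_durations(milestones):
    """Return {"a -> b": timedelta} for each consecutive pair of (name, time) milestones."""
    durations = {}
    for (name_a, t_a), (name_b, t_b) in zip(milestones, milestones[1:]):
        if t_b < t_a:
            raise ValueError(f"{name_b} is earlier than {name_a}")
        durations[f"{name_a} -> {name_b}"] = t_b - t_a
    return durations


# Arbitrary date (assumption); only the clock times come from the outage slide.
DAY = datetime(2024, 1, 1)


def at(hour, minute):
    return DAY.replace(hour=hour, minute=minute)


outage = [
    ("outage", at(17, 45)),                  # CICS ABENDs start
    ("detection", at(18, 4)),                # synthetic transactions fail
    ("incident opened", at(18, 14)),         # Ops Center creates the P0
    ("decision to reset CICS", at(22, 29)),  # repair action chosen
]

durations = phase_durations(outage)
total = sum(durations.values(), timedelta())

for phase, spent in durations.items():
    print(f"{phase}: {spent}")
print(f"total: {total}")
```

Laid out this way, the gap between opening the incident and deciding on a repair dominates the total, which is the Orient and Decide portion of the loop.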
Problem Life Cycle
Evaluation
Recognition
Observation
Analysis
Solution
Validation
Control
Point of
Observation
Past Behavior
• The observation period
used to feed the
forecasting models
Future Behavior
• The performance
period the model is
trying to predict
Predictive Modeling Timeline
Predictive models
harness the information
lost in past data so you
can discretely identify
situations and
react to them quickly.
What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
§  Is the patient feeling unstable
angina?
§  Is there fluid in the patient’s lungs?
§  Is the patient’s systolic blood
pressure below 100?

The Goldman Algorithm
Prediction of Patients Expected to
Have a Heart Attack Within 72 Hours
[Bar chart, scale 0 to 100: Traditional Techniques vs. Goldman Algorithm]
By paying attention to what really matters, Dr.
Goldman improved the “false negatives” by 20
percentage points and eliminated the “false
positives” altogether.
The Goldman Algorithm
ECG Evidence of Acute Ischemia?
ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads
(New or Unknown Age) or
T- Wave Inversion in ≥ 2 Contiguous Leads (New or
Unknown Age) or
Left Bundle-Branch Block (New or Unknown Age)
Observation
Unit
Inpatient
Telemetry Unit
High Risk
 Low Risk
 Very Low Risk
Moderate Risk
Yes
 No
Coronary
Care Unit
No
ECG Evidence of Acute Myocardial Infarction (MI)?
ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous
Leads (New or Unknown Age)
or
Pathologic Q Waves in ≥ 2 Contiguous Leads (New
or Unknown Age)
Yes
Patient suspected of
Acute Cardiac
Ischemia
Perform
Electrocardiogram
(EKG)
0 Factors
2 or 3 Factors
 1 Factor
0 or 1 Factors
2 or 3 Factors
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
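The point of the flowchart is that a handful of yes/no questions is enough to sort cases, which is easy to see once it is written as a function. The sketch below is one reading of the flattened diagram above: the branch order and the factor-count thresholds are inferred from the labels on the slide, the function and parameter names are hypothetical, and the mapping from risk level to care unit is left out because the extracted slide does not make it recoverable.

```python
def goldman_risk(mi_on_ecg: bool, ischemia_on_ecg: bool, urgent_factors: int) -> str:
    """Classify risk from two ECG findings and the count of urgent factors (0 to 3).

    Urgent factors, per the slide: rales above both lung bases, systolic blood
    pressure below 100 mm Hg, unstable ischemic heart disease.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")

    # ECG evidence of acute MI short-circuits everything else.
    if mi_on_ecg:
        return "high"

    # ECG evidence of acute ischemia: the factor count splits high from moderate.
    if ischemia_on_ecg:
        return "high" if urgent_factors >= 2 else "moderate"

    # No ECG evidence: the factor count alone decides.
    if urgent_factors >= 2:
        return "moderate"
    if urgent_factors == 1:
        return "low"
    return "very low"


print(goldman_risk(mi_on_ecg=False, ischemia_on_ecg=True, urgent_factors=1))
```

The same shape carries over to IT triage: a few well-chosen observations, asked in a fixed order, replace an open-ended review of everything that could be measured.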
First…
… we need
to talk a little
bit about
your brain
The Triune Brain
Reptilian Brain
(basal ganglia)
Mammalian Brain
(limbic system)
Cognitive Brain
(neocortex)
Our Thought Process
*** not very reliable
Cognition
Limbic Center
(hippocampus and amygdala)
Cortex
(hippocampus and amygdala)
Conscious Choice
(via motor centers)
Most primitive, seat of unconscious
Long-term memory
Conscious, meaning, choice
Perception
(via the senses)***
Pre-Frontal Cortex
(hippocampus and amygdala)
Stimulus
Short Term Memory
Your Brain
Working Memory
Understanding
Judgement
Relationship
Short-term memory is
where the real work of
sense-making takes place
Short-term memory
has a limited
amount of space
(The estimate is 7 ± 2)
The big-data dilemma
Time
Quantity
Information the brain can consume
Information is cheap.
Understanding is expensive.
-Karl Fast, Professor of UX Design, Kent State University
• Patterns
• Comparisons
• Organization
Information
• Decisions
• Skill
• Adaptation
Intelligence
• Trends
• Generalizations
• Beliefs
Knowledge
• Accountability
• Foresight
• Synthesis
Wisdom
• Symbols
• Metrics
• Facts
Data
Correlation
Analysis
Application
Understanding
Complexity
Context
Communication
Repetition
From Data to Wisdom
[Plot with axes x and y]
y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i
Data
Information
Knowledge
Past
 Future
Abstract
Tangible
Information
 Intelligence
Knowledge
 Wisdom
Data
Knowledge is the point of transition
Why Knowledge?
All You Need
Love
1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems.
Human Factors 37(1), 32–64.
Our systems are capable of
producing a huge amount of data,
both on the status of their own
components and on the status of
the environment. The problem
with today’s systems is not a lack
of information, but finding what is
needed when it is needed.
Our success in any endeavor depends directly on
our ability to solve problems
What do we need to do that?
You Gotta Have Skillz…!
Common Problem Types
§  Design Problems
§  Creative Problems
§  Daily Problems
§  People Problems
Rule-Based
Approach
Event Based
Approach
The Problem with the
Rules-Based Approach
•  Solutions are driven by accepted conventions
•  Best practices are coveted and are adopted without
understanding how and why they were developed
•  There must always be a right answer
•  No logical analysis is required
•  People are frequently seen as the “root cause”
•  The outcomes are enforced using “re-dos” and
punitive actions (or the looming threat of these things)
Event-Based Problem Solving
•  Appreciative Understanding
•  Know What We Are Solving
•  Create A Common Reality
•  Solutions Based on Causes
The Pre-Mortem Process
Define the
Problem
Chart the
Causal
Relationships
and Add
Evidence
Identify
Solutions
Implement
the Solutions
Step 1: Define the Problem
Problem Definition
•  What: 
•  When: 
Date/Time:
Relative: what was happening at the time of this event?
•  Where:
Specific:
Relative: logical dependencies?
•  Significance:
availability:
environment:
costs: 
revenue

maintenance?

other miscellaneous costs
frequency:
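The template above can be kept as a small record so that an incomplete problem definition is caught before the analysis starts. This is a minimal sketch: the class name, field names, and sample values are hypothetical and simply mirror the headings on the slide.

```python
from dataclasses import dataclass, field

# Significance headings taken from the slide.
SIGNIFICANCE_KEYS = ("availability", "environment", "costs", "frequency")
TEXT_FIELDS = ("what", "when_datetime", "when_relative", "where_specific", "where_relative")


@dataclass
class ProblemDefinition:
    what: str = ""
    when_datetime: str = ""
    when_relative: str = ""    # what was happening at the time of this event?
    where_specific: str = ""
    where_relative: str = ""   # logical dependencies
    significance: dict = field(default_factory=dict)

    def missing_fields(self):
        """List every part of the definition that is still blank."""
        missing = [name for name in TEXT_FIELDS if not getattr(self, name).strip()]
        missing += [
            f"significance.{key}"
            for key in SIGNIFICANCE_KEYS
            if not str(self.significance.get(key, "")).strip()
        ]
        return missing


# Illustrative draft only; the values are invented for the example.
draft = ProblemDefinition(
    what="Database stopped processing queries",
    where_specific="T: drive",
    significance={"availability": "website returned 500 errors"},
)
print(draft.missing_fields())
```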
Gut Check…
You should be able to answer all of the following:
•  Why are we working on this?
•  How much time should we spend?
•  What people do we need?
•  How much money should we spend?
The What Statement
•  It is used as “The Primary Effect (PE)”
– It is a statement of what we want to prevent from
happening again
•  There may be more than one
– If they are unrelated, perform separate RCA’s
– If they are related and you can’t decide which to
use, pick the one that is nearest to the present
time
•  Noun/verb statement
Step 2: Add Causal
Relationships and Evidence
The T: Drive
reached 0 Bytes
free
The database
stopped
processing
queries
The application
server was timing
out
Users were
getting 500 errors
on the website
Customers called
the helpdesk to
complain
Add more hard
drive space
Have you seen something like this before?
What do we really know?
It’s never that simple
Customers
Complaining
Web Server returning
500 errors
The application
server was timing
out
SQL Server was not
processing queries
Transaction log was
unable to grow
T: Drive at 0 Bytes
free
Logs were not
truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1
DBA
“Backup” DBA was
not aware the logs
require truncation
Space allocations
are fixed
 Lack of Control
Only one database
cluster in use
DR SQL Cluster
DR Cluster being
used for UAT testing
More Information
Needed
Only one application
server exists
More Information
Needed
Trying to do business
on the website
 Desired Condition
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
Rules for Causal
Relationships
Database Down (Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause)
①  Causes are effects, and effects are causes
Rules for Causal
Relationships
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
②  You can keep identifying causes – there is no limit
Two Important Questions
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
Ask “Why?”
Ask “What”
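The two questions can be read as walking a cause-and-effect graph in opposite directions. The sketch below assumes that "Why?" moves from an effect toward its causes and "What?" moves from a cause toward its effects; the dictionary layout and function names are hypothetical, and the second cause under "Drive full" is borrowed from the larger diagram purely to show an AND relationship.

```python
# Each effect maps to the causes that together (AND) produce it.
CAUSES = {
    "Database down": ["Drive full"],
    "Drive full": ["Logs not truncated", "Space allocations are fixed"],
}


def ask_why(effect, causes):
    """Return every recorded cause behind `effect`, nearest causes first."""
    chain = []
    seen = set()
    pending = list(causes.get(effect, []))
    while pending:
        cause = pending.pop(0)
        if cause in seen:
            continue
        seen.add(cause)
        chain.append(cause)
        # Keep digging into this cause before moving to its siblings.
        pending = list(causes.get(cause, [])) + pending
    return chain


def ask_what(cause, causes):
    """Return the effects that `cause` directly contributes to."""
    return [effect for effect, direct in causes.items() if cause in direct]


print(ask_why("Database down", CAUSES))
print(ask_what("Drive full", CAUSES))
```

A cause with no entry of its own in the dictionary is not a root cause; it only marks where the team stopped asking "Why?".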
Rules for Causal
Relationships
③  An Effect is often the result of multiple causes
SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
-AND-
-AND-
-AND-
Rules for Causal
Relationships
④  Causes need to be both necessary and sufficient
SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control
-AND-
-AND-
-AND-
How Fire Works
Time
Oxygen
Heat
Fuel
Fire
Match Strike
Transitory
Non-Transitory
Fire
Oxygen
Heat
Fuel
Match
Strike
-AND-
• Transitory Causes act as catalysts to bring
about change (think Transition)
• Non-Transitory Causes are objects,
properties/attributes, and status
RCA Diagram
Customers
Complaining
Web Server returning
500 errors
The application
server was timing
out
SQL Server was not
processing queries
Transaction log was
unable to grow
T: Drive at 0 Bytes
free
Logs were not
truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1
DBA
“Backup” DBA was
not aware the logs
require truncation
Space allocations
are fixed
 Lack of Control
Only one database
cluster in use
DR SQL Cluster
DR Cluster being
used for UAT testing
More Information
Needed
Only one application
server exists
More Information
Needed
Trying to do business
on the website
 Desired Condition
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
Add Evidence
Customers
Complaining
Web Server returning
500 errors
The application
server was timing
out
SQL Server was not
processing queries
Transaction log was
unable to grow
T: Drive at 0 Bytes
free
Logs were not
truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1
DBA
“Backup” DBA was
not aware the logs
require truncation
Space allocations
are fixed
 Lack of Control
Only one database
cluster in use
DR SQL Cluster
DR Cluster being
used for UAT testing
More Information
Needed
Only one application
server exists
More Information
Needed
Trying to do business
on the website
 Desired Condition
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
Statistical Data
Situational
Observation
Examples of Evidence
•  Personal experience or observation
•  Statistical data (Monitoring Metrics)
•  Examples, particular events, or situations that
illustrate
•  Analogies (comparisons with similar situations)
•  Informed opinion (the opinions of experts and
authorities)
•  Historical documentation
•  Experimental evidence
Ideas for Finding Causes
Causes
Management
Organization
Process
Knowledge
Technology
People
Information
Applications
Infrastructure
Capital
Step 3: Find Solutions
Failure Modes Analysis
SQL Server Not Available
Transaction log is unable
to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1 DBA
“Backup” DBA was not
aware the logs require
truncation
(Condition Cause)
Space allocations are
fixed
(Condition Cause)
Lack of Control
SQL is unable to cache
query results 
Available RAM at 0 Bytes
Free
C: Drive at 0 Bytes free
Minidump is configured to
write to C: Drive
Server was ASRing
frequently
Software distributions
were leaving files in the
TEMP folder
%TEMP% configured to
C:\Temp
Kernel able to write to
page file
-AND-
-AND-
-AND-
-AND-
-OR-
-AND-
-OR-
Picking Monitors
SQL Server Not Available
Transaction log is unable
to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1 DBA
“Backup” DBA was not
aware the logs require
truncation
(Condition Cause)
Space allocations are
fixed
(Condition Cause)
Lack of Control
SQL is unable to cache
query results 
Available RAM at 0 Bytes
Free
C: Drive at 0 Bytes free
Minidump is configured to
write to C: Drive
Server was ASRing
frequently
Software distributions
were leaving files in the
TEMP folder
%TEMP% configured to
C:\Temp
Kernel able to write to
page file
-AND-
-AND-
-AND-
-AND-
-OR-
-AND-
-OR-
Monitor the
intersections at
the “OR’s”
At least one point
along each branch
after the “OR”
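The monitoring advice above can be sketched as a small AND/OR tree. Everything structural here is an assumption: the flattened slide does not preserve which boxes hang under which gate, so the tree below is a simplified, hypothetical arrangement of a few of the labels, and `Node`, `occurs`, and `monitor_points` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"                       # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def occurs(node, observed):
    """Evaluate whether a failure mode is active given a set of observed leaf conditions."""
    if node.gate == "LEAF":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def monitor_points(node):
    """Pick each OR intersection plus the first node on every branch below it."""
    points = []
    if node.gate == "OR":
        points.append(node.name)
        points.extend(child.name for child in node.children)
    for child in node.children:
        points.extend(p for p in monitor_points(child) if p not in points)
    return points


# Hypothetical tree built from labels on the slide.
tree = Node("SQL Server not available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: drive at 0 bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 bytes free"),
        Node("C: drive at 0 bytes free"),
    ]),
])

print(monitor_points(tree))
print(occurs(tree, {"T: drive at 0 bytes free", "Space allocations are fixed"}))
```

Monitoring at the OR and once per branch is what lets an operator tell which path led to the failure instead of only seeing that it happened.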
FMEA Matrix
(Impact Calculation)
Negligible (1-2): no loss in functionality,
mostly cosmetic
Marginal (3-4): temporary interruptions or the
degradation lasts for a brief period of time
Critical (5-6): the problem will not resolve
itself but a work around exists allowing the
problem to be bypassed
Serious (7-8): the problem will not resolve
itself and no work around is possible.
Functionality is impaired or lost but the
system is usable to some extent
Catastrophic (9-10): the system is
completely unusable
Improbable (1-2): less than 1 time per year
Remote (3-4): 1 time per year
Occasional (5-6): 1 time per month
Probable (7-8): 1 time per day
Chronic (9-10): 1 or more times per day
Very high (1-2): during the design phase
High (3-4): during peer review or unit testing
Moderate (5-6): during system testing or
acceptance testing
Remote (7-8): during or immediately after
production deployment
Very Remote (9-10): only after heavy usage
by users
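The slide gives three 1 to 10 scales but not the formula that combines them. The sketch below assumes the conventional FMEA risk priority number, the product of the three scores, and reads the scales as severity, occurrence, and detection; those labels, the function names, and the sample scores are assumptions, while the band names come from the slide.

```python
SEVERITY_BANDS = ["Negligible", "Marginal", "Critical", "Serious", "Catastrophic"]
OCCURRENCE_BANDS = ["Improbable", "Remote", "Occasional", "Probable", "Chronic"]
DETECTION_BANDS = ["Very high", "High", "Moderate", "Remote", "Very Remote"]


def band(score, names):
    """Map a 1-10 score onto its two-point band (1-2, 3-4, ..., 9-10)."""
    if not 1 <= score <= 10:
        raise ValueError("scores run from 1 to 10")
    return names[(score - 1) // 2]


def risk_priority_number(severity, occurrence, detection):
    """Conventional FMEA RPN: the product of the three scores (assumed formula)."""
    for label, score in (("severity", severity),
                         ("occurrence", occurrence),
                         ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{label} must be between 1 and 10")
    return severity * occurrence * detection


# Illustrative scores for a full T: drive; invented for the example.
rpn = risk_priority_number(severity=8, occurrence=3, detection=9)
print(band(8, SEVERITY_BANDS), band(3, OCCURRENCE_BANDS), band(9, DETECTION_BANDS), rpn)
```

Sorting failure modes by this number is one simple way to decide which ones get monitors first.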
FMEA Matrix
(Evidence)
These are the events that help us to RULE IN a
failure mode as a possible cause
These are the events that help us RULE OUT the
failure mode as not relevant
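Ruling failure modes in and out from observed events can be sketched as a set comparison. The failure modes, event names, and the rule that a single contraindication outweighs any indication are all assumptions chosen for illustration.

```python
def triage_failure_modes(failure_modes, observed):
    """Split failure modes into ruled in, ruled out, and still needing investigation."""
    ruled_in, ruled_out, investigate = [], [], []
    for name, evidence in failure_modes.items():
        if evidence["contraindications"] & observed:
            ruled_out.append(name)      # something seen that excludes this mode
        elif evidence["indications"] & observed:
            ruled_in.append(name)       # something seen that supports this mode
        else:
            investigate.append(name)    # no evidence either way yet
    return ruled_in, ruled_out, investigate


# Hypothetical catalogue of failure modes and their evidence.
FAILURE_MODES = {
    "Transaction log unable to grow": {
        "indications": {"T: drive at 0 bytes free"},
        "contraindications": {"transaction log grew in the last hour"},
    },
    "SQL unable to cache query results": {
        "indications": {"available RAM at 0 bytes"},
        "contraindications": {"available RAM above threshold"},
    },
    "Application server overloaded": {
        "indications": {"thread pool exhausted"},
        "contraindications": {"thread pool idle"},
    },
}

observed_events = {"T: drive at 0 bytes free", "available RAM above threshold"}
ruled_in, ruled_out, investigate = triage_failure_modes(FAILURE_MODES, observed_events)
print(ruled_in, ruled_out, investigate)
```

The three lists are exactly the accounting an incident manager would read out at the start of a bridge call.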
Application-Technology Matrix
Maps services, applications and technologies
enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new
hardware is acquired
• Whether a service has sufficient monitoring
coverage based on its application components
• This approach allows for anticipating changes to a
customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
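A matrix like this can be held as a nested mapping and queried for gaps. In the sketch below the service names, technology names, and scores are invented, and treating a service as sufficiently covered only when every component scores 2 is an assumption about how the scores would be used.

```python
# service -> technology -> score (0 no strategy, 1 limited monitoring, 2 fully integrated)
MATRIX = {
    "Online banking": {"WAS": 2, "DB2": 1, "MQ": 0},
    "Intranet": {"Web servers": 2, "Database": 2},
}


def coverage_gaps(matrix, minimum=2):
    """List (service, technology, score) for every component scoring below `minimum`."""
    return [
        (service, technology, score)
        for service, technologies in matrix.items()
        for technology, score in technologies.items()
        if score < minimum
    ]


def sufficiently_covered(matrix, service, minimum=2):
    """True when every technology behind `service` meets the minimum score."""
    return all(score >= minimum for score in matrix[service].values())


gaps = coverage_gaps(MATRIX)
print(gaps)
print(sufficiently_covered(MATRIX, "Intranet"))
```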
Step 4: Use this
knowledge intelligently
During Service Support
•  Command Centers and Support Teams
–  Use the failure modes to rule out causes
–  Each failure mode will have a documented process to
follow to mitigate the impact once the likely failure
mode is identified
•  Incident Managers
–  Start bridge calls and provide an accounting of all the
potential failure modes, which have been successfully
ruled out, and which need to be investigated
–  Coordinate the investigation assignments and
consolidate the investigation results
Facilitating Production Assurance
•  CritSits
–  Start the CritSit meeting and provide an accounting of all the
potential failure modes, which have been successfully ruled out,
and which need to be investigated
–  Initiate investigations / experiments by assigning potential failure
modes to the incident response teams
•  Problem Management
–  Document the causal elements as new failure modes
–  Disseminate new failure modes to Architecture, the Monitoring
Team, and the Command Center/Service Desk
•  Reporting
–  Produce a monthly newsletter to application owners with the
list of failure modes they should discuss with their architects
–  Incorporate failure modes into “Fault Line” analysis
During the Design Process
•  Architects 

–  Certify that designs do not contain the known failure
modes or document that the failure mode does not
present an unacceptable risk
–  Document the requirements for Solution Architects to
follow to ensure the mitigation strategies are
implemented
•  Developers
–  Certify that designs do not contain the known failure
modes or document that the failure mode does not
present an unacceptable risk
–  Certify the designs implement the mitigation strategies
Improving Enterprise
Processes and Tools
•  Systems Management and Monitoring
–  Develop new monitoring requirements using the
documented indications and contraindications
•  Event Management
–  Develop new correlations tying indications and
contraindications to failure modes to assist in ruling out or
ruling in those “in play” more efficiently
•  Configuration Management
–  Develop new discovery patterns using the documented
indications and contraindications
–  Develop automations to detect the presence of failure
mode conditions and generate an event to the Event
Management System
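An automation that detects a failure-mode condition and raises an event can be as small as the sketch below, which checks free disk space with the standard library's `shutil.disk_usage`. The event fields, the threshold, and the function name are hypothetical, and delivering the event to an Event Management System is left out because the slides do not name one. The `usage` parameter exists only so the check can be exercised without a real full drive.

```python
import shutil
import time
from collections import namedtuple


def check_free_space(path, min_free_bytes, usage=shutil.disk_usage):
    """Return an event dict when free space on `path` is below the threshold, else None."""
    free = usage(path).free
    if free >= min_free_bytes:
        return None
    return {
        "failure_mode": "drive out of space",  # hypothetical failure-mode label
        "path": path,
        "free_bytes": free,
        "threshold_bytes": min_free_bytes,
        "timestamp": time.time(),
    }


# Stand-in for a full drive so the example does not depend on the local machine.
Usage = namedtuple("Usage", "total used free")


def full_drive(path):
    return Usage(total=100, used=100, free=0)


event = check_free_space("T:\\", min_free_bytes=1, usage=full_drive)
print(event["failure_mode"], event["free_bytes"])
```

In the database example from earlier in the deck, a check like this on the T: drive would have raised an event before the transaction log ran out of room.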
A few final thoughts…
Running a Good Pre-Mortem
Defer
Judgment
Encourage
Wild Ideas
Build on
Ideas
Stay Focused
One Person
at a Time
Be Visual
Go for
Quantity
SUCCESSFUL
RCA
Here is Why It Works
RCA
Process
Re-
Establishes
Personal
Relationships
Social
Networks
Cooling-Off
Period
De-
Escalating
Gestures
Confidence-
Building
Measures
Trust
Building
Respect
Don’t try to create everything at once.
Knowledge is something that is
created over time.
Iterative Development
Let’s keep the
conversation going…
Andrew.P.White@Gmail.com
ReverendDrew
SystemsManagementZen.Wordpress.com
systemsmanagementzen.wordpress.com/feed/
@SystemsMgmtZen
ReverendDrew
APWhite@us.ibm.com
614-306-3434

Contenu connexe

Tendances

Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...
Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...
Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...Steve Werby
 
Be More Secure than your Competition: MePush Cyber Security for Small Business
Be More Secure than your Competition:  MePush Cyber Security for Small BusinessBe More Secure than your Competition:  MePush Cyber Security for Small Business
Be More Secure than your Competition: MePush Cyber Security for Small BusinessArt Ocain
 
How to improve your system monitoring
How to improve your system monitoringHow to improve your system monitoring
How to improve your system monitoringAndrew White
 
How Digital Trends Are Compressing Processes
How Digital Trends Are Compressing ProcessesHow Digital Trends Are Compressing Processes
How Digital Trends Are Compressing ProcessesSharon Richardson
 
Automated decision making with predictive applications – Big Data Amsterdam
Automated decision making with predictive applications – Big Data AmsterdamAutomated decision making with predictive applications – Big Data Amsterdam
Automated decision making with predictive applications – Big Data AmsterdamLars Trieloff
 
Building a Successful Organization By Mastering Failure
Building a Successful Organization By Mastering FailureBuilding a Successful Organization By Mastering Failure
Building a Successful Organization By Mastering Failurejgoulah
 
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy Webinar
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy WebinarBeyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy Webinar
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy WebinarKaren Skiles
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialWill Gallego
 
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...European Data Forum
 
Jim Proce May 2019 APWA Reporter - Smart Meters
Jim Proce May 2019 APWA Reporter - Smart MetersJim Proce May 2019 APWA Reporter - Smart Meters
Jim Proce May 2019 APWA Reporter - Smart MetersJim Proce
 
Using Periodic Audits To Prevent Catastrophic Project Failure
Using Periodic Audits To Prevent Catastrophic Project FailureUsing Periodic Audits To Prevent Catastrophic Project Failure
Using Periodic Audits To Prevent Catastrophic Project Failureicgfmconference
 
Do end-users fit the informatics requirements?
Do end-users fit the informatics requirements?Do end-users fit the informatics requirements?
Do end-users fit the informatics requirements?John Trigg
 
One hundred rules for nasa project managers
One hundred rules for nasa project managersOne hundred rules for nasa project managers
One hundred rules for nasa project managersAndreea Mocanu
 
COMPISSUES08 - Credibility of Technology
COMPISSUES08 - Credibility of TechnologyCOMPISSUES08 - Credibility of Technology
COMPISSUES08 - Credibility of TechnologyMichael Heron
 
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...University of Hertfordshire
 

Tendances (20)

Creating a Technology Disaster Plan
Creating a Technology Disaster PlanCreating a Technology Disaster Plan
Creating a Technology Disaster Plan
 
Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...
Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...
Bad Advice, Unintended Consequences, and Broken Paradigms: Think & Act Di...
 
Be More Secure than your Competition: MePush Cyber Security for Small Business
Be More Secure than your Competition:  MePush Cyber Security for Small BusinessBe More Secure than your Competition:  MePush Cyber Security for Small Business
Be More Secure than your Competition: MePush Cyber Security for Small Business
 
How to improve your system monitoring
How to improve your system monitoringHow to improve your system monitoring
How to improve your system monitoring
 
How Digital Trends Are Compressing Processes
How Digital Trends Are Compressing ProcessesHow Digital Trends Are Compressing Processes
How Digital Trends Are Compressing Processes
 
Automated decision making with predictive applications – Big Data Amsterdam
Automated decision making with predictive applications – Big Data AmsterdamAutomated decision making with predictive applications – Big Data Amsterdam
Automated decision making with predictive applications – Big Data Amsterdam
 
Dit yvol3iss41
Dit yvol3iss41Dit yvol3iss41
Dit yvol3iss41
 
Building a Successful Organization By Mastering Failure
Building a Successful Organization By Mastering FailureBuilding a Successful Organization By Mastering Failure
Building a Successful Organization By Mastering Failure
 
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy Webinar
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy WebinarBeyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy Webinar
Beyond the Knowledge Base: Turning Data into Wisdom - an ITSM Academy Webinar
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
 
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
 
Jim Proce May 2019 APWA Reporter - Smart Meters
Jim Proce May 2019 APWA Reporter - Smart MetersJim Proce May 2019 APWA Reporter - Smart Meters
Jim Proce May 2019 APWA Reporter - Smart Meters
 
The foundations of agile
The foundations of agileThe foundations of agile
The foundations of agile
 
Using Periodic Audits To Prevent Catastrophic Project Failure
Using Periodic Audits To Prevent Catastrophic Project FailureUsing Periodic Audits To Prevent Catastrophic Project Failure
Using Periodic Audits To Prevent Catastrophic Project Failure
 
Do end-users fit the informatics requirements?
Do end-users fit the informatics requirements?Do end-users fit the informatics requirements?
Do end-users fit the informatics requirements?
 
Diy (Health) Care
Diy (Health) CareDiy (Health) Care
Diy (Health) Care
 
One hundred rules for nasa project managers
One hundred rules for nasa project managersOne hundred rules for nasa project managers
One hundred rules for nasa project managers
 
COMPISSUES08 - Credibility of Technology
COMPISSUES08 - Credibility of TechnologyCOMPISSUES08 - Credibility of Technology
COMPISSUES08 - Credibility of Technology
 
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...
Technologies of Attractions - Museums, Galaries, Zoos, Castles, Dockyards, Fu...
 
Projektledelse og softwareinnovation
Projektledelse og softwareinnovationProjektledelse og softwareinnovation
Projektledelse og softwareinnovation
 

Similaire à Brighttalk outage insurance- what you need to know - final

Data Driven Risk Management
Data Driven Risk ManagementData Driven Risk Management
Data Driven Risk ManagementResolver Inc.
 
Session 10. Risk Management lekture financial markets
Session 10. Risk Management lekture financial marketsSession 10. Risk Management lekture financial markets
Session 10. Risk Management lekture financial marketsparketkril09
 
4th Workshop on Strategic Crisis Management, Keynote Presentation - Strategic...
4th Workshop on Strategic Crisis Management, Keynote Presentation - Strategic...4th Workshop on Strategic Crisis Management, Keynote Presentation - Strategic...
4th Workshop on Strategic Crisis Management, Keynote Presentation - Strategic...OECD Governance
 
David Hancock - Risk Leadership in a world of Uncertainty and Ambiguity

Brighttalk outage insurance- what you need to know - final

  • 2. Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions including the leader of the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command. Andrew White Cloud and Smarter Infrastructure Solution Specialist IBM Corporation
  • 4. Ground rules for this session… •  If you can’t tell if I am trying to be funny… –  GO AHEAD AND LAUGH! •  Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees •  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
  • 5. I am here today to share some of what I have learned about
  • 6. We (IT) sell promises… The value of these promises depends on the customer’s perception that we are willing and able to make good on the promise when the time comes. This perception is affected by the interactions they have with us.
  • 8. Anatomy of an Outage. Diagram components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS, CICS, WAS, Database, WAS, Database, zOS, MQ, DB2. Timeline: 5:45-ish pm: CICS ABENDs start flooding the console but not high enough to ticket. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics. 6:04pm: Synthetic transactions fail, and at 6:14pm the Ops Center confirms the issue and creates a P0 Incident. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue.
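The timeline on this slide can be turned into elapsed minutes per step, which makes it easier to see where the time went. The following is a minimal sketch, not part of the original deck: the times are copied from the slide (the "-ish" entries are approximate), while the placeholder date, the shortened event labels, and the helper name `minutes_between` are assumptions made for illustration.

```python
from datetime import datetime

# Times come from the slide; the "-ish" entries are approximate.
# The slide gives no calendar date, so DAY is a placeholder (assumption).
DAY = "2000-01-01"
timeline = [
    ("17:45", "CICS ABENDs start flooding the console"),
    ("18:00", "MQ flows are interrupted"),
    ("18:04", "Synthetic transactions fail"),
    ("18:14", "Ops Center creates a P0 Incident"),
    ("18:54", "Support teams call it a back-end problem"),
    ("22:29", "MQ ruled out, decision to reset CICS"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same (placeholder) day."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(f"{DAY} {end}", fmt) - datetime.strptime(f"{DAY} {start}", fmt)
    return int(delta.total_seconds() // 60)


# Minutes spent getting from each event to the next one.
gaps = [
    (later[1], minutes_between(earlier[0], later[0]))
    for earlier, later in zip(timeline, timeline[1:])
]
total = minutes_between(timeline[0][0], timeline[-1][0])

for label, minutes in gaps:
    print(f"{minutes:4d} min until: {label}")
print(f"{total:4d} min from first symptom to the reset decision")
```

Under these assumptions, the longest single gap is the one between the "back-end" determination and the decision to reset CICS.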
  • 11. Why is problem solving hard? Non-transparency (lack of clarity of the situation): commencement opacity, continuation opacity. Polytely (multiple goals): inexpressiveness, opposition, transience. Complexity (large numbers of items, interrelations, and decisions): enumerability, connectivity (hierarchy relation, communication relation, allocation relation), heterogeneity. Dynamics (time considerations): temporal constraints, temporal sensitivity, phase effects, dynamic unpredictability.
  • 12. Boyd’s Loop: Observe, Orient, Decide, Act. Diagram labels: Observation, Outside Information, Implicit Guidance & Control, Unfolding Interaction With Environment, Feedback, Unfolding Circumstances, Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information, Feed Forward, Decision (Hypothesis), Action (Test). • Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window. • Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection. From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
  • 13. Where the Breakdown Occurs. Observe, Orient, Decide, Act. Situational Awareness: Level 1, Perception of Elements in Current Situation; Level 2, Comprehension of Current Situation; Level 3, Projection of Future Status. Then: Decision, Performance of Actions, Feedback to the Current State. Individual Influences: Goals & Objectives, Preconceptions, Expectations; Abilities, Experience, Training; Long Term Memory, Automaticity, Cognitive Processes. Systemic Influences: System Capability, Interface Design, Stress & Workload, Complexity, Automation. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  • 14. Incident Life Cycle Down Time Detection Time Response Time Repair Time Recovery Time Outage Detection Diagnosis Repair Recover Restore Observe Orient Decide Act
  • 15. Problem Life Cycle: Evaluation, Recognition, Observation, Analysis, Solution, Validation, Control
  • 16. Point of Observation Past Behavior • The observation period used to feed the forecasting models Future Behavior • The performance period the model is trying to predict Predictive Modeling Timeline
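The predictive modeling timeline can be sketched as a split of timestamped samples at the point of observation: everything up to that point feeds the model, and everything after it is what the model is trying to predict. This is an illustrative sketch only; the function name `split_at_observation`, the sample readings, and the choice to count a sample taken exactly at the point of observation as past behavior are all assumptions.

```python
from datetime import datetime


def split_at_observation(samples, point_of_observation):
    """Split (timestamp, value) pairs into the observation and performance periods."""
    # Assumption: a sample taken exactly at the point of observation is past behavior.
    observation = [s for s in samples if s[0] <= point_of_observation]
    performance = [s for s in samples if s[0] > point_of_observation]
    return observation, performance


# Hypothetical daily free-space readings (GB) for one drive.
samples = [
    (datetime(2024, 1, 1), 40),
    (datetime(2024, 1, 2), 30),
    (datetime(2024, 1, 3), 20),
    (datetime(2024, 1, 4), 10),
]

past, future = split_at_observation(samples, datetime(2024, 1, 2))
print("observation period:", [value for _, value in past])
print("performance period:", [value for _, value in future])
```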
  • 17. Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
  • 18. What Matters Most? Dr. Lee Goldman, Cook County Hospital, Chicago, IL. Is the patient feeling unstable angina? Is there fluid in the patient’s lungs? Is the patient’s systolic blood pressure below 100? Chart: The Goldman Algorithm, Prediction of Patients Expected to Have a Heart Attack Within 72 Hours (scale 0 to 100; Traditional Techniques vs. Goldman Algorithm). By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  • 19. The Goldman Algorithm. Patient suspected of Acute Cardiac Ischemia: Perform Electrocardiogram (EKG). ECG Evidence of Acute Myocardial Infarction (MI)? ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age). ECG Evidence of Acute Ischemia? ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age). Urgent Factors Present? Rales Above Both Lung Bases; Systolic Blood Pressure < 100 mm Hg; Unstable Ischemic Heart Disease. Flowchart labels: Yes, No; 0 Factors, 1 Factor, 0 or 1 Factors, 2 or 3 Factors; Very Low Risk, Low Risk, Moderate Risk, High Risk; Observation Unit, Inpatient Telemetry Unit, Coronary Care Unit.
  • 20. First… … we need to talk a little bit about your brain
  • 21. The Triune Brain Reptilian Brain (basal ganglia) Mammalian Brain (limbic system) Cognitive Brain (neocortex)
  • 22. Our Thought Process. Diagram labels: Stimulus; Perception (via the senses)***; Limbic Center (hippocampus and amygdala); Cortex (hippocampus and amygdala); Pre-Frontal Cortex (hippocampus and amygdala); Cognition; Conscious Choice (via motor centers). Annotations: Most primitive, seat of unconscious; Long-term memory; Conscious, meaning, choice. *** not very reliable
  • 23. Short Term Memory Your Brain Working Memory Understanding Judgement Relationship Short-term memory is where the real work of sense-making takes place Short-term memory has a limited amount of space (The estimate is 7 ± 2)
  • 25. Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University
  • 27. Data, Information, Knowledge. Plot axes: x, y. Fitted model: $y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$
  • 28. Past Future Abstract Tangible Information Intelligence Knowledge Wisdom Data Knowledge is the point of transition Why Knowledge?
  • 30. Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  • 31. Our success in any endeavor depends directly on our ability to solve problems. What do we need to do that?
  • 32. You Gotta Have Skillz…!
  • 33. Common Problem Types §  Design Problems §  Creative Problems §  Daily Problems §  People Problems Rule-Based Approach Event Based Approach
  • 34. The Problem with the Rules-Based Approach •  Solutions are driven by accepted conventions •  Best practices are coveted and are adopted without understanding how and why they were developed •  There must always be a right answer •  No logical analysis is required •  People are frequently seen as the “root cause” •  The outcomes are enforced using “re-dos” and punitive actions (or the looming threat of these things)
  • 35. Event-Based Problem Solving •  Appreciative Understanding •  Know What We Are Solving •  Create A Common Reality •  Solutions Based on Causes
  • 36.
  • 37. The Pre-Mortem Process: Define the Problem; Chart the Causal Relationships and Add Evidence; Identify Solutions; Implement the Solutions
  • 38. Step 1: Define the Problem
  • 39. Problem Definition •  What: •  When: Date/Time: Relative: what was happening at the time of this event? •  Where: Specific: Relative: logical dependencies? •  Significance: availability: environment: costs: revenue maintenance? other miscellaneous costs frequency:
  • 40. Gut Check… You should be able to answer all of the following: •  Why are we working on this? •  How much time should we spend? •  What people do we need? •  How much money should we spend?
  • 41. The What Statement •  It is used as “The Primary Effect (PE)” – It is a statement of what we want to prevent from happening again •  There may be more than one – If they are unrelated, perform separate RCAs – If they are related and you can’t decide which to use, pick the one that is nearest to the present time •  Noun/verb statement
  • 42. Step 2: Add Causal Relationships and Evidence
  • 43. The T: Drive reached 0 Bytes free. The database stopped processing queries. The application server was timing out. Users were getting 500 errors on the website. Customers called the helpdesk to complain. Add more hard drive space. Have you seen something like this before? What do we really know?
  • 44. It’s never that simple Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND-
  • 45. Rules for Causal Relationships ① Causes are effects, and effects are causes. Database Down (Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause)
  • 46. Rules for Causal Relationships ② You can keep identifying causes – there is no limit. End of the Universe (Effect); Database Down (Primary Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause/Effect); Beginning of Time (Cause)
  • 47. Two Important Questions: Ask “Why?” Ask “What”. End of the Universe (Effect); Database Down (Primary Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause/Effect); Beginning of Time (Cause)
  • 48. Rules for Causal Relationships ③ An Effect is often the result of multiple causes. SQL Server was not processing queries (Effect); Transaction log was unable to grow; T: Drive at 0 Bytes free; Logs were not truncated; DBA on honeymoon vacation in Fiji; Logs are truncated manually; Company has only 1 DBA; “Backup” DBA was not aware the logs require truncation; Space allocations are fixed; Lack of Control. -AND- -AND- -AND-
  • 49. Rules for Causal Relationships ④ Causes need to be both necessary and sufficient. SQL Server was not processing queries (Effect); Transaction log was unable to grow (Transitory Cause); T: Drive at 0 Bytes free (Non-transitory Cause & Effect); Logs were not truncated (Transitory Cause & Effect); DBA on honeymoon vacation in Fiji (Transitory Cause); Logs are truncated manually (Non-Transitory Cause); Company has only 1 DBA (Non-Transitory Cause); “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause); Space allocations are fixed (Non-Transitory Cause); Lack of Control. -AND- -AND- -AND-
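The rules for causal relationships can be sketched as a small graph in which every statement may be both an effect and a cause, and asking "Why?" walks from an effect to the causes behind it. This is a minimal sketch under stated assumptions: the statements are copied from the slides, but the flattened transcript does not preserve which causes sit under which -AND- connector, so the grouping below is a guess, and the names `causes`, `ask_why`, and `leaf_causes` are hypothetical.

```python
# effect -> causes that jointly produce it (assumed to be joined by -AND-).
# Assumption: the grouping is inferred; the slide's diagram layout is not recoverable.
causes = {
    "SQL Server was not processing queries": ["Transaction log was unable to grow"],
    "Transaction log was unable to grow": ["T: Drive at 0 Bytes free"],
    "T: Drive at 0 Bytes free": ["Logs were not truncated", "Space allocations are fixed"],
    "Logs were not truncated": [
        "DBA on honeymoon vacation in Fiji",
        "Logs are truncated manually",
        "Company has only 1 DBA",
        "Backup DBA was not aware the logs require truncation",
    ],
}


def ask_why(effect, depth=0):
    """Yield (depth, statement) pairs, walking from an effect down to its causes."""
    yield depth, effect
    for cause in causes.get(effect, []):
        yield from ask_why(cause, depth + 1)


def leaf_causes(effect):
    """Statements where the analysis stopped; rule 2 says you could always keep going."""
    return [statement for _, statement in ask_why(effect) if statement not in causes]


for depth, statement in ask_why("SQL Server was not processing queries"):
    print("  " * depth + statement)
```

The leaves are not "root causes" in any absolute sense; they are only the points where this particular chart stops asking "Why?".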
  • 50. How Fire Works Time Oxygen Heat Fuel Fire Match Strike Transitory Non-Transitory Fire Oxygen Heat Fuel Match Strike -AND- • Transitory Causes act as catalysts to bring about change (think Transition) • Non-Transitory Causes are objects, properties/attributes, and status
  • 51. RCA Diagram Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND-
  • 52. Add Evidence Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND- Statistical Data Situational Observation
  • 53. Examples of Evidence •  Personal experience or observation •  Statistical data (Monitoring Metrics) •  Examples, particular events, or situations that illustrate •  Analogies (comparisons with similar situations) •  Informed opinion (the opinions of experts and authorities) •  Historical documentation •  Experimental evidence
  • 54. Ideas for Finding Causes Causes Management Organization Process Knowledge Technology People Information Applications Infrastructure Capital
  • 55. Step 3: Find Solutions
  • 56. Failure Modes Analysis SQL Server Not Available Transaction log is unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation (Condition Cause) Space allocations are fixed (Condition Cause) Lack of Control SQL is unable to cache query results Available RAM at 0 Bytes Free C: Drive at 0 Bytes free Minidump is configured to write to C: Drive Server was ASRing frequently Software distributions were leaving files in the TEMP folder %TEMP% configured to C:\Temp Kernel able to write to page file -AND- -AND- -AND- -AND- -OR- -AND- -OR-
  • 57. Picking Monitors SQL Server Not Available Transaction log is unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation (Condition Cause) Space allocations are fixed (Condition Cause) Lack of Control SQL is unable to cache query results Available RAM at 0 Bytes Free C: Drive at 0 Bytes free Minidump is configured to write to C: Drive Server was ASRing frequently Software distributions were leaving files in the TEMP folder %TEMP% configured to C:\Temp Kernel able to write to page file -AND- -AND- -AND- -AND- -OR- -AND- -OR- Monitor the intersections at the “OR’s” At least one point along each branch after the “OR”
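The monitor-picking guidance (monitor the intersections at the ORs, plus at least one point along each branch after the OR) can be sketched as a walk over a failure tree. This is a hedged illustration, not the deck's tooling: the `Node` class and `monitor_points` function are hypothetical, the tree is a simplified guess at the slide's diagram, and picking the first node of each branch is just one way to satisfy "at least one point along each branch".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "AND"  # how this node's children combine: "AND" or "OR"
    children: list = field(default_factory=list)


def monitor_points(node):
    """Collect each OR intersection plus the first node on every branch below it."""
    points = []
    if node.gate == "OR" and node.children:
        points.append(node.name)
        points.extend(child.name for child in node.children)
    for child in node.children:
        points.extend(monitor_points(child))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(points))


# Simplified, assumed structure; node names are taken from the slide.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free", "AND", [
            Node("Logs were not truncated"),
            Node("Space allocations are fixed"),
        ]),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
    ]),
])

points = monitor_points(tree)
for name in points:
    print("monitor:", name)
```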
  • 58. FMEA Matrix (Impact Calculation). Severity: Negligible (1-2): no loss in functionality, mostly cosmetic; Marginal (3-4): temporary interruptions, or the degradation lasts for a brief period of time; Critical (5-6): the problem will not resolve itself, but a workaround exists that allows the problem to be bypassed; Serious (7-8): the problem will not resolve itself and no workaround is possible, and functionality is impaired or lost but the system is usable to some extent; Catastrophic (9-10): the system is completely unusable. Frequency: Improbable (1-2): less than 1 time per year; Remote (3-4): 1 time per year; Occasional (5-6): 1 time per month; Probable (7-8): 1 time per day; Chronic (9-10): 1 or more times per day. Detection: Very high (1-2): during the design phase; High (3-4): during peer review or unit testing; Moderate (5-6): during system testing or acceptance testing; Remote (7-8): during or immediately after production deployment; Very Remote (9-10): only after heavy usage by users.
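The slide lists three 1 to 10 scales but the transcript does not show how they are combined. A common FMEA convention is to multiply severity, occurrence, and detection into a Risk Priority Number (RPN); whether this deck uses exactly that formula is an assumption. The sketch below follows that convention, and the example scores are invented for illustration.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA RPN: the product of three scores, each from 1 to 10."""
    scores = (("severity", severity), ("occurrence", occurrence), ("detection", detection))
    for name, score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# Failure mode names come from the slides; the scores are hypothetical.
failure_modes = {
    "T: Drive at 0 Bytes free": (9, 4, 8),
    "C: Drive at 0 Bytes free": (7, 3, 8),
    "Available RAM at 0 Bytes Free": (7, 5, 6),
}

# Highest RPN first, so the riskiest failure modes get monitoring attention first.
ranked = sorted(
    ((name, risk_priority_number(*scores)) for name, scores in failure_modes.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, rpn in ranked:
    print(f"{rpn:4d}  {name}")
```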
  • 59. FMEA Matrix (Evidence)
    •  These are the events that help us to RULE IN a failure mode as a possible cause
    •  These are the events that help us RULE OUT the failure mode as not relevant
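The rule-in / rule-out idea can be sketched as a small triage function. Everything here is illustrative: the event names, the `FailureMode` class, and the choice to let rule-out evidence take precedence are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class FailureMode:
    name: str
    rule_in: Set[str] = field(default_factory=set)   # events that keep the mode in play
    rule_out: Set[str] = field(default_factory=set)  # events that make it not relevant


def triage(modes: Iterable[FailureMode], observed: Set[str]) -> Dict[str, List[str]]:
    """Sort failure modes by the evidence seen so far."""
    result: Dict[str, List[str]] = {"in_play": [], "ruled_out": [], "needs_investigation": []}
    for mode in modes:
        if mode.rule_out & observed:
            result["ruled_out"].append(mode.name)
        elif mode.rule_in & observed:
            result["in_play"].append(mode.name)
        else:
            result["needs_investigation"].append(mode.name)
    return result


# Hypothetical event names for two failure modes from the earlier fault tree.
modes = [
    FailureMode("Transaction log is unable to grow",
                rule_in={"t_drive_full"}, rule_out={"t_drive_has_free_space"}),
    FailureMode("SQL is unable to cache query results",
                rule_in={"ram_exhausted"}, rule_out={"ram_available"}),
    FailureMode("Kernel page file problem",
                rule_in={"page_file_write_error"}),
]

status = triage(modes, observed={"t_drive_full", "ram_available"})
print(status)
```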
  • 60. Application-Technology Matrix
    •  Maps services, applications and technologies, enabling decisions about:
      –  Monitoring investment prioritization
      –  Monitoring maturity
      –  Which templates need to be deployed when new hardware is acquired
      –  Whether a service has sufficient monitoring coverage based on its application components
    •  This approach allows for anticipating changes to a customer’s monitoring needs
    •  Scores indicate: 0 – No Strategy; 1 – Limited Monitoring; 2 – Fully Integrated Strategy
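A minimal sketch of how such a matrix might answer the coverage question. The services, technologies, and scores below are made up, and treating "sufficient coverage" as every technology scoring 2 is an assumption.

```python
SCORE_LABELS = {0: "No Strategy", 1: "Limited Monitoring", 2: "Fully Integrated Strategy"}

# Hypothetical monitoring score per technology.
tech_scores = {"SQL Server": 2, "WebSphere": 1, "MQ": 0}

# Hypothetical mapping of each service to the technologies its applications use.
service_components = {
    "Online Banking": ["WebSphere", "SQL Server"],
    "Payments": ["MQ", "SQL Server"],
}


def weakest_score(service: str) -> int:
    """A service is only as well monitored as its least-covered technology."""
    return min(tech_scores[tech] for tech in service_components[service])


def coverage_gaps(service: str) -> list:
    """Technologies under the service that are not yet at score 2."""
    return sorted(tech for tech in service_components[service] if tech_scores[tech] < 2)


for service in service_components:
    score = weakest_score(service)
    print(service, SCORE_LABELS[score], coverage_gaps(service))
```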
  • 61. Step 4: Use this knowledge intelligently
  • 62. During Service Support
    •  Command Centers and Support Teams
      –  Use the failure modes to rule out causes
      –  Each failure mode will have a documented process to follow to mitigate the impact once the likely failure mode is identified
    •  Incident Managers
      –  Start bridge calls and provide an accounting of all the potential failure modes: which have been successfully ruled out and which still need to be investigated
      –  Coordinate the investigation assignments and consolidate the investigation results
  • 63. Facilitating Production Assurance
    •  CritSits
      –  Start the CritSit meeting and provide an accounting of all the potential failure modes: which have been successfully ruled out and which still need to be investigated
      –  Initiate investigations / experiments by assigning potential failure modes to the incident response teams
    •  Problem Management
      –  Document the causal elements as new failure modes
      –  Disseminate new failure modes to Architecture, the Monitoring Team, and the Command Center/Service Desk
    •  Reporting
      –  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
      –  Incorporate failure modes into “Fault Line” analysis
  • 64. During the Design Process
    •  Architects
      –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
      –  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented
    •  Developers
      –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
      –  Certify the designs implement the mitigation strategies
  • 65. Improving Enterprise Processes and Tools
    •  Systems Management and Monitoring
      –  Develop new monitoring requirements using the documented indications and contraindications
    •  Event Management
      –  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
    •  Configuration Management
      –  Develop new discovery patterns using the documented indications and contraindications
      –  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
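An automation that detects a failure mode condition and raises an event could look roughly like the sketch below. It uses the "drive at 0 bytes free" condition from the fault tree as the example. The event dictionary is a made-up format, not the API of any real Event Management System, and `check_free_space` and its threshold are hypothetical.

```python
import shutil
from types import SimpleNamespace


def check_free_space(path, min_free_bytes, usage_fn=shutil.disk_usage):
    """Return an event dict if free space on `path` is below the threshold, else None."""
    free = usage_fn(path).free
    if free < min_free_bytes:
        return {
            "failure_mode": "Transaction log is unable to grow",
            "indication": f"{path} has {free} bytes free",
            "severity": "critical",
        }
    return None


def full_disk(_path):
    """Stub standing in for shutil.disk_usage so the example is deterministic."""
    return SimpleNamespace(total=100, used=100, free=0)


# With the stub, the condition is present and an event is produced.
event = check_free_space("T:\\", 1024, usage_fn=full_disk)
print(event)
```

Dropping the `usage_fn` argument makes the check query the real disk; forwarding the resulting event is left out because it depends entirely on the event system in use.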
  • 66. A few final thoughts…
  • 67. Running a Good Pre-Mortem
    •  Defer Judgment
    •  Encourage Wild Ideas
    •  Build on Ideas
    •  Stay Focused
    •  One Person at a Time
    •  Be Visual
    •  Go for Quantity
    •  SUCCESSFUL RCA
  • 68. Here is Why It Works RCA Process Re-Establishes Personal Relationships Social Networks Cooling-Off Period De-Escalating Gestures Confidence-Building Measures Trust Building Respect
  • 69. Don’t try to create everything at once. Knowledge is something that is created over time. Iterative Development
  • 70. Let’s keep the conversation going… Andrew.P.White@Gmail.com ReverendDrew SystemsManagementZen.Wordpress.com systemsmanagementzen.wordpress.com/feed/ @SystemsMgmtZen ReverendDrew APWhite@us.ibm.com 614-306-3434