1. Monitoring
Considerations
Monitorama, 2013
John Allspaw
SVP, Technical Operations
Sunday, August 4, 13
I want to warn you that I will lift references from various sources this morning, and I’ll make
sure to point to those further readings I’ll touch on when I post slides.
You can feel free to view those readings as HOMEWORK. Unsurprisingly to anyone who knows
me, a large number of them will be in the field of Human Factors and Safety.
WHO HERE HAS EVER WRITTEN MONITORING SOFTWARE? (alerts, dashboards, graphs, metrics
collection, analysis, display, etc.)
2. “In the long term, Operations as a science
needs to be elevated.”
Chris Brown
Velocity London, 2012
We are at an interesting time in our field.
We are still naive.
We express indignation in terse remarks about our challenges.
We also believe that certainty is something we can attain through the use of technology alone.
This makes the field of web engineering as a whole ADORABLE.
3. Dr. Richard Cook, Velocity US 2012
http://www.youtube.com/watch?v=R_PDc0HFdP0
Dr. Cook explains how the research done in Human Factors and Systems Safety is highly relevant to the
operation of web infrastructures.
“Anytime you find a world in which you have high consequences, high-tempo operations, time pressure, and
lots of complexity...and people are called upon to manage that, you’re going to have these kinds of issues
arise.”
Aviation, patient safety, military, power generation and distribution, space travel, etc.: these fields
are attractive because we see something in them that is familiar.
While we have an opportunity to take ADVANTAGE of LESSONS LEARNED in other fields of
high-tempo/complexity/consequences, it behooves us to think about how we are DIFFERENT
from the other fields.
We also have an opportunity to SIDESTEP some of the quagmires those fields have found
themselves in.
This talk is a tiny effort in this direction.
4. LANGUAGE
In order to support this, I will argue that we need to start paying attention to our language.
1. OTHER DOMAINS ALREADY HAVE A LEXICON, WE CAN BORROW SOME TERMS FROM THEM
2. How we discuss our challenges can play a very large role in how we surmount them.
There are a number of concepts, words, and ideas that need to enter our lexicon, especially
when it comes to monitoring and the challenges that come with making sense of where, what,
how, and why complex systems behave.
5. BETTER
QUESTIONS
One of the OTHER things that has become clear to me is that as a field, we need to ask
BETTER QUESTIONS instead of quickly jumping to CORRECT ANSWERS or SOLUTIONS.
ASKING TERRIBLE QUESTIONS WILL GUARANTEE TERRIBLE SOLUTIONS.
I’m increasingly convinced that the road to progress on such a broad and complicated topic
as monitoring is paved with BETTER QUESTIONS, not NEWER TOOLS.
So you may hear me asking some questions today.
They may or may not be good questions, but I’ll take a stab at it anyway.
6. DOWN and IN
“Down and In”
As the years go by and we see the continued decline of storage prices and the explosion of
accessible processing power, we have an ever-expanding ability to zoom in deeply to the
ways servers and services talk to each other and process information.
WE CAN ZOOM IN ON THE RELATIONSHIPS and BEHAVIORS of SEEMINGLY DISPARATE PIECES
OF DATA...
... AND WE CAN DISCOVER AND DETECT DISRUPTIONS IN SOMETIMES SURPRISING PLACES.
THIS IS INTERESTING.
BUT IT IS ALSO WOEFULLY INCOMPLETE IF WE ARE TO MAKE ANY PROGRESS IN OPERATIONS.
7. UP and OUT
...it is INCOMPLETE because as we ZOOM OUT, what we find is a much-ignored environment
which includes one of the most powerful CONTEXT-SENSITIVE and INCREDIBLY ADAPTIVE
anomaly detection and response agents in the world:
HUMANS
8.
Do we have ANOMALY DETECTION problems? Certainly. One can argue (I will, if you’d like,
later at the bar) that we will ALWAYS have them.
BUT: What I’m interested in is NOT how software can be used to detect anomalies
automatically.
(well, I’m interested, but I don’t doubt that you all will continue to get better at it)
9.
... It is how people navigate this boundary between themselves and the machines they work
with.
The BOUNDARY between humans and machines, as we observe our use of tools, is a focus IN
and OF ITSELF.
If we have any hope of making progress in monitoring complex systems, we must take this
boundary into account.
10.
BUT ABOUT HUMANS: A couple of observations with respect to tools and monitoring
in general.
1. We don’t use a single tool to gain insight into the architectures we build. And we
will not.
2. Teams of people are the NORM, which means communication and coordination
become as important (if not more important) than surfacing anomalies themselves.
3. We bring our BIASES, EXPECTATIONS, TRUST, and PERCEPTIONS to the table. No
tool or piece of automation will change that.
4. Understanding the breakdowns at these boundaries between people and machines
should be a part of how we approach design of tools and organizational behaviors.
12. OODA Loop
Observe Orient Decide Act
credit: http://blog.b3k.us/ooda.html
WHO IS FAMILIAR WITH Col. John Boyd’s OODA Loop?
Observation and orientation are places where we can look to make progress.
When we get alerted, look at dashboards, graphs and logs, we’re looking to make sense of
the past and project into the future.
NOTE: Observe and Orient are not Unix commands, they are HUMAN ACTIVITIES.
13. We need to understand how people make sense of
what is going on
SO: Writing code to TELL COMPUTERS WHAT TO LOOK AT is quite different than making sure
that the code’s human supervisors are equipped or aided in what to look at when an alert
goes off.
How people make sense of what is going on (in diagnosis? In planning? In response? In
control?) is just plain HARD.
14. We need to understand how normal
work is getting done by normal people
in normal situations.
If we don’t understand how people consume, adapt to, work around, and make use of tools
under “normal” operating conditions, how can we have confidence that our designs will
perform under uncertain or escalating scenarios?
15. Work As Imagined
Work As Done
Our clues on how we THINK we work guide our design decisions.
But there is a gap between how we think we work, and how we actually work.
How large is this gap? How will we know when it’s too large?
16. Where is design?
“The system should therefore be designed so
that human adaptation is ENHANCED.”
Erik Hollnagel
Expertise and Technology: Cognition & Human-Computer
Cooperation, 1995
Design thought should be in tools, displays, controls, and processes.
What do we have to work with, though?
“It is the expertise of the human operator
that makes it possible to adapt the
performance of the joint system, in real
time, to unexpected events and
disturbances. Every working day, across the
whole spectrum of human enterprise, a large
number of near-misses are prevented from
turning into accidents only because human
operators intervene...”
17.
Whether we know it or not, we are ALL designers now, if we build tools intended to aid
monitoring.
I’m not just talking about UI and garden-variety HCI work, but those topics should be
considered table stakes.
18. Where is design?
http://www.perceptualedge.com/articles/visual_business_intelligence/time_on_the_horizon.pdf
VISUAL PERCEPTIONS and UI approaches are integral to our field, so we should try to
understand them as deeply as we can.
Armed with the knowledge that every element of design can (and will) be misused (like these
Horizon Graphs), we are left with a dilemma:
How can we understand what can augment human capabilities without getting in the way,
and without having to first restart our careers as Human Factors experts?
WE FAKE IT UNTIL WE MAKE IT
21. Principles of Display Design
• Principle of information need
• Principle of legibility
• Principle of display integration/proximity
• Principle of pictorial realism
• Principle of the moving part
• Principle of predictive aiding
• Principle of discriminability: status versus command
Wickens, Lee, Liu, Becker
An Introduction to Human Factors Engineering
Here is another great pointer on display design, from “AN INTRODUCTION TO HUMAN
FACTORS ENGINEERING”.
22. Cognition In The Wild
“It is notoriously difficult to generalize
laboratory findings to real-world
situations.”
So let’s leave design for a moment and talk about how we can VALIDATE our design choices.
We CANNOT hope to understand how people behave in real-world scenarios BY USING OUR
IMAGINATION alone.
How many of you work at a company where funnel or clickstream analysis is being done?
How many of you have done clickstream or funnel analysis on your monitoring dashboards, graphs,
and displays?
What sort of information might we find when we gather data on how people navigate metric data
during varying scenarios?
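As a starting point for that last question, here is a minimal sketch of splitting dashboard views by whether an incident was open at the time. Everything in it is an assumption for illustration: the event format (a dashboard name paired with an incident flag) and the `views_by_context` helper are hypothetical, and the sample events are invented.

```python
from collections import Counter


def views_by_context(events):
    """Count dashboard views, split by whether an incident was open.

    Each event is a hypothetical (dashboard_name, incident_open) pair.
    """
    counts = {"incident": Counter(), "normal": Counter()}
    for dashboard, incident_open in events:
        key = "incident" if incident_open else "normal"
        counts[key][dashboard] += 1
    return counts


# Invented sample of view events.
events = [
    ("overview", False),
    ("overview", False),
    ("db_latency", True),
    ("overview", True),
    ("db_latency", True),
]
counts = views_by_context(events)
print(counts["normal"].most_common())
print(counts["incident"].most_common())
```

Even a tally this crude can show whether the displays people reach for during an incident are the ones we designed for that purpose.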
23. ALERT
DESIGN
- Who has ever gotten a page and ignored it?
Endsley: At a safety expert conference, in a 300-person hall, only 3 people got up for a fire alarm.
- How many alerts were received in the past week that were not actionable? (no human action was
required?)
- How many alerts were received in the past week as a result of known work being done, but alerts
were not silenced during that period?
- How many alerts were received as a result of a previously silenced alert (because work was being
done) that was mistakenly un-silenced?
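One way to start answering these questions is to tally a week of alerts by outcome. The sketch below is illustrative only: the `Alert` record and its fields are hypothetical, not taken from any particular monitoring tool, and the labels would have to come from people reviewing what actually happened.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical fields; a person has to supply these labels after the fact.
    name: str
    actionable: bool          # did a human need to do anything?
    during_known_work: bool   # did it fire while planned work was under way?


def audit(alerts):
    """Count alerts by the categories asked about above."""
    tally = Counter()
    for alert in alerts:
        tally["total"] += 1
        if not alert.actionable:
            tally["not_actionable"] += 1
        if alert.during_known_work:
            tally["fired_during_known_work"] += 1
    return tally


# Invented sample week.
week = [
    Alert("disk_full", actionable=True, during_known_work=False),
    Alert("disk_full", actionable=False, during_known_work=True),
    Alert("cpu_high", actionable=False, during_known_work=False),
]
print(audit(week))
```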
24. Jack Garman
Flight controller
NASA Mission Control
Apollo Program (Murray and Cox 1990)
“A program alarm could be triggered by trivial problems that
could be ignored altogether.
Or it could be triggered by problems that called for an immediate
abort.
How to decide which was which?
"We wrote ourselves little rules like
'If this alarm happens and it only happens once, don't worry
about it. If it happens repeatedly, but other indicators are okay,
don't worry about it.'"
25. Operator, interviewed.
The Three Mile Island
nuclear power plant, following the
accident. (Kemeny 1979)
“I would have liked to have thrown away the alarm panel. It
wasn't giving us any useful information."
Comment by one operator at the Three Mile Island nuclear
power plant
to the official inquiry following the TMI accident (Kemeny 1979).
26. Physician, explaining how they
respond to a nuisance alarm on a
device in the operating room.
(Cook, Potter, Woods and McDonald 1991)
"When the alarm kept going off then we kept shutting it [the
device] off [and on] and when the alarm would go off [again],
we’d shut it off.”
“... so I just reset it [a device control] to a higher temperature. So
I kinda fooled it [the alarm]...”
27. SIGNAL
DETECTION
THEORY
Signal Detection Theory
- Too sensitive, and you’ll get false alarms
- Not sensitive enough, and you’ll get missed alarms
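The trade-off can be made concrete with a toy threshold detector. This is a minimal sketch, not a full signal detection analysis: the readings and their labels are invented for illustration, and `classify` is a hypothetical helper.

```python
def classify(readings, threshold):
    """Return (hits, false_alarms, misses, correct_rejections).

    `readings` is a list of (value, is_real_problem) pairs; an alert
    fires whenever value >= threshold.
    """
    hits = false_alarms = misses = correct_rejections = 0
    for value, is_real_problem in readings:
        fired = value >= threshold
        if fired and is_real_problem:
            hits += 1
        elif fired and not is_real_problem:
            false_alarms += 1
        elif not fired and is_real_problem:
            misses += 1
        else:
            correct_rejections += 1
    return hits, false_alarms, misses, correct_rejections


# Invented data: (metric value, was there really a problem?)
readings = [(50, False), (65, False), (70, True), (80, False), (90, True)]

for threshold in (60, 75, 95):
    print(threshold, classify(readings, threshold))
```

On this data, lowering the threshold removes the misses at the cost of false alarms, and raising it does the reverse. No single threshold eliminates both, because a harmless reading of 80 sits between the two real problems.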
28. ALERT DESIGN
Mica Endsley
Designing for Situation Awareness
What about the context people are in when they
experience a FALSE ALERT?
Or a MISSED ALERT?
29. [Diagram: cognitive processing of an alarm signal. Labels shown: Alarm Signal; Other Situational Information; Expectancies; Past History; Mental Model; Interpretation; Integration; Decision; Response]
Designing for Situation Awareness, Mica Endsley
The cognitive processing of an alarm signal.
When we DESIGN ALERTS, we HAVE to think about the
various ways that the ALERT could be interpreted or
acted on. Often times, we will PUNT on aiding the
operator with CONTEXT.
30. Critical Care & Anesthesiology
• Monitors & alarms designed to “never miss”
• 566 deaths reported related to alarms
(2005-2008)
• Most associated with the silencing function
• ECRI’s #1 health technology hazard, 2012 & 2013
And you have complaints about Nagios’ “set downtime” feature?
The Emergency Care Research Institute (ECRI) recently
identified alarms as the “number one health technology hazard”
for 2012.
And you have complaints about Nagios’ “set downtime” feature?
31. ALERT DESIGN
Confirmation
- Because false alarms are a problem, people will spend time not
reacting to an alert, but confirming that the alert is legit.
- Pilots delay responding to GPWS (Ground Proximity Warning
System) 73% of the time, because they’re looking out the window
to confirm it’s true, and how true it is.
What are ways we can SUPPORT CONFIRMATION or
VALIDATION in our alert design?
32. ALERT DESIGN
Expectancy
- People’s expectancies can also affect their interpretation of alerts.
- In many cases, people EXPECT the alert to go off, as the result of their own actions.
- In a study in 2001, 6% of operating room alarms were found to be expected or anticipated.
- This can become a nuisance, and further degrade the trust in the alerts.
- Example: disk space alerts that happen during a backup, and then recover.
- Example: someone on the team doing work, and not silencing the alerts temporarily.
BONUS: when the time period for which an alert is silenced passes, and the condition isn’t acceptable yet
(downtime expiring).
What are ways that we could SUPPORT EXPECTANCY in our alert design?
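One possible answer, sketched very roughly: let the team tell the alerting system what it expects, and have alerts annotated rather than silently dropped. The `ExpectedWindow` record, the `annotate` helper, and the `grace` period are all hypothetical names and choices made for this illustration; the timestamps are plain integers to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class ExpectedWindow:
    # Hypothetical record of "we expect this alert between these times".
    alert_name: str
    start: int   # seconds since some epoch
    end: int
    reason: str


def annotate(alert_name, fired_at, windows, grace=600):
    """Describe how an alert relates to what the team said it expected."""
    for w in windows:
        if w.alert_name != alert_name:
            continue
        if w.start <= fired_at <= w.end:
            return f"expected ({w.reason})"
        if w.end < fired_at <= w.end + grace:
            # The "downtime expiring" case: the window is over, the condition is not.
            return f"expected window just ended ({w.reason}); condition persists"
    return "unexpected"


windows = [ExpectedWindow("disk_space", 1000, 2000, "nightly backup")]
print(annotate("disk_space", 1500, windows))
print(annotate("disk_space", 2300, windows))
print(annotate("disk_space", 9000, windows))
```

The design choice here is to keep the human informed of the relationship between the alert and their own expectations, instead of deciding on their behalf whether the alert matters.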
33. ALERT DESIGN
• Signal:Noise can be difficult
• Easy to err on more false alarms
• Decay in trust
• Origins: Undetectable conditions
- Signal:Noise can be difficult to get right
- General view: err on the side of too many false alarms. This ignores the detrimental effect
they have on humans.
- A 1998 study of new ATC systems found missed alerts at 0.2% and false alarm rates at 65%.
- Underlying false alerts: not the functioning of algorithms themselves, but the CONDITIONS
AND FACTORS THAT THE ALARM SYSTEMS CANNOT DETECT OR INTERPRET
Ex: Cincinnati Airport - the terrain of the riverbank leading up to a runway rises, which causes an alarm
because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with
the airport ignore the alarms.
34. Information is not a
scarce resource.
Attention is.
Herb Simon, 1991
http://csel.eng.ohio-state.edu/productions/woodscta/media/diagnosis.pdf
35. Directed Attention
• Attention focusing
• Attention switching
• Dynamic Prioritization
We work in a COGNITIVELY NOISY WORLD, even when there is NOT an outage going on.
Alerts are ESSENTIALLY ATTENTION DIRECTORS.
The main challenge for DYNAMIC FAULT MANAGEMENT (HF term) in design is to support:
- ATTENTION FOCUSING
- ATTENTION SWITCHING
- DYNAMIC PRIORITIZATION
By getting to know how human attention works (and its relationship to context, perception,
etc.), we can hope to design better alerts.
36. Interrupts AND
Underspecification
1. “Here is the data I want you to see”
2. “Here is why I think you would find it interesting”
An alert is essentially an INTERRUPT.
TWO STATES:
1 - HERE IS THE DATA I WANT YOU TO SEE
2 - HERE IS WHY I THINK YOU WOULD FIND IT INTERESTING
What can we do to support #2?
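As a small illustration of what supporting #2 might look like, here is a sketch of an interrupt that carries its own reason alongside its data. The `Interrupt` class, its fields, and the idea of a stored "usual range" are assumptions made for this example, not a description of any existing alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Interrupt:
    # Hypothetical alert payload: the reading plus the range we normally see.
    metric: str
    value: float
    typical_low: float
    typical_high: float

    def data(self):
        # State 1: "here is the data I want you to see"
        return f"{self.metric} = {self.value}"

    def why(self):
        # State 2: "here is why I think you would find it interesting"
        usual = f"usual range {self.typical_low} to {self.typical_high}"
        if self.value > self.typical_high:
            return f"above the {usual}"
        if self.value < self.typical_low:
            return f"below the {usual}"
        return f"within the {usual}"


alert = Interrupt("checkout_latency_ms", 950, 120, 300)
print(alert.data())
print(alert.why())
```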
37. Paradox
Of
Directed Attention
An alert is essentially an interruption to everyday work, and there is a paradox at the heart of
DIRECTED ATTENTION.
1. We are always busy!
2. Shifting attention has a very real cost!
3. Not all signals are worth paying attention to; context-sensitivity will always vary
4. So how can you SKILLFULLY IGNORE a SIGNAL that should NOT SHIFT YOUR ATTENTION
WITHOUT first processing it....IN WHICH CASE IT HASN’T BEEN IGNORED.
“Given that the supervisory agent is loaded by various other task related demands, how does one
interpret information about the potential need to switch attentional focus without interrupting or
interfering with the tasks or lines of reasoning already under attentional control. We can state this
paradox in another way: how can one skillfully ignore a signal that should not shift attention within
the current context, without first processing it -- in which case it hasn't been ignored.” - David
Woods
David Woods has suggested some ways to break this paradox; he calls the approach PREATTENTIVE
REFERENCE.
I’ll let you discover his suggestions on your own.
38. Directed Attention
Sorting through
an avalanche of
data
Picking up on
subtle early
indications of a
fault
This idea of an alert DIRECTING OUR ATTENTION can exist in two views:
SORTING THROUGH AN AVALANCHE or PICKING UP SUBTLE/EARLY INDICATIONS....
So....which is it?
IT CAN BE BOTH!
“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data -- a data overload problem. This
is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the
background of a quiescent monitored process.”
39. Context Sensitivity
The background and context in which a SIGNAL arrives can play a huge role in how it can
HELP or HINDER us.
If the background is one of QUIET, contrast is HIGH. <- this is what most designers plan for
If the background is ONGOING DIAGNOSIS, then SIGNAL can SUPPORT/CONTRADICT existing
hypothesis
If the background is EXECUTING A RESPONSE, then SIGNAL can cue that the RESPONSE is WRONG
or INCOMPLETE.
In any case, the ALERT’s MEANING will change as CONTEXT and BACKGROUND changes.
40. Data Overload
This is simply a tough problem.
There are approaches to solve it, but none of them to date are effective given the rate at
which new pieces of data are being collected and stored.
There is significant agreement among those who study data overload phenomena that the
critical piece to understand is CONTEXT SENSITIVITY.
Some HF researchers have pointed at something that may help reduce the effects of data overload:
Depicting RELATIONSHIPS between data in a known FRAME of REFERENCE, as opposed to the
raw data.
What can we do as designers to aid surfacing those relationships?
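A very small example of the difference between raw data and a relationship in a frame of reference: expressing each metric relative to its own baseline. This is only a sketch under stated assumptions. The `relative_to_baseline` helper is hypothetical, the numbers are invented, and a ratio to baseline is just one of many possible frames of reference.

```python
def relative_to_baseline(current, baseline):
    """Express each metric as a ratio to its own baseline value."""
    return {
        name: current[name] / baseline[name]
        for name in current
        if baseline.get(name)  # skip metrics with a missing or zero baseline
    }


# Invented numbers for illustration.
baseline = {"requests_per_s": 400, "errors_per_s": 2}
current = {"requests_per_s": 800, "errors_per_s": 20}
print(relative_to_baseline(current, baseline))
```

The raw values 800 and 20 say little on their own. The ratios show that traffic doubled while errors grew tenfold, which is a relationship a person can reason about.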
41. How have I taken the
OPERATOR into account?
PEOPLE use monitoring tools.
Arguably, MACHINES use monitoring tools we build, as well.
But only PEOPLE can adapt and improvise with a given tool outside of the original intentions
of its designer.
42. Am I hurting or helping:
•Data overload or underload?
•Salience?
•Directed attention?
•Interruptibility?
When we design alerts and monitoring tools, we should be asking these questions.
In addition: HOW WILL WE KNOW WHEN THIS DESIGN WOULD HURT those things?
43. Joint Cognitive Systems
One final thought: what if, instead of the view that the BOUNDARY is a large barrier to be
hurdled only by our writing increasingly complex code...we view that boundary as a place for
an actual cooperative RELATIONSHIP?
44. Joint Cognitive Systems
What if we viewed an alerting system
as a PARTNER, instead of a subordinate?
What if we viewed alerting systems as PARTNERS, instead of subordinates or otherwise
dumb messengers delivering news to us?
What does the world look like if we designed alerts to COOPERATE with us?
If TRUST in alerting systems is such a big deal....
WHAT can we learn from how HUMANS learn to trust each other, and let that influence our
design decisions?
In other words: how can we design alerts that SUPPORT our confirming their legitimacy, or
our expectations when an alert will fire? Is context-sensitivity part of this?
We see some blunt versions of these notions:
1 - Time periods for alerts, so that people aren’t woken up for things that can wait until
morning (the machine has been given some context about our availability to pay attention to
an alert)
2 - Rough dependency relationships, so we don’t send a bazillion alerts when a known SPOF
dies
What other examples can we think of, where the COMPUTERS can attempt to understand,
predict, or observe US, as we work?
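The second of those blunt versions, rough dependency relationships, can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool does it: `alerts_to_send` and the `depends_on` map are hypothetical, and only direct dependencies are checked.

```python
def alerts_to_send(failing, depends_on):
    """Drop alerts for services whose direct dependency is also failing.

    `failing` is a set of service names; `depends_on` maps a service to
    the services it relies on. Both are hypothetical inputs.
    """
    send = []
    for service in sorted(failing):
        upstream = depends_on.get(service, [])
        if any(dep in failing for dep in upstream):
            continue  # a failing dependency already explains this one
        send.append(service)
    return send


depends_on = {"web": ["db"], "api": ["db"], "db": []}
print(alerts_to_send({"web", "api", "db"}, depends_on))
print(alerts_to_send({"web"}, depends_on))
```

When the shared database dies, one alert goes out instead of three. When only the web tier fails, it is still reported.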
45. The End
My hope is that I’ve been able to ask BETTER QUESTIONS, and I can kick off this conference
with food for thought.
You can tell me how that food tastes at the bar later.
46. Can We Ever Escape From Data
Overload?
A Cognitive Systems Diagnosis
Woods, Patterson, Roth 1999
http://csel.eng.ohio-state.edu/productions/woodscta/media/diagnosis.pdf
47. The Alarm Problem and
Directed Attention in
Dynamic Fault Management
Woods 1995
http://csel.eng.ohio-state.edu/woods/foundations/directed%20att.pdf