Actionable Alarm Management
Alarm (Event|Network|Service) Management is a big industry, but so little has been written on the
topic that we thought we would put together our ideas on what should become the main capabilities
of a modern Alarm Management system.
Alarm Management’s goal is to provide a nervous system for your network and IT infrastructure.
When something goes wrong, it is detected and an alarm is processed for operators to action. The
intention is to be notified before, or immediately when, something goes wrong in order to minimise the
restoration time. The systems were deployed to collect everything that you would otherwise need
someone to sit and watch for. When networks were small, simple, less robust and less
redundant, and when there was plenty of headcount, this was okay, but these days operations have fewer
people for much bigger networks. No one wants to manage alarms anymore, and quite rightly. Alarms
are just a signal that can trigger a resolution before a customer notices or generates a
complaint. With enough effort, Alarm Management can allow organisations
to rectify service disruptions as soon as they appear, and maybe before.
Alarms are signals which you can choose to act upon or not. What we propose is a system that
scans historical data for trends and takes operations input to provide the most actionable alarms, so
that the business-impacting abnormalities are not lost in the noise.
State(ful) of the Art
Our view of the alarm world.
In the Real World
Real-time is too fast.
White Swan
Tomorrow is likely to be like yesterday.
Predictable alarms and why you should ignore them, just for a little while. Trust us.
Black Swan
Unknown, unknowns. How to detect them and what to do next.
Lost Cause Analysis
Why Root Cause Analysis can become a lost cause, or just a weak excuse for bad Alarm
Management
Auto Ticketing
What you need to check for before you trigger a truck roll.
Alarms to Action
Copyright FirstOccurrence 2012-3 v1 Final
Alarms to Filters, not Filters to Alarms.
Actionable
Alarms are meant to initiate resolution action, not to be hoarded and ignored.
Alarm Management Database
Next steps to improve alarm management
State(ful) of the Art
It’s our observation that Alarm Management in IT and
Telecommunications is in a state of decay. The number and variety of
alarms has increased exponentially as new systems are integrated, and
it is becoming increasingly impossible for operators to distinguish
critical events from nuisance alarms. We believe that the Alarm
Management systems were initially functional and provided fantastic value, but
over time, as new alarm streams were added, no rationalisation
occurred. This happened because the cost of adding new alarms is low and the cost of reducing alarms
is high.
We see evidence of companies willing to throw away their investment because they no longer
believe that their Alarm Management system is adding value. Others want to move the alarms
to a ticketing system, since that process is what will actually force them to rationalise
alarms. While we push for automation wherever possible, we also think Alarm Management systems
have a big role in helping operations to trigger action (manual or automated) and in quickly enabling them
to detect the real root cause and escalate.
In the Real World
Real-time alarming is a major selling point for most Alarm Management systems. In fact it is an
expected and not very exciting capability these days, but we have come to question the need for it
from a real-world perspective. Our analysis of critical alarms frequently found that a vast majority
(80%) would actually resolve themselves within two or three minutes. To us, this means that the
same problems just keep on happening. We can find these alarms by looking at the Mean Time to
Resolution (MTTR).
We can use the MTTR metric as an input to suppress an alarm from presentation or further
actions until the service is given a chance to rectify itself. So if the MTTR is three minutes and the
alarm has not cleared after this period then make the alarm actionable.
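The hold-back rule above can be sketched in a few lines. This is only an illustration under our own assumptions: the `Alarm` class, its field names and the `is_actionable` helper are hypothetical and do not come from any particular Alarm Management product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    # Hypothetical minimal alarm record, for illustration only.
    alarm_type: str
    first_occurrence: datetime
    cleared: bool = False


def is_actionable(alarm: Alarm, mttr: timedelta, now: datetime) -> bool:
    """Hold an alarm back until it has outlived the MTTR for its type."""
    if alarm.cleared:
        return False  # the service rectified itself; nothing to present
    return now - alarm.first_occurrence > mttr


raised = datetime(2013, 1, 1, 12, 0, 0)
alarm = Alarm("Interface Down", first_occurrence=raised)
mttr = timedelta(minutes=3)

# Two minutes in, the alarm is still being given a chance to clear.
held = is_actionable(alarm, mttr, now=raised + timedelta(minutes=2))
# Four minutes in, it has outlived the MTTR and becomes actionable.
presented = is_actionable(alarm, mttr, now=raised + timedelta(minutes=4))
```

The same check could be run on a timer against the event list, so that alarms only appear once they have outlived their expected self-resolution time.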
So why was real-time at one stage such a selling point? The idea was that you could
decrease the MTTR or improve service levels by showing faults to operators as fast as possible to
action. In practice, most operators are juggling multiple faults at once and real-time notifications are lost in
the noise. We propose that real-time alarm management is not required unless something
extraordinary occurs (aka a Black Swan).
White Swan
Many critical alarms are so common they happen practically every day. We see these alarms all
the time. Maybe the Network Element is on a low-grade network or it is suffering from a known
problem that operations cannot assist with. The problem is that this alarm just keeps popping up, clogging
your event list and ticketing systems and, most of all, taking up precious attention. Sometimes these
alarms have outages lasting seconds and quickly rectify themselves. We call these White Swans: they
are predictable problems that recur frequently but aren't really actionable.
A White Swan is the predictable and the day-to-day, and we think
they should be ignored, at least for a little while. We suggest
operations look at the mean time to resolve (MTTR) for the
alarm type and, if it is capable of self-resolution, give it time before
triggering a response.
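As a rough sketch of how the MTTR per alarm type might be derived from history: the sample records below are made up for illustration, and the `mttr_by_type` helper is a hypothetical name rather than part of any product.

```python
from collections import defaultdict
from statistics import mean

# (alarm_type, seconds from raise to clear). Made-up sample history.
history = [
    ("Interface Down", 45),
    ("Interface Down", 75),
    ("Interface Down", 60),
    ("Power Failure", 5400),
    ("Power Failure", 9000),
]


def mttr_by_type(records):
    """Average time to resolve, in seconds, grouped by alarm type."""
    durations = defaultdict(list)
    for alarm_type, seconds_to_clear in records:
        durations[alarm_type].append(seconds_to_clear)
    return {alarm_type: mean(values) for alarm_type, values in durations.items()}


mttr = mttr_by_type(history)
```

With this sample data, "Interface Down" averages one minute and is a candidate for being given time to clear, while "Power Failure" averages two hours and is not.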
Black Swan
With alarms you will never know what will happen next.
Networks live in the real world. As with life and stock markets, it’s next to
impossible to predict what will happen; however, it’s likely that
tomorrow will be largely like yesterday. Most alarm patterns will fall into a normal distribution
where there isn’t a huge amount of variety or volatility.
However, it’s likely the biggest outage will be a surprise: something that has never happened
before or very rarely occurs. Unfortunately, with current systems it’s impossible to distinguish between
the rudimentary and the extraordinary.
There are many techniques to find the variety (colour) of the swan from your alarm patterns;
however, the simplest is often the best. Based on our experience, we use the following rule
matrix:
Swan   Metric                          Indicates
-----  ------------------------------  ---------------------------------------------------------------------
White  Low Mean Time to Resolve        Likely to resolve itself.
Grey   High Mean Time to Resolve       Unlikely to resolve itself.
White  Low Mean Time Between Failure   Likely to resolve itself.
Grey   High Mean Time Between Failure  Unlikely to resolve itself. Likely to have a bigger business impact.
Black  No MTBF (unknown)               Unlikely to resolve itself. Likely to have a business impact.
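One possible reading of the rule matrix as code is sketched below. The paper gives no numeric cut-offs, so both thresholds are assumptions, as is the choice to require a low MTTR and a low MTBF together before calling a swan white; `classify_swan` is a hypothetical helper.

```python
from typing import Optional

# Assumed cut-offs; the rule matrix itself gives no numeric thresholds.
LOW_MTTR_SECONDS = 180        # "two or three minutes"
LOW_MTBF_SECONDS = 24 * 3600  # recurs at least about once a day


def classify_swan(mttr: Optional[float], mtbf: Optional[float]) -> str:
    """Return 'white', 'grey' or 'black' for an alarm's history."""
    if mtbf is None:
        # No MTBF: the alarm has no history of recurring.
        return "black"
    if mttr is not None and mttr <= LOW_MTTR_SECONDS and mtbf <= LOW_MTBF_SECONDS:
        # Clears quickly and recurs often: predictable noise.
        return "white"
    # Slow to clear or rare: unlikely to resolve itself.
    return "grey"


frequent_flap = classify_swan(mttr=60, mtbf=3600)
rare_long_outage = classify_swan(mttr=7200, mtbf=30 * 24 * 3600)
never_seen = classify_swan(mttr=None, mtbf=None)
```

An operations team could equally treat either metric alone as sufficient for a white classification; the combination rule is a tuning decision rather than something the matrix dictates.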
Lost Cause Analysis
Alarm management systems are often accused of overwhelming
their users, resulting in missed outages or elevated MTTR. A
frequently cited resolution to this problem is “Root Cause
Analysis” (RCA). Unfortunately identification of the true root cause is
nearly impossible with current monitoring levels, since an alarm is almost always a consequence or a
symptom of something else that is not monitored or measurable.
We propose a test: if the alarm can trigger an automated resolution action, it has truly found the
root cause.
An end-to-end root cause analysis is a worthy goal, and while it is possible with significant investment,
for those without the scale to invest we propose the simpler approach of giving human
operators the power to tune granular alarm life cycles. This approach will also reduce MTTR and sharpen business focus
by identifying actionable abnormalities using an Alarm Management Database (AMDB).
Auto Ticketing
Right now, if there is a common goal in Alarm Management, it is auto-ticketing. It’s as if the RCA
challenge has been overcome or people have just given up trying to work with alarms. If your
network has alarms that trigger a response then they should be ticketed, but be careful not to just
move your problem of “too many alarms” to another interface: tickets.
So before triggering a truck roll, we suggest the following conditions should be met:
1. Alarm indicates a service disruption (defined by business attributes, transparently)
2. No other alarm indicating the same service disruption has already been detected or auto-ticketed.
a. Using Root Cause (or alarm cluster, site or hostname duplication detection)
b. Duplicate detection
3. Alarm age (time from First Occurrence) is greater than the average MTTR for that specific
alarm or an average for that alarm type
a. Unless MTBF is unknown or very high (see White Swan and Black Swan above)
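The conditions above could be combined into a single gate before a ticket is raised. The sketch below is illustrative only: `AlarmContext`, its fields, the `should_auto_ticket` helper and the "very high" MTBF cut-off are all assumptions, and duplicate detection is reduced to a simple set of services that already have a ticket.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Assumed cut-off for a "very high" MTBF; tune to your own network.
VERY_HIGH_MTBF_SECONDS = 30 * 24 * 3600


@dataclass
class AlarmContext:
    # Hypothetical fields, for illustration only.
    service: str
    disrupts_service: bool
    age_seconds: float
    mttr_seconds: float
    mtbf_seconds: Optional[float]  # None when the alarm has no history


def should_auto_ticket(alarm: AlarmContext, ticketed_services: Set[str]) -> bool:
    # 1. The alarm must indicate a service disruption.
    if not alarm.disrupts_service:
        return False
    # 2. The same disruption must not already have a ticket.
    if alarm.service in ticketed_services:
        return False
    # 3a. Rare or never-seen alarms skip the waiting period.
    if alarm.mtbf_seconds is None or alarm.mtbf_seconds >= VERY_HIGH_MTBF_SECONDS:
        return True
    # 3. Otherwise wait until the alarm has outlived its average MTTR.
    return alarm.age_seconds > alarm.mttr_seconds


young = AlarmContext("broadband-east", True, age_seconds=60,
                     mttr_seconds=180, mtbf_seconds=3600)
stale = AlarmContext("broadband-east", True, age_seconds=600,
                     mttr_seconds=180, mtbf_seconds=3600)
unseen = AlarmContext("core-voice", True, age_seconds=5,
                      mttr_seconds=180, mtbf_seconds=None)
```

Here `young` would be held, `stale` would be ticketed, and `unseen` would be ticketed immediately because it has no failure history to wait on.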
Alarms to Action
Most alarm management systems give the NOC or the operator the ability to filter alarms. The
process is generally ad hoc and results in a fairly large SQL filter. We regard this as filters matching
alarms:
    Filter > Alarms > Operator
We think instead alarms should be assigned to filters:
    Alarm > Filter > Operator
While this doesn’t sound like there is much of a difference, there is. In the old way it is very
difficult to measure how many alarms a filter is catching. Without being able to measure, it is difficult to
improve or to understand the daily fluctuations that an operator experiences. This means it is hard to
know if the NOC could handle more or less.
Therefore we recommend alarm filter attribute tagging. Giving the NOC the ability to tag
alarm types (e.g., Interface Down) with one or more filters, and to assign the filter to the tag, means you
can:
• Measure volume
• Model future volume with new alarm sources
• Provide easier filtering options
• Find easy targets for automation
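Once alarm types carry filter tags, measuring volume per filter becomes a simple count. The sketch below assumes a hypothetical tag mapping and a made-up list of the day's alarms; none of the tag names come from a real deployment.

```python
from collections import Counter

# Hypothetical tagging of alarm types with filter tags.
filter_tags = {
    "Interface Down": {"transmission", "tier1"},
    "High CPU": {"servers"},
    "Power Failure": {"facilities", "tier1"},
}


def volume_per_filter(alarm_types, tags):
    """Count how many alarms each filter tag would catch."""
    counts = Counter()
    for alarm_type in alarm_types:
        # Alarm types nobody has tagged yet are counted separately.
        for tag in tags.get(alarm_type, {"untagged"}):
            counts[tag] += 1
    return counts


todays_alarms = ["Interface Down", "Interface Down", "High CPU",
                 "Power Failure", "Fan Fault"]
volume = volume_per_filter(todays_alarms, filter_tags)
```

Running the same count over a projected alarm feed from a new source gives a first estimate of the extra load each filter, and therefore each operator, would see.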
Most Actionable
Root Cause looks for the probable source of the fault. Most Actionable is the process of
determining what faults are the most actionable according to:
• Service Impact
• Historic performance
We define an actionable alarm as an outage that impacts a service and/or requires manual
intervention to resolve. Fully automated responses are not actionable until that process ends
without providing the resolution outcome. Not all alarms require operator intervention, and our
analyses have found that a majority of alarms will resolve themselves in a short enough amount of time
that manual intervention wouldn’t be possible. To provide operations focus, we suggest the following
criteria that every alarm must pass before being presented or ticketed. The answer must be yes to the
following questions to warrant the raising of a ticket:
Does the alarm indicate an outage that could impact the business operation?
Can the alarm/outage resolve itself?
Is the alarm out of the ordinary (i.e., has it never happened or does it rarely happen)?
Has the alarm gone past the point of resolving itself?
Most Actionable is compatible with RCA, as ideally the output from RCA is Most Actionable.
However, Most Actionable can still be used without it. For example, if someone makes a change on
a device, we receive a Configuration Change alarm, and then the EMS detects this was an unauthorised
change and generates an “Unauthorised change” alarm. The root cause is the Change event, but
arguably the change alarm is the closest thing to the root cause of the unauthorised change. So for
this example the NOC would have correctly identified the “Unauthorised change” alarm as being
actionable and to be presented to operations.
Introducing an Alarm Management Database
An Alarm Management Database (AMDB) is a system that gives operations the ability to define alarm
life cycles. An AMDB will provide visibility of what is being managed and enable users to quickly
adapt alarm behaviour to changing network or customer needs.
We propose that an AMDB has the following key capabilities:
• Provide MTTR and MTBF metrics
• Embed these values to suppress events according to the probability of “self resolution”
• Display quickly Alarm Scenarios that have never occurred previously
• Provide an interface to assign alarm lifecycle attributes
    • Actionable
    • AutoTicket
    • Dwell
    • FilterTags
• Provide an interface to assign business attributes
• Configurable granularity to AlarmType, Node, Location and Unique Alarm Identifier
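To make the lifecycle attributes and configurable granularity more concrete, here is a minimal sketch of how an AMDB lookup might work. The attribute names follow the list above; the `Lifecycle` class, the rule keys, the sample node names and the precedence order (most specific level wins) are our assumptions, not a description of any shipping product.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Lifecycle:
    # The four lifecycle attributes listed above, with assumed defaults.
    actionable: bool = True
    auto_ticket: bool = False
    dwell_seconds: int = 0
    filter_tags: FrozenSet[str] = frozenset()


# Assumed precedence: the most specific level that has a rule wins.
GRANULARITY = ("identifier", "node", "location", "alarm_type")


def lookup(rules: Dict[Tuple[str, str], Lifecycle],
           alarm: Dict[str, str]) -> Lifecycle:
    """Find the lifecycle rule for an alarm at the finest matching level."""
    for level in GRANULARITY:
        rule = rules.get((level, alarm.get(level, "")))
        if rule is not None:
            return rule
    return Lifecycle()  # no rule defined: fall back to the defaults


rules = {
    ("alarm_type", "Interface Down"): Lifecycle(
        dwell_seconds=180, filter_tags=frozenset({"transmission"})),
    ("node", "core-rtr-01"): Lifecycle(auto_ticket=True, dwell_seconds=0),
}

core = lookup(rules, {"alarm_type": "Interface Down", "node": "core-rtr-01"})
edge = lookup(rules, {"alarm_type": "Interface Down", "node": "edge-sw-17"})
```

In this example an Interface Down alarm normally dwells for three minutes, but the same alarm on the hypothetical node `core-rtr-01` is ticketed immediately because the node-level rule overrides the alarm-type rule.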
About
DeployPartners and FirstOccurrence were founded and are run by people who develop, deploy
and operate network management systems. We understand this space.
For Products or Solution Sales mentioned in this document please contact
apetzer@deploypartners.com or visit www.DeployPartners.com.
For more information on this document or content contained within please contact our CTO
Dan Young at dyoung@eirteic.com
DeployPartners deliver high-quality Service Assurance Solutions
(SAS) expertise throughout the Asia Pacific region. We specialise in
sales, design, delivery, training and support of network and service
assurance products and solutions that meet specific business objectives
and technology standards of your enterprise. This Discussion Paper was
written by our CTO Dan Young.
www.DeployPartners.com.
FirstOccurrence offers a Knowledge Base that will deliver an Alarm
Management capability (AMDB) to your existing Manager of Managers
system. Our product Context provides unique analytics and knowledge
management capabilities. FirstOccurrence was founded by people who
develop, deploy and operate network management systems. We
understand this space.
www.firstoccurrence.com.