ACTIONABLE ALARM MANAGEMENT
Alarm (Event|Network|Service) Management is a big industry, yet so little is written on the
topic that we thought we would put together our ideas on what should become the main capabilities
of a modern Alarm Management system.
The goal of Alarm Management is to provide a nervous system for your network and IT infrastructure.
When something goes wrong, it is detected and an alarm is processed for operators to action. The
intention is to be notified before, or immediately when, something goes wrong, to minimise the
restoration time. These systems were deployed to collect everything that you would otherwise need
someone to sit and wait to observe. When networks were small, simple, less robust and less
redundant, and when there was plenty of headcount, this was okay, but these days operations have fewer
people for much bigger networks. No one wants to manage alarms anymore, and quite rightly. Alarms
are just a signal that can trigger a resolution before a customer notices or raises a
complaint. With enough effort, an Alarm Management system can allow organisations to rectify
service disruptions as soon as they appear, and maybe before.
Alarms are signals which you can choose to act upon or not. What we propose is a system that
scans historical data for trends and takes operations input to provide the most actionable alarms, so
that the business-impacting abnormalities are not lost in the noise.
State(ful) of the Art
Our view of the alarm world.
In the Real World
Real-time is too fast.
White Swan
Tomorrow is likely to be like yesterday.
Predictable alarms and why you should ignore them, just for a little while. Trust us.
Black Swan
Unknown unknowns. How to detect them and what to do next.
Lost Cause Analysis
Why Root Cause Analysis can become a lost cause, or just a weak excuse for bad Alarm
Management.
Auto Ticketing
What you need to check for before you trigger a truck roll.
Alarms to Action
Copyright FirstOccurrence 2012-3	 	 	 	 	 	 	 	 	 v1 Final
Alarms to Filters not Filters to Alarms.
Actionable
Alarms are meant to initiate resolution action, not to be hoarded and ignored.
Alarm Management Database
Next steps to improve alarm management
State(ful) of the Art
It’s our observation that Alarm Management in IT and
Telecommunications is in a state of decay. The number and variety of
alarms has increased exponentially as new systems are integrated, and
it is becoming impossible for operators to distinguish
critical events from nuisance alarms. We believe that Alarm
Management systems were initially functional and provided fantastic value, but
over time, as new alarm streams were added, no rationalisation
occurred. This happened because the cost of adding new alarms is low and the cost of reducing alarms
is high.
We see evidence of companies willing to throw away their investment because they no longer
believe that their Alarm Management system is adding value. Others want to move the alarms
to a ticketing system, since that process will actually force them to rationalise
alarms. While we push for automation wherever possible, we also think Alarm Management systems
have a big role in helping operations to trigger action (manual or automated), quickly
detect the real root cause and escalate.
In the Real World
Real-time alarming is a major selling point for most Alarm Management systems. In fact, it is an
expected and not very exciting capability these days, but we have come to question the need for it
from a real-world perspective. Our analysis of critical alarms frequently found that a vast majority
(80%) would resolve themselves within two or three minutes. To us, this means that the
same problems just keep on happening. We can find these alarms by looking at the Mean Time to
Resolution (MTTR).
We can use the MTTR metric as an input to suppress an alarm from presentation or further
action until the service has been given a chance to rectify itself. So if the MTTR is three minutes and the
alarm has not cleared after this period, the alarm becomes actionable.
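The MTTR-based suppression described above can be sketched in a few lines. This is a minimal illustration rather than a product implementation: the function name `presentation_state`, its parameters and the three state labels are assumptions introduced here, and the only rule it encodes is "hold the alarm back until it has outlived the MTTR".

```python
def presentation_state(alarm_age_s: float, cleared: bool, mttr_s: float) -> str:
    """Return 'cleared', 'suppressed' or 'actionable' for a single alarm.

    alarm_age_s: seconds since the alarm first occurred (hypothetical field)
    cleared:     True if the alarm has already resolved itself
    mttr_s:      historical mean time to resolution for this alarm, in seconds
    """
    if cleared:
        return "cleared"        # resolved itself; nothing to present
    if alarm_age_s < mttr_s:
        return "suppressed"     # still inside the self-resolution window
    return "actionable"         # outlived the MTTR without clearing


MTTR_S = 3 * 60  # the three-minute MTTR used in the example above

print(presentation_state(60, False, MTTR_S))
print(presentation_state(200, False, MTTR_S))
```

With a three-minute MTTR, a one-minute-old alarm stays suppressed, while an alarm that is still open after 200 seconds is handed to the operator.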
So why was real-time such a selling point at one stage? The idea was that you could
decrease the MTTR or improve service levels by showing faults to operators as fast as possible.
Most operators are usually juggling multiple faults at once, so real-time notifications are lost in
the noise. We propose that real-time alarm management is not required unless something
extraordinary occurs (a Black Swan).
White Swan
Many critical alarms are so common that they happen practically every day. We see these alarms all
the time. Maybe the Network Element is on a low-grade network, or it is suffering from a known
problem that operations cannot fix. The problem is that the alarm just keeps popping up, clogging
your event list and ticketing systems and, most of all, taking up precious attention. Sometimes these
alarms represent outages of only seconds that quickly rectify themselves. We call these White Swans:
predictable problems that recur frequently but aren't really actionable.
A White Swan is the predictable, the day-to-day, and we think
it should be ignored, at least for a little while. We suggest
operations look at the mean time to resolve (MTTR) for the
alarm type and, if it is capable of self-resolution, give it time before
triggering a response.
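As a rough sketch of how the per-alarm-type metrics could be derived from historical data, the snippet below computes MTTR and MTBF from a list of past alarms. The record layout `(alarm_type, raised_at, cleared_at)` and the function name are assumptions made for illustration, and MTBF is simplified here to the mean gap between consecutive raise times.

```python
from collections import defaultdict
from statistics import mean


def mttr_and_mtbf(history):
    """Compute (MTTR, MTBF) in seconds for each alarm type.

    history: iterable of (alarm_type, raised_at, cleared_at) tuples, times in
             seconds (hypothetical layout). MTBF is None when an alarm type
             has occurred fewer than two times.
    """
    by_type = defaultdict(list)
    for alarm_type, raised_at, cleared_at in history:
        by_type[alarm_type].append((raised_at, cleared_at))

    metrics = {}
    for alarm_type, events in by_type.items():
        events.sort()  # order occurrences by raise time
        mttr = mean(cleared - raised for raised, cleared in events)
        gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
        mtbf = mean(gaps) if gaps else None
        metrics[alarm_type] = (mttr, mtbf)
    return metrics


history = [
    ("Interface Down", 0, 60),
    ("Interface Down", 1000, 1120),
    ("Interface Down", 2000, 2180),
    ("Power Failure", 500, 4100),
]
metrics = mttr_and_mtbf(history)
print(metrics)
```

In this made-up history, "Interface Down" recurs often and clears within a couple of minutes, which is the White Swan pattern, while "Power Failure" has only been seen once and therefore has no MTBF.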
Black Swan
With alarms you never know what will happen next.
Networks live in the real world. As with life and stock markets, it’s next to
impossible to predict what will happen; however, it’s likely that
tomorrow will be largely like yesterday. Most alarm patterns will fall into a normal distribution,
without a huge amount of variety or volatility.
However, the biggest outage is likely to be a surprise: something that has never happened
before or that occurs very rarely. Unfortunately, with current systems it’s impossible to distinguish
the rudimentary from the extraordinary.
There are many techniques for finding the variety (colour) of the swan from your alarm patterns,
but the simplest is often the best. Based on our experience, we use the following rule
matrix:
Swan  | Metric                         | Indicates
White | Low Mean Time to Resolve       | Likely to resolve itself.
Grey  | High Mean Time to Resolve      | Unlikely to resolve itself.
White | Low Mean Time Between Failure  | Likely to resolve itself.
Grey  | High Mean Time Between Failure | Unlikely to resolve itself. Likely to have a bigger business impact.
Black | No MTBF (unknown)              | Unlikely to resolve itself. Likely to have a business impact.
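One possible reading of the rule matrix can be expressed as a small classifier. The thresholds `LOW_MTTR_S` and `LOW_MTBF_S` are illustrative assumptions, not values from this paper, and so is the way the rows are combined: an alarm is treated as white only when both metrics are low, and as grey otherwise.

```python
from typing import Optional

LOW_MTTR_S = 3 * 60        # assumed: clears within three minutes
LOW_MTBF_S = 24 * 60 * 60  # assumed: recurs at least once a day


def swan_colour(mttr_s: Optional[float], mtbf_s: Optional[float]) -> str:
    """Classify an alarm as a 'white', 'grey' or 'black' swan.

    mtbf_s is None when the alarm has no failure history (never seen before).
    """
    if mtbf_s is None:
        return "black"   # no MTBF: unknown, unlikely to resolve itself
    if mttr_s is not None and mttr_s <= LOW_MTTR_S and mtbf_s <= LOW_MTBF_S:
        return "white"   # frequent and quick to clear: likely to resolve itself
    return "grey"        # slow to clear or rare: unlikely to resolve itself


print(swan_colour(120, 3600))
print(swan_colour(3600, 3600))
print(swan_colour(120, None))
```

Other combinations are equally defensible, for example treating a low MTTR alone as sufficient for white; the point is only that the colour can be computed from two metrics the system already holds.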
Lost Cause Analysis
Alarm management systems are often accused of overwhelming
their users, resulting in missed outages or elevated MTTR. A
frequently cited resolution to this problem is “Root Cause
Analysis” (RCA). Unfortunately, identification of the true root cause is
nearly impossible with current monitoring levels, since an alarm is almost always a consequence or a
symptom of something else that is not monitored or measurable.
We propose a test: if the alarm can trigger an automated resolution action, it has truly found the
root cause.
An end-to-end root cause analysis is a worthy goal and is possible with significant investment.
For those without the scale to invest, we propose the simpler approach of giving human
operators the power to tune granular alarm life cycles. This approach will also reduce MTTR and sharpen
business focus by identifying actionable abnormalities using an Alarm Management Database (AMDB).
Auto Ticketing
Right now, if there is a common goal in Alarm Management, it is auto-ticketing. It’s as if the RCA
challenge has been overcome, or people have just given up trying to work with alarms. If your
network has alarms that trigger a response then they should be ticketed, but be careful not to just
move your problem of “too many alarms” to another interface: tickets.
So before triggering a truck roll, we suggest the following conditions should be met:
1. The alarm indicates a service disruption (defined by business attributes, transparently)
2. No other alarm indicating the same service disruption has already been detected and auto-ticketed
   a. Using Root Cause (or alarm clusters, site or hostname duplication detection)
   b. Duplicate detection
3. Alarm age (time from First Occurrence) is greater than the average MTTR for that specific
   alarm or an average for that alarm type
   a. Unless MTBF is unknown or very high (see White Swan and Black Swan above)
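The three conditions above can be sketched as a single gate in front of the ticketing system. The `Alarm` fields, the `HIGH_MTBF_S` threshold and the function name are hypothetical and introduced only for this example; how duplicates and service impact are actually detected is left outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_MTBF_S = 30 * 24 * 60 * 60  # assumed: "very high" means rarer than monthly


@dataclass
class Alarm:
    service_impacting: bool   # condition 1: indicates a service disruption
    duplicate: bool           # condition 2: same disruption already ticketed
    age_s: float              # time since First Occurrence, in seconds
    mttr_s: float             # average MTTR for this alarm or alarm type
    mtbf_s: Optional[float]   # None when the alarm has never been seen before


def should_auto_ticket(alarm: Alarm) -> bool:
    """Return True only when all three ticketing conditions are met."""
    if not alarm.service_impacting:
        return False
    if alarm.duplicate:
        return False
    # Condition 3a: rare or never-seen alarms skip the MTTR wait.
    rare = alarm.mtbf_s is None or alarm.mtbf_s >= HIGH_MTBF_S
    return rare or alarm.age_s > alarm.mttr_s


flapping_link = Alarm(True, False, age_s=60, mttr_s=180, mtbf_s=3600)
never_seen = Alarm(True, False, age_s=5, mttr_s=180, mtbf_s=None)
print(should_auto_ticket(flapping_link), should_auto_ticket(never_seen))
```

The flapping link is held back because it is younger than its MTTR, while the never-seen alarm is ticketed immediately because there is no history suggesting it will clear on its own.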
Alarms to Action
Most alarm management systems give the NOC or the operator the ability to filter alarms. The
process is generally ad hoc and results in a fairly large SQL filter. We regard this as filters matching
alarms:
    Filter > Alarms > Operator
We think alarms should instead be assigned to filters:
    Alarm > Filter > Operator
While this doesn’t sound like much of a difference, it is. In the old way it is very
difficult to measure how many alarms a filter is catching. Without measurement it is difficult to
improve, or to understand the daily fluctuations that an operator experiences. This means it is hard to
know whether the NOC could handle more or less.
Therefore we recommend alarm filter attribute tagging. Giving the NOC the ability to tag
alarm types (e.g., Interface Down) with one or more filters means you
can:
• Measure volume
• Model future volume with new alarm sources
• Provide easier filtering options
• Find easy targets for automation
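A minimal sketch of why tagging makes volume measurable: once each alarm type carries its filter tags, counting alarms per filter is a simple lookup. The tag names, the `FILTER_TAGS` mapping and the "untagged" bucket are made up for this example.

```python
from collections import Counter

# Hypothetical tagging maintained by the NOC: alarm type -> filter tags.
FILTER_TAGS = {
    "Interface Down": ["transmission", "tier1"],
    "High CPU": ["servers"],
}


def volume_per_filter(alarm_types, filter_tags=FILTER_TAGS):
    """Count how many alarms each filter would present to its operators."""
    volume = Counter()
    for alarm_type in alarm_types:
        # Alarm types nobody has tagged yet are counted separately.
        for tag in filter_tags.get(alarm_type, ["untagged"]):
            volume[tag] += 1
    return volume


todays_alarms = ["Interface Down", "Interface Down", "High CPU", "Fan Failure"]
volume = volume_per_filter(todays_alarms)
print(dict(volume))
```

The same function can be run against the expected alarm types of a new source before it is integrated, which gives a rough model of the extra load each filter would receive.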
Most Actionable
Root Cause looks for the probable source of the fault. Most Actionable is the process of
determining what faults are the most actionable according to:
• Service Impact
• Historic performance
We define an actionable alarm as an outage that impacts a service and/or requires manual
intervention to resolve. Fully automated responses are not actionable unless the automated process ends
without providing the resolution outcome. Not all alarms require operator intervention, and our
analyses have found that a majority of alarms resolve themselves in such a short time
that manual intervention wouldn’t be possible. To provide operations focus, we suggest the following
criteria that every alarm must pass before being presented or ticketed. The answer must be yes to the
following questions to warrant the raising of a ticket:
• Does the alarm indicate an outage that could impact the business operation?
• Can the alarm or outage resolve itself?
• Is the alarm out of the ordinary (i.e., has it never happened, or does it rarely happen)?
• Has the alarm gone past the point of resolving itself?
Most Actionable is compatible with RCA, as ideally the output from RCA is the Most Actionable alarm.
However, Most Actionable can still be used without it. For example, if someone makes a change on
a device, we receive a Configuration Change alarm, and then the EMS detects that this was not an authorised
change and generates an “Unauthorised change” alarm. The root cause is the change event, and
arguably the Configuration Change alarm is the closest thing to the root cause of the unauthorised change. So for
this example the NOC would correctly identify the “Unauthorised change” alarm as being
actionable and present it to operations.
Introducing an Alarm Management Database
An Alarm Management Database (AMDB) is a system that gives operations the ability to define alarm
life cycles. An AMDB provides visibility of what is being managed and enables users to quickly
adapt alarm behaviour to changing network or customer needs.
We propose that an AMDB has the following key capabilities:
• Provide MTTR and MTBF metrics
• Use these values to suppress events according to the probability of “self resolution”
• Quickly display alarm scenarios that have never occurred previously
• Provide an interface to assign alarm lifecycle attributes:
    • Actionable
    • AutoTicket
    • Dwell
    • FilterTags
• Provide an interface to assign business attributes
• Configurable granularity to AlarmType, Node, Location and Unique Alarm Identifier
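To make the capability list more concrete, the sketch below models an AMDB entry and a lookup with configurable granularity. It is an assumption-laden illustration: the `Lifecycle` field names mirror the lifecycle attributes listed above, but the dictionary layout, the `lookup` function and the "most specific level wins" ordering are choices made for this example, not a description of any product.

```python
from dataclasses import dataclass, field

# Assumed precedence, most specific first.
GRANULARITY = ("alarm_id", "node", "location", "alarm_type")


@dataclass
class Lifecycle:
    actionable: bool = True
    auto_ticket: bool = False
    dwell_s: int = 0                          # seconds to wait before presenting
    filter_tags: list = field(default_factory=list)


def lookup(amdb: dict, alarm: dict) -> Lifecycle:
    """Return the most specific lifecycle entry that matches the alarm."""
    for level in GRANULARITY:
        key = (level, alarm.get(level))
        if key in amdb:
            return amdb[key]
    return Lifecycle()  # nothing configured: fall back to defaults


amdb = {
    ("alarm_type", "Interface Down"): Lifecycle(dwell_s=180, filter_tags=["transmission"]),
    ("node", "core-rtr-1"): Lifecycle(auto_ticket=True),
}

core = lookup(amdb, {"alarm_type": "Interface Down", "node": "core-rtr-1"})
edge = lookup(amdb, {"alarm_type": "Interface Down", "node": "edge-sw-9"})
print(core.auto_ticket, edge.dwell_s)
```

Here the alarm-type entry gives every "Interface Down" a three-minute dwell, while the node-level entry overrides it for one core router so that its alarms are ticketed without waiting.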
About
DeployPartners and FirstOccurrence were founded and are run by people who develop, deploy
and operate network management systems. We understand this space.
For Products or Solution Sales mentioned in this document please contact
apetzer@deploypartners.com or visit www.DeployPartners.com.
For more information on this document or content contained within please contact our CTO
Dan Young at dyoung@eirteic.com
DeployPartners deliver high-quality Service Assurance Solutions
(SAS) expertise throughout the Asia Pacific region. We specialise in
sales, design, delivery, training and support of network and service
assurance products and solutions that meet specific business objectives
and technology standards of your enterprise. This Discussion Paper was
written by our CTO Dan Young.
www.DeployPartners.com.
FirstOccurrence offers a Knowledge Base that will deliver an Alarm
Management capability (AMDB) to your existing Manager of Managers
system. Our product Context provides unique analytics and knowledge
management capabilities. FirstOccurrence was founded by people who
develop, deploy and operate network management systems. We
understand this space.
www.firstoccurrence.com.
Copyright FirstOccurrence 2012-3	 	 	 	 	 	 	 	 	 v1 Final

Contenu connexe

Tendances

Best Practices in Major Incident Management
Best Practices in Major Incident ManagementBest Practices in Major Incident Management
Best Practices in Major Incident ManagementxMatters Inc
 
Network operations center best practices (3)
Network operations center best practices (3)Network operations center best practices (3)
Network operations center best practices (3)Gabby Nizri
 
Major Incident Management Trends: 2016 Survey Report
Major Incident Management Trends: 2016 Survey ReportMajor Incident Management Trends: 2016 Survey Report
Major Incident Management Trends: 2016 Survey ReportxMatters Inc
 
Getting the Most Value from VM and Compliance Programs white paper
Getting the Most Value from VM and Compliance Programs white paperGetting the Most Value from VM and Compliance Programs white paper
Getting the Most Value from VM and Compliance Programs white paperTawnia Beckwith
 
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...VAST
 
5 Traits of a Proactive Guard Tour System
5 Traits of a Proactive Guard Tour System5 Traits of a Proactive Guard Tour System
5 Traits of a Proactive Guard Tour System24/7 Software
 

Tendances (7)

Best Practices in Major Incident Management
Best Practices in Major Incident ManagementBest Practices in Major Incident Management
Best Practices in Major Incident Management
 
Network operations center best practices (3)
Network operations center best practices (3)Network operations center best practices (3)
Network operations center best practices (3)
 
Major Incident Management Trends: 2016 Survey Report
Major Incident Management Trends: 2016 Survey ReportMajor Incident Management Trends: 2016 Survey Report
Major Incident Management Trends: 2016 Survey Report
 
Failure Reporting Process Map
Failure Reporting Process MapFailure Reporting Process Map
Failure Reporting Process Map
 
Getting the Most Value from VM and Compliance Programs white paper
Getting the Most Value from VM and Compliance Programs white paperGetting the Most Value from VM and Compliance Programs white paper
Getting the Most Value from VM and Compliance Programs white paper
 
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
 
5 Traits of a Proactive Guard Tour System
5 Traits of a Proactive Guard Tour System5 Traits of a Proactive Guard Tour System
5 Traits of a Proactive Guard Tour System
 

Similaire à Actionable Alarm Management

Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)Siglos
 
Flo-Tech E-book-Avoiding Device Failure
Flo-Tech E-book-Avoiding Device FailureFlo-Tech E-book-Avoiding Device Failure
Flo-Tech E-book-Avoiding Device FailureThomas Clifford
 
Automatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsAutomatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsJan Henry Nystrom
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call Raygun
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Emerging technologies enabling in fraud detection
Emerging technologies enabling in fraud detectionEmerging technologies enabling in fraud detection
Emerging technologies enabling in fraud detectionUmasree Raghunath
 
Normal accidents and outpatient surgeries
Normal accidents and outpatient surgeriesNormal accidents and outpatient surgeries
Normal accidents and outpatient surgeriesJonathan Creasy
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentMike Julian
 
White paper - Actionable Alarming - Wonderware-Schneider Electric
White paper - Actionable Alarming - Wonderware-Schneider ElectricWhite paper - Actionable Alarming - Wonderware-Schneider Electric
White paper - Actionable Alarming - Wonderware-Schneider ElectricSuman Singh
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Monitoring Complex Systems - Chicago Erlang, 2014
Monitoring Complex Systems - Chicago Erlang, 2014Monitoring Complex Systems - Chicago Erlang, 2014
Monitoring Complex Systems - Chicago Erlang, 2014Brian Troutwine
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
A People's History of Microservices
A People's History of MicroservicesA People's History of Microservices
A People's History of MicroservicesCamille Fournier
 
Results before and after dynamic alarm mgt emerson
Results before and after dynamic alarm mgt   emersonResults before and after dynamic alarm mgt   emerson
Results before and after dynamic alarm mgt emersonMary Claire Simoneaux
 
Saving One Network At a Time
Saving One Network At a TimeSaving One Network At a Time
Saving One Network At a TimeJeffrey Ong
 

Similaire à Actionable Alarm Management (20)

Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
Flo-Tech E-book-Avoiding Device Failure
Flo-Tech E-book-Avoiding Device FailureFlo-Tech E-book-Avoiding Device Failure
Flo-Tech E-book-Avoiding Device Failure
 
Automatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsAutomatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang Applications
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Emerging technologies enabling in fraud detection
Emerging technologies enabling in fraud detectionEmerging technologies enabling in fraud detection
Emerging technologies enabling in fraud detection
 
Normal accidents and outpatient surgeries
Normal accidents and outpatient surgeriesNormal accidents and outpatient surgeries
Normal accidents and outpatient surgeries
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring Environment
 
Alarm Management_NKS
Alarm Management_NKSAlarm Management_NKS
Alarm Management_NKS
 
White paper - Actionable Alarming - Wonderware-Schneider Electric
White paper - Actionable Alarming - Wonderware-Schneider ElectricWhite paper - Actionable Alarming - Wonderware-Schneider Electric
White paper - Actionable Alarming - Wonderware-Schneider Electric
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Monitoring Complex Systems - Chicago Erlang, 2014
Monitoring Complex Systems - Chicago Erlang, 2014Monitoring Complex Systems - Chicago Erlang, 2014
Monitoring Complex Systems - Chicago Erlang, 2014
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
Agents vs Agentless
Agents vs AgentlessAgents vs Agentless
Agents vs Agentless
 
A People's History of Microservices
A People's History of MicroservicesA People's History of Microservices
A People's History of Microservices
 
Results before and after dynamic alarm mgt emerson
Results before and after dynamic alarm mgt   emersonResults before and after dynamic alarm mgt   emerson
Results before and after dynamic alarm mgt emerson
 
Ayehu eyeShare Overview
Ayehu eyeShare OverviewAyehu eyeShare Overview
Ayehu eyeShare Overview
 
Saving One Network At a Time
Saving One Network At a TimeSaving One Network At a Time
Saving One Network At a Time
 

Dernier

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Dernier (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

Actionable Alarm Management

  • 1. ACTIONABLE ALARM MANAGEMENT

Alarm (Event|Network|Service) Management is a big industry, but so little is written on the topic that we thought we would put together our ideas on what should become the main capabilities of a modern Alarm Management system.

Alarm Management’s goal is to provide a nervous system for your network and IT infrastructure. When something goes wrong, it is detected and an alarm is processed for operators to action. The intention is to be notified before, or immediately when, something goes wrong, to minimise the restoration time. The systems were deployed to collect everything that you would otherwise get someone to sit and wait to observe happening. When networks were small, simple, less robust and less redundant, and when there was plenty of headcount, this was okay, but these days operations have fewer people for much bigger networks. No one wants to manage alarms anymore, and quite rightly. Alarms are just a signal that can trigger a resolution quicker than a customer notices or generates a complaint. Alarm Management is a system that provides organisations with a nervous system that, with enough effort, can allow them to rectify service disruptions as soon as they appear, and maybe before.

Alarms are signals which you can choose to act upon or not. What we propose is a system that scans historical data for trends and takes operations input to provide the most actionable alarms, so that the business-impacting abnormalities are not lost in the noise.

State(ful) of the Art
Our view of the alarm world.
In the Real World
Real-time is too fast.
White Swan
Tomorrow is likely to be like yesterday. Predictable alarms and why you should ignore them, just for a little while. Trust us.
Black Swan
Unknown unknowns. How to detect them and what to do next.
Lost Cause Analysis
Why Root Cause Analysis can become a lost cause, or just a weak excuse for bad Alarm Management.
Auto Ticketing
What you need to check for before you trigger a truck roll.
Alarms to Action
Copyright FirstOccurrence 2012-3 v1 Final
  • 2. Alarms to Filters not Filters to Alarms. Actionable Alarms are meant to initiate resolution action, not to be hoarded and ignored.
Alarm Management Database
Next steps to improve alarm management.

State(ful) of the Art
It’s our observation that Alarm Management in IT and Telecommunications is in a state of decay. The number and variety of alarms has increased exponentially as new systems are integrated, and it is becoming increasingly impossible for operators to distinguish critical events from nuisance alarms. We believe that the Alarm Management systems were initially functional and provided fantastic value but that, over time, as new alarm streams were added, no rationalisation occurred. This happened because the cost of adding new alarms is low and the cost of reducing alarms is high.

We see evidence of companies willing to throw away their investment because they no longer believe that their Alarm Management system is adding value. Or they want to move the alarms to a Ticketing system, since that process is what will actually force them to rationalise alarms. While we push for automation wherever possible, we also think Alarm Management systems have a big role in helping operations trigger action (manual or automated) and in quickly enabling them to detect the real root cause and escalate.

In the Real World
Real-time alarming is a major selling point for most Alarm Management systems. In fact, it is an expected and not very exciting capability these days, but we have come to question the need for it from a real-world perspective. Our analysis of critical alarms frequently found that a vast majority (80%) would actually resolve themselves within two or three minutes, which means to us that the same problems just keep on happening. We can find these alarms by looking at the Mean Time to Resolution (MTTR). We can use the MTTR metric as an input to suppress an alarm from presentation or further actions until the service is given a chance to rectify itself.
So if the MTTR is three minutes and the alarm has not cleared after this period, then make the alarm actionable.

So why was real-time at one stage such a selling point? Well, the idea was that you could decrease the MTTR or improve service levels by showing faults to operators as fast as possible. In practice, most operators are juggling multiple faults at once, and real-time notifications are lost in the noise. We propose that real-time alarm management is not required unless something extraordinary occurs (aka a Black Swan).
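The dwell rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alarm` class, its field names and the `is_actionable` function are hypothetical and are not taken from any particular Alarm Management product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    # Hypothetical alarm record; field names are assumptions for this sketch.
    alarm_type: str
    first_occurrence: datetime
    cleared_at: Optional[datetime] = None  # None while the alarm is still active


def is_actionable(alarm: Alarm, mttr: timedelta, now: datetime) -> bool:
    """Hold an alarm back until it has outlived the MTTR for its type."""
    if alarm.cleared_at is not None:
        return False  # the service rectified itself; nothing to action
    return now - alarm.first_occurrence > mttr


start = datetime(2024, 1, 1, 12, 0, 0)
alarm = Alarm("Interface Down", first_occurrence=start)
mttr = timedelta(minutes=3)

# Two minutes in, the alarm is still dwelling; four minutes in, it is presented.
print(is_actionable(alarm, mttr, start + timedelta(minutes=2)))
print(is_actionable(alarm, mttr, start + timedelta(minutes=4)))
```

A fixed three-minute MTTR is used here to match the example in the text; in practice the value would presumably come from historical data per alarm type.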
  • 3. White Swan
Many critical alarms are so common they happen practically every day. We see these alarms all the time. Maybe the Network Element is on a low-grade network, or it is suffering from a known problem that operations cannot assist with. The problem is that this alarm just keeps popping up, clogging your event list and ticketing systems and, most of all, taking up precious attention. Sometimes these alarms have outages lasting seconds and quickly rectify themselves. We call these White Swans: predictable problems that recur frequently but aren't really actionable. A White Swan is the predictable and the day-to-day, and we think they should be ignored, at least for a little while. We suggest operations look at the mean time to resolve (MTTR) for the alarm type and, if the alarm is capable of self-resolution, give it time before triggering a response.

Black Swan
With alarms you will never know what will happen next. Networks live in the real world and, as with life and stock markets, it’s next to impossible to predict what will happen; however, it’s likely that tomorrow will be largely like yesterday. Most alarm patterns will fall into a normal distribution where there isn’t a huge amount of variety or volatility. However, it’s likely the biggest outage will be a surprise: something that has never happened before or that very rarely occurs. Unfortunately, with current systems it’s impossible to distinguish between the rudimentary and the extraordinary.

There are many techniques to find the variety (colour) of the swan from your alarm patterns; however, the simplest is often the best. Based on our experience, we have found the following rule matrix:

Swan    Metric                            Indicates
White   Low Mean Time to Resolve          Likely to resolve itself.
Grey    High Mean Time to Resolve         Unlikely to resolve itself.
White   Low Mean Time Between Failure     Likely to resolve itself.
Grey    High Mean Time Between Failure    Unlikely to resolve itself. Likely to have a bigger business impact.
Black   No MTBF (unknown)                 Unlikely to resolve itself. Likely to have a business impact.

Lost Cause Analysis
Alarm management systems are often accused of overwhelming their users, resulting in missed outages or elevated MTTR. A frequently cited resolution to this problem is “Root Cause Analysis” (RCA). Unfortunately, identification of the true root cause is nearly impossible with current monitoring levels, since an alarm is almost always a consequence or a symptom of something else that is not monitored or measurable.

  • 4. We propose a test: if the alarm can trigger an automated resolution action, it has truly found the root cause. An end-to-end root cause analysis is a worthy goal and, while possible with significant investment, for those without the scale to invest we propose the simpler approach of giving human operators the power to tune granular alarm life cycles. This approach will also reduce MTTR and sharpen business focus by identifying actionable abnormalities using an Alarm Management Database (AMDB).

Auto Ticketing
Right now, if there is a common goal in Alarm Management, it is auto-ticketing. It’s as if the RCA challenge has been overcome, or people have just given up trying to work with alarms. If your network has alarms that trigger a response then they should be ticketed, but be careful not to just move your problem of “too many alarms” to another interface: tickets. So before triggering a truck roll, we suggest the following conditions should be met:

1. Alarm indicates a service disruption (defined by business attributes, transparently)
2. Another alarm that indicates the same service disruption cannot be detected
   a. Using Root Cause (or alarm clusters, site or hostname duplication detection)
   b. Duplicate detection
3. Alarm age (time from First Occurrence) is greater than the average MTTR for that specific alarm, or an average for that alarm type
   a. Unless MTBF is unknown or very high (see White/Black Swan above)

Alarms to Action
Most alarm management systems give the NOC or the operator the ability to filter alarms. The process is generally ad hoc and results in a fairly large SQL filter. We regard this as filters matching alarms:

    Filter > Alarms > Operator

We think instead alarms should be assigned to filters:

    Alarm > Filter > Operator

While this doesn’t sound like much of a difference, it is. In the old way it is very difficult to measure how many alarms a filter is catching. Without being able to measure, it is difficult to improve or to understand the daily fluctuations that an operator experiences. This means it is hard to know whether the NOC could handle more or less.
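To illustrate why assigning alarms to filters makes volume measurable, here is a minimal Python sketch. The mapping, the filter names and the `volume_per_filter` function are all hypothetical; the point is only that once each alarm type carries its filter assignments, counting what each filter catches becomes a simple lookup.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical assignment of alarm types to named filters (Alarm > Filter).
FILTERS_BY_ALARM_TYPE: Dict[str, List[str]] = {
    "Interface Down": ["core-network", "tier1-noc"],
    "High CPU": ["servers"],
}


def volume_per_filter(alarm_types: Iterable[str]) -> Counter:
    """Count how many alarms land in each filter."""
    counts: Counter = Counter()
    for alarm_type in alarm_types:
        # Alarm types with no assignment are counted under "unassigned".
        for name in FILTERS_BY_ALARM_TYPE.get(alarm_type, ["unassigned"]):
            counts[name] += 1
    return counts


stream = ["Interface Down", "High CPU", "Interface Down", "Fan Failure"]
counts = volume_per_filter(stream)
for name, count in sorted(counts.items()):
    print(name, count)
```

With an ad-hoc SQL filter the same question would require running every filter against the alarm table; here the per-filter volume falls out of the assignment itself, including a count of alarms nobody has claimed.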
  • 5. Therefore we recommend alarm filter attribute tagging. Giving the NOC the ability to tag alarm types (e.g., Interface Down) with one or more filters, and to assign each filter to a tag, means you can:
• Measure volume
• Model future volume with new alarm sources
• Provide easier filtering options
• Find easy targets for automation

Most Actionable
Root Cause looks for the probable source of the fault. Most Actionable is the process of determining which faults are the most actionable according to:
• Service Impact
• Historic performance

We define an actionable alarm as an outage that impacts a service and/or requires manual intervention to resolve. A fully automated response is not actionable until that process ends and fails to provide the resolution outcome. Not all alarms require operator intervention, and our analyses have found that a majority of alarms will resolve themselves in a short enough time that manual intervention wouldn’t be possible. To provide operations focus, we suggest the following criteria that every alarm must pass before being presented or ticketed. The answer must be yes to the following questions to warrant the raising of a ticket:
• Does the alarm indicate an outage that could impact the business operation?
• Can the alarm/outage resolve itself?
• Is the alarm out of the ordinary (i.e., has it never happened, or does it rarely happen)?
• Has the alarm gone past the point of resolving itself?

Most Actionable is compatible with RCA, as ideally the output from RCA is Most Actionable. However, Most Actionable can still be used without it. For example, if someone makes a change on a device, we receive a Configuration Change alarm; the EMS then detects that this was an unauthorised change and generates an “Un-authorised change” alarm. The root cause is the Change event, but arguably the change alarm is the closest thing to the root cause of the Unauthorised change. So for this example the NOC would have correctly identified the “Un-authorised change” alarm as being actionable and to be presented to operations.
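The presentation criteria listed above can be sketched as a single gate function. This is one possible reading, not the authors' implementation: the text asks for a yes to every question, whereas this sketch assumes that an out-of-the-ordinary alarm skips the wait (mirroring the "unless MTBF is unknown" exception under Auto Ticketing) and that an ordinary alarm must first outlive its MTTR. The `AlarmContext` fields and the `RARE_THRESHOLD` value are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class AlarmContext:
    # Hypothetical inputs; names are assumptions made for this sketch.
    impacts_service: bool       # business attribute, set by operations
    age: timedelta              # time since First Occurrence
    mttr: Optional[timedelta]   # None when this alarm type has no history
    occurrences: int            # how often this alarm has been seen before


# Assumption: "rarely happens" means fewer than three past occurrences.
RARE_THRESHOLD = 3


def should_present(ctx: AlarmContext) -> bool:
    """Decide whether an alarm is presented to operations or ticketed."""
    if not ctx.impacts_service:
        return False  # no business impact, so not actionable
    out_of_ordinary = ctx.mttr is None or ctx.occurrences < RARE_THRESHOLD
    if out_of_ordinary:
        return True  # never or rarely seen: do not wait for self-resolution
    return ctx.age > ctx.mttr  # ordinary alarm: present once past its MTTR


routine = AlarmContext(True, timedelta(minutes=1), timedelta(minutes=3), 50)
stuck = AlarmContext(True, timedelta(minutes=5), timedelta(minutes=3), 50)
novel = AlarmContext(True, timedelta(seconds=10), None, 0)

print(should_present(routine), should_present(stuck), should_present(novel))
```

Keeping the business-impact check first means that alarms with no service impact never reach the history lookup, which is the cheapest way to cut volume.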
  • 6. Introducing an Alarm Management Database
An Alarm Management Database (AMDB) is a system that gives operations the ability to define alarm life cycles. An AMDB will provide visibility of what is being managed and enable users to quickly adapt alarm behaviour to changing network or customer needs. We propose that an AMDB has the following key capabilities:
• Provide MTTR and MTBF metrics
• Embed these values to suppress events according to the probability of “self resolution”
• Quickly display Alarm Scenarios that have never occurred previously
• Provide an interface to assign alarm lifecycle attributes
    • Actionable
    • AutoTicket
    • Dwell
    • FilterTags
• Provide an interface to assign business attributes
• Configurable granularity to AlarmType, Node, Location and Unique Alarm Identifier
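As a rough illustration of the first three capabilities, the sketch below derives MTTR and MTBF from an alarm's history and maps them to a swan colour following the rule matrix given earlier. Several things here are assumptions rather than part of the paper: MTBF is measured between successive raise times, the "low"/"high" thresholds are arbitrary parameters, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional, Tuple

Occurrence = Tuple[datetime, datetime]  # (raised, cleared)


def mttr(history: List[Occurrence]) -> Optional[timedelta]:
    """Mean time to resolve; None if the alarm has never occurred."""
    if not history:
        return None
    durations = [(cleared - raised).total_seconds() for raised, cleared in history]
    return timedelta(seconds=mean(durations))


def mtbf(history: List[Occurrence]) -> Optional[timedelta]:
    """Mean time between successive raise times; None with fewer than two."""
    if len(history) < 2:
        return None
    raised = sorted(r for r, _ in history)
    gaps = [(b - a).total_seconds() for a, b in zip(raised, raised[1:])]
    return timedelta(seconds=mean(gaps))


def swan_colour(history: List[Occurrence],
                mttr_limit: timedelta,
                mtbf_limit: timedelta) -> str:
    """Classify an alarm as white, grey or black using assumed thresholds."""
    resolve, between = mttr(history), mtbf(history)
    if resolve is None or between is None:
        return "black"  # no MTBF: never, or only once, seen before
    if resolve > mttr_limit or between > mtbf_limit:
        return "grey"   # slow to resolve or infrequent
    return "white"      # frequent and quick to resolve itself


start = datetime(2024, 1, 1)
# Three two-minute outages, one hour apart.
history = [(start + timedelta(hours=h), start + timedelta(hours=h, minutes=2))
           for h in range(3)]

print(mttr(history), mtbf(history))
print(swan_colour(history, timedelta(minutes=5), timedelta(days=1)))
print(swan_colour([], timedelta(minutes=5), timedelta(days=1)))
```

The same history could be keyed by AlarmType, Node, Location or Unique Alarm Identifier to give the configurable granularity listed above.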
  • 7. About
DeployPartners and FirstOccurrence were founded and are run by people who develop, deploy and operate network management systems. We understand this space. For Products or Solution Sales mentioned in this document, please contact apetzer@deploypartners.com or visit www.DeployPartners.com. For more information on this document or the content contained within, please contact our CTO Dan Young at dyoung@eirteic.com.

DeployPartners deliver high-quality Service Assurance Solutions (SAS) expertise throughout the Asia Pacific region. We specialise in sales, design, delivery, training and support of network and service assurance products and solutions that meet the specific business objectives and technology standards of your enterprise. This Discussion Paper was written by our CTO Dan Young. www.DeployPartners.com.

FirstOccurrence offers a Knowledge Base that will deliver an Alarm Management capability (AMDB) to your existing Manager of Managers system. Our product Context provides unique analytics and knowledge management capabilities. FirstOccurrence was founded by people who develop, deploy and operate network management systems. We understand this space. www.firstoccurrence.com.