The elusive root cause
1. The Elusive Root Cause Of IT Problems
And How To Easily Identify It
Noam Biran
Director of Product Management
2. Introduction
Mr. Biran
• Director of Product Management at Neebula
• 20 years of experience in systems management & BSM
• Innovation Product Management at BMC
• Co-founder of Appilog (now HP uCMDB & DDMA)
About Neebula
Neebula provides the first and only automatic service-centric IT management
solution allowing IT organizations to improve the service provided to the business
by shifting from managing disparate technology silos to managing the services
running in the data center. Leveraging unique technology that automatically maps
business services to the underlying infrastructure, Neebula enables the IT team to
increase availability of the main services they manage and reduce the time to
repair of problems.
3. Agenda
• Introduction
• Root cause analysis defined
• The problem resolution process
• Problem detection
• Root cause analysis methods
• Improving root cause analysis processes
4. Root Cause Analysis Definition
ITIL V3
An Activity that identifies the Root Cause of
an Incident or Problem.
Root Cause Analysis typically concentrates on
IT Infrastructure failures.
Wikipedia
Root Cause Analysis is any structured
approach to identify the factors that resulted
in the harmful consequences of one or more
past events
5. The importance of Root Cause Analysis
• Root Cause Analysis has a high impact on
– IT processes
• The efficiency of the overall incident/problem
management process
• Good RCA discipline requires well established
configuration management
– Organizational goals
• Meeting internal and external SLAs
• Financial (budget & revenue) implications
• Brand / customer loyalty
7. The Critical Role of Root Cause Analysis
• Improper (or lack of) identification of the real
root cause may yield:
– Repeating problems
– Increased downtime
– Waste of human resources on “fixing” the wrong issues
– Risk to the business
8. The Life of The Operator
We expect the operator to
– Handle thousands of cryptic events
– Understand the impact on hundreds of services
– Understand the correlation to
customer service complaints
– Understand what changed
– Orchestrate the resolution
And make these decisions within minutes to
reduce MTTR
Are we giving our operators the tools to
succeed?
10. Problem Resolution Process
• Events come in to the NOC
• NOC performs some investigation
• Root cause analysis is shared between NOC
& 2nd/3rd level support (admins)
• Low level diagnostics & problem resolution
is done by 2nd/3rd level support (admins)
11. Involved Parties & Tools
• Tools
– Monitoring tools
– Configuration management tools
• People
– Users
– NOC
– Admins – specialized teams focused on a specific
area, e.g. system, database, network
– Application support / developers
12. The Common Process – Blame Game
• No structured process
• Lack of overall cross-domain view
• Each team has its own terminology and view
• Each team is working on its own
14. Potential Problem Symptoms
• Lack of certain functionality
– A certain transaction does not work
• Performance degradation
– Fund transfer response time is above 2 sec.
• Availability issue
– Application doesn’t work
• None
– Unnoticeable failure due to high availability
configuration
15. Problem Detection
• Good problem detection methods are key for a
structured root cause analysis process
• Problem detection tools should provide sufficient
data to the root cause analysis process
• There are various distinct methods, each with its
own pros and cons
• There is no single superior detection method
16. Detection – Users
• What it does
– Compensates for unknown / unreported
problems
• What it doesn’t
– Supposedly accurate, but might actually point in
the wrong direction
– Usually takes place too late for a quick fix
and after the business is already impacted
17. Detection – Infrastructure Monitoring
• What it does
– Monitor each technical element
comprising the service
– Great way to identify
specific availability failures
• What it doesn’t
– Hard to correlate with real user experience
– Too many false positives
– Lots of events on symptoms rather than on the actual problem
18. Detection – End User Experience
• What it does
– Measure overall response time of user transactions
– Synthetic or real user transactions
– The ultimate problem detection method
• What it doesn’t
– No real breakdown to assist
in pinpointing the problem
or even the domain
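A minimal sketch of what an end user experience check measures, assuming a synthetic transaction is just a callable. The names `timed_call` and `check_response_time` are hypothetical, and the 2-second default threshold is borrowed from the fund transfer example on the symptoms slide. Note that the result is a single end-to-end number: it says that something is slow, not where.

```python
import time


def timed_call(transaction, clock=time.perf_counter):
    """Run a synthetic transaction and return its total elapsed seconds.

    `transaction` is any zero-argument callable (hypothetical stand-in for
    a scripted user action). `clock` is injectable so the function can be
    exercised without real waiting.
    """
    start = clock()
    transaction()
    return clock() - start


def check_response_time(name, elapsed_seconds, threshold_seconds=2.0):
    """Flag a transaction whose end-to-end time exceeds the threshold."""
    return {
        "transaction": name,
        "elapsed": elapsed_seconds,
        "breach": elapsed_seconds > threshold_seconds,
    }


if __name__ == "__main__":
    # Stand-in transaction; a real probe would drive the application itself.
    elapsed = timed_call(lambda: time.sleep(0.1))
    print(check_response_time("fund-transfer", elapsed))
```

The check reports only the overall time, which is why this detection method needs to be combined with others to locate the failing domain.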
19. Detection – Transaction Breakdown
• What it does
– Discovery of each transaction’s path
within the data center
– Highlight potential performance
problems within the transaction
execution
• What it doesn’t
– No correlation to infrastructure
monitoring
– Cannot cover the entire data center; domain specific
20. Detection – Domain Specific Tools
• What it does
– Drill down in a specific application
– Great analysis & diagnostics within an application
• What it doesn’t
– No data center wide view
– Lack of insight into the
connections between
applications
23. Potential Root Cause Types
• Configuration change
• Version upgrade
• Hardware fault
• Software bug
• Capacity problem
• Resource collision
24. Common Ways for Root Cause Analysis
• War room scenario
• The log file approach
• APM tools
• Transaction management
• Manual event correlation / analysis
25. War Room Scenario
• Getting everyone in the same room
• Each participant has their own data and terminology
• Blame game
• Takes a lot of time
26. The Log File Approach
• An admin sits and analyzes log files and
other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
are not the root cause
(distractions)
27. APM Tools
• An admin sits and analyzes log files and
other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
are not the root cause
(distractions)
28. Transaction Management
• A great tool to point to the probable area
where the root cause resides
• Limited to specific domains
• Inability to correlate with infrastructure
metrics / failures
29. Manual Event Correlation / Analysis
• Requires cross-domain expertise
• Requires understanding of dependencies
between components
• Time consuming
• Lack of insight into other
non-event data
31. Making The Most Of Existing Tools
• Choose problem detection methods that
assist in the root cause analysis process
• Turn the root cause analysis into a
structured process
– Internal team processes
– Inter-team processes
• Common language & visibility between
teams
32. New Methods: Mapping
• Mapping of business services & applications
to the supporting infrastructure
• Ties symptoms (user) to problems
(technology)
• Introduces a common language between
teams
• Enables a high level cross-domain view
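As a rough illustration of how a service map ties user-facing symptoms to technology, the sketch below assumes the map is a plain dictionary from business service to the components it depends on. The service names, host names, and function names are all invented for the example; they are not taken from any product.

```python
from collections import defaultdict

# Hypothetical service map: business service -> supporting components.
# Every name here is made up for illustration.
SERVICE_MAP = {
    "fund-transfer": ["web-01", "app-01", "db-01"],
    "online-statements": ["web-01", "app-02", "db-01"],
    "payroll": ["app-03", "db-02"],
}


def build_reverse_index(service_map):
    """Return a component -> set of dependent services index."""
    index = defaultdict(set)
    for service, components in service_map.items():
        for component in components:
            index[component].add(service)
    return index


def impacted_services(service_map, failed_component):
    """Technology -> business: which services does a failed component affect?"""
    index = build_reverse_index(service_map)
    return sorted(index.get(failed_component, set()))


def candidate_components(service_map, affected_services):
    """Business -> technology: components shared by all affected services."""
    component_sets = [set(service_map[s]) for s in affected_services]
    if not component_sets:
        return []
    return sorted(set.intersection(*component_sets))


if __name__ == "__main__":
    print(impacted_services(SERVICE_MAP, "db-01"))
    print(candidate_components(SERVICE_MAP, ["fund-transfer", "online-statements"]))
```

Both lookups run against the same map, which is what gives the NOC and the admin teams a shared vocabulary: the NOC starts from the service, the admins start from the component.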
33. New Methods: Structured Process
• Define a structured process for problem
investigation and root cause analysis
• Define how collaboration should occur
during root cause analysis between teams
34. New Methods: Tools
• Use tools that provide a historical
dimension for problem investigation
• Use tools that enable the correlation of
problems to configuration changes
• Use topology-based correlation instead of
rule-based (or manual) correlation
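One possible reading of topology-based correlation, sketched under simplifying assumptions: the topology is a dictionary of direct dependencies, and an alerting component is treated as a probable root cause only if nothing it depends on (directly or transitively) is also alerting. A second helper shows the idea of correlating problems to configuration changes within a time window. All component names, the change-record fields, and the 24-hour window are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical topology: component -> components it directly depends on.
DEPENDS_ON = {
    "web-01": ["app-01"],
    "app-01": ["db-01"],
    "db-01": ["san-01"],
    "san-01": [],
}


def transitive_dependencies(topology, component):
    """Return everything `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(topology.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology.get(node, []))
    return seen


def probable_root_causes(topology, alerting):
    """Keep alerting components with no alerting dependency beneath them.

    Events on the remaining components are treated as symptoms.
    """
    alerting = set(alerting)
    return sorted(
        c for c in alerting
        if not (transitive_dependencies(topology, c) & alerting)
    )


def recent_changes(changes, components, incident_start,
                   window=timedelta(hours=24)):
    """Return changes on the given components shortly before the incident."""
    return [
        change for change in changes
        if change["component"] in components
        and incident_start - window <= change["time"] <= incident_start
    ]


if __name__ == "__main__":
    roots = probable_root_causes(DEPENDS_ON, ["web-01", "app-01", "db-01"])
    changes = [
        {"component": "db-01", "time": datetime(2024, 1, 1, 10), "summary": "parameter change"},
        {"component": "web-01", "time": datetime(2024, 1, 1, 11), "summary": "certificate renewal"},
    ]
    print(roots)
    print(recent_changes(changes, set(roots), datetime(2024, 1, 1, 12)))
```

Unlike rule-based correlation, nothing here has to be written per event type: the same traversal works for any set of alerts as long as the dependency map is current. The sketch ignores cyclic dependencies between alerting components, which a real implementation would need to handle.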
Editor's notes
Introduction to the subject. Webinar logistics: presentation first, send questions during, answer questions at the end.
RCA is problematic even to define. ITIL definition -> useless. ITIL failed. Wikipedia: structured; factors; consequences; past events – I’ll call them symptoms.
Talk about each bullet
Many data sources (event feeds). All are mixed and funneled into the NOC. NOC needs to filter and make order in them based on: relevance; source / derived. But the NOC doesn’t have the tools or processes to do this. No structured way to do this filtering (though the NOC is used to structured processes like run book).
Taking care of the symptoms and not the problems. Associating wrong events -> figuring out the incorrect root cause.
NOC is used to structured processes (like run book). We don’t give them tools. We don’t give them structured processes (or any processes). They usually don’t possess cross-domain knowledge.
Isolation – diagnostics. The NOC’s investigation may lead to forwarding the problem to the wrong team, and therefore to the wrong analysis being done in the wrong context.
Explain each. How do they all tie together? Usually they don’t.
Problem detection begins with the symptoms. Same symptoms may be caused by different problems.
We need a combination of tools. Choose the right mix to assist in the RCA process. Need synergy between the methods.
Cross domain. Cross discipline. Require deep understanding.