The Elusive Root Cause Of IT Problems
And How To Easily Identify It


Noam Biran
Director of Product Management
Introduction
               Mr. Biran
               •    Director of Product Management at Neebula
               •    20 years of experience in systems management & BSM
               •    Innovation Product Management at BMC
               •    Co-founder of Appilog (now HP uCMDB & DDMA)



 About Neebula
  Neebula provides the first and only automatic service-centric IT management
  solution allowing IT organizations to improve the service provided to the business
  by shifting from managing disparate technology silos to managing the services
  running in the data center. Leveraging unique technology that automatically maps
  business services to the underlying infrastructure, Neebula enables the IT team to
  increase the availability of the main services they manage and reduce the time
  to repair problems.
Agenda
•   Introduction
•   Root cause analysis defined
•   The problem resolution process
•   Problem detection
•   Root cause analysis methods
•   Improving root cause analysis processes
Root Cause Analysis Definition
   ITIL V3
              An Activity that identifies the Root Cause of
              an Incident or Problem.
              Root Cause Analysis typically concentrates on
              IT Infrastructure failures.



  Wikipedia
              Root Cause Analysis is any structured
              approach to identify the factors that resulted
              in the harmful consequences of one or more
              past events
The importance of Root Cause Analysis
• Root Cause Analysis has a high impact on
  – IT processes
     • The efficiency of the overall incident/problem
       management process
     • Good RCA discipline requires well established
       configuration management
  – Organizational goals
     • Meeting internal and external SLAs
     • Financial (budget & revenue) implications
     • Brand / customer loyalty
Root Cause Analysis Nowadays
The Critical Role of Root Cause Analysis
• Improper (or lack of) identification of the real
  root cause may yield:
   – Repeating problems
   – Increased downtime
   – Waste of human resources on “fixing” the wrong issues
   – Risk to the business
The Life of The Operator
We expect the operator to
    – Handle thousands of cryptic events
    – Understand the impact on hundreds of services
    – Understand the correlation to
       customer service complaints
    – Understand what changed
    – Orchestrate the resolution
And to make these decisions within minutes to
reduce mean time to repair (MTTR)

   Are we giving our operators the tools to
   succeed?
Problem Resolution Process
Problem Resolution Process
• Events come in to the NOC
• NOC performs some investigation
• Root cause analysis is shared between NOC
  & 2nd/3rd level support (admins)
• Low level diagnostics & problem resolution
  is done by 2nd/3rd level support (admins)
Involved Parties & Tools

• Tools
  – Monitoring tools
  – Configuration management tools
• People
  – Users
  – NOC
  – Admins – specialized teams, each focused on a specific
    area, e.g. system, database, network
  – Application support / developers
The Common Process – Blame Game
•   No structured process
•   Lack of overall cross-domain view
•   Each team has its own terminology and view
•   Each team is working on its own
Problem Detection
Potential Problem Symptoms
• Lack of certain functionality
  – A certain transaction does not work
• Performance degradation
  – Fund transfer response time is above 2 sec.
• Availability issue
  – Application doesn’t work
• None
  – Unnoticeable failure due to high availability
    configuration
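A minimal sketch of how the symptom categories above could be told apart from a single probe result. The 2-second limit comes from the fund-transfer example on this slide; the ProbeResult structure and the classify_symptom function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the slide's fund-transfer example (2 sec.).
RESPONSE_TIME_LIMIT_SEC = 2.0


@dataclass
class ProbeResult:
    """Outcome of one probe against a service (hypothetical structure)."""
    reachable: bool                     # did the application answer at all?
    transaction_ok: bool                # did the specific transaction succeed?
    response_time_sec: Optional[float]  # None when no response was received


def classify_symptom(result: ProbeResult) -> str:
    """Map a probe result onto the symptom categories listed above."""
    if not result.reachable:
        return "availability issue"
    if not result.transaction_ok:
        return "lack of functionality"
    if (result.response_time_sec is not None
            and result.response_time_sec > RESPONSE_TIME_LIMIT_SEC):
        return "performance degradation"
    # No visible symptom. A failure may still be hidden behind a
    # high availability configuration, so "none" does not mean "healthy".
    return "none"
```

Note that the last branch only says the probe saw nothing wrong; it cannot rule out a failure that a redundant component is masking.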
Problem Detection
• Good problem detection methods are key for a
  structured root cause analysis process
• Problem detection tools should provide sufficient
  data to the root cause analysis process
• There are various distinct methods each with its
  pros and cons
• There is no single superior detection method
Detection – Users
• What it does
  – Compensates for unknown / unreported
    problems
• What it doesn’t
  – Seems accurate, but might actually point in
    the wrong direction
  – Usually comes too late for a quick fix,
    once the business is already impacted
Detection – Infrastructure Monitoring
• What it does
  – Monitor each technical element
    comprising the service
  – Great way to identify
    specific availability failures
• What it doesn’t
  – Hard to correlate with real user experience
  – Too many false positives
  – Lots of events on symptoms rather than on the actual problem
Detection – End User Experience
• What it does
  – Measure overall response time of user transactions
  – Synthetic or real user transactions
  – The ultimate problem detection method
• What it doesn’t
  – No real breakdown to assist
    in pinpointing the problem
    or even the domain
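As a rough illustration of the point above, a synthetic end user measurement could look like the sketch below. It is not taken from any particular monitoring product: time_transaction and fake_fund_transfer are hypothetical names, and the sleep call merely stands in for a real user transaction.

```python
import time
from typing import Callable, Tuple


def time_transaction(transaction: Callable[[], None]) -> Tuple[bool, float]:
    """Run one synthetic transaction; return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except Exception:
        # Any error counts as a failed transaction from the user's view.
        succeeded = False
    elapsed = time.perf_counter() - start
    return succeeded, elapsed


def fake_fund_transfer() -> None:
    """Stand-in for a real user transaction (hypothetical)."""
    time.sleep(0.05)
```

The result is a single success flag and a single overall time. Nothing in it says which tier or domain was slow, which is the limitation listed under "What it doesn’t".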
Detection – Transaction Breakdown
• What it does
  – Discovery of each transaction’s path
    within the data center
  – Highlight potential performance
    problems within the transaction
    execution
• What it doesn’t
  – No correlation to infrastructure
    monitoring
  – Cannot cover the entire data center
    (domain specific)
Detection – Domain Specific Tools
• What it does
  – Drill down in a specific application
  – Great analysis & diagnostics within an application
• What it doesn’t
  – No data center wide view
  – Lack of insight into the
    connections between
    applications
Detection – Synergy
Root Cause Analysis Methods
Potential Root Cause Types

•   Configuration change
•   Version upgrade
•   Hardware fault
•   Software bug
•   Capacity problem
•   Resource collision
Common Ways for Root Cause Analysis

•   War room scenario
•   The log file approach
•   APM tools
•   Transaction management
•   Manual event correlation / analysis
War Room Scenario

•   Getting everyone in the same room
•   Each team has its own data and terminology
•   Blame game
•   Takes a lot of time
The Log File Approach

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
APM Tools

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
Transaction Management

• A great tool to point to the probable area
  where the root cause resides
• Limited to specific domains
• Inability to correlate with infrastructure
  metrics / failures
Manual Event Correlation / Analysis

• Requires cross-domain expertise
• Requires understanding of dependencies
  between components
• Time consuming
• Lack of insight into other
  non-event data
Improving Root Cause Analysis Processes
Making The Best Of Existing Tools

• Choose problem detection methods that
  assist in the root cause analysis process
• Turn the root cause analysis into a
  structured process
  – Internal team processes
  – Inter-team processes
• Common language & visibility between
  teams
New Methods: Mapping

• Mapping of business services & applications
  to the supporting infrastructure
• Ties symptoms (user) to problems
  (technology)
• Introduces a common language between
  teams
• Enables a high level cross-domain view
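The mapping idea above can be sketched with a very small data structure. Everything here is an assumption made for illustration: the service names, the component names, and the helper functions are invented, and a real service map would be discovered automatically rather than typed in by hand.

```python
from typing import Dict, List, Set

# Hypothetical service map: each business service and the components it runs on.
SERVICE_MAP: Dict[str, Set[str]] = {
    "fund-transfer": {"web-01", "app-01", "db-01"},
    "online-statements": {"web-01", "app-02", "db-01"},
    "branch-locator": {"web-02", "app-03"},
}


def impacted_services(component: str, service_map: Dict[str, Set[str]]) -> List[str]:
    """Technology to user: which business services depend on this component?"""
    return sorted(name for name, parts in service_map.items() if component in parts)


def shared_components(services: List[str], service_map: Dict[str, Set[str]]) -> Set[str]:
    """User to technology: components common to all the complaining services."""
    sets = [service_map[name] for name in services]
    return set.intersection(*sets) if sets else set()
```

With such a map, a database alert can be phrased as "fund-transfer and online-statements are at risk", and two simultaneous user complaints can be narrowed to the components both services share. That shared vocabulary is what gives the teams a common language.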
New Methods: Structured Process

• Define a structured process for problem
  investigation and root cause analysis
• Define how collaboration should occur
  during root cause analysis between teams
New Methods: Tools

• Use tools that provide a historical
  dimension for problem investigation
• Use tools that enable the correlation of
  problems to configuration changes
• Use topology-based correlation instead of
  rule-based (or manual) correlation
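The three recommendations above can be combined in one small sketch: walk the topology of the affected service, then look back in time for configuration changes on the components it relies on. This is an illustrative outline under stated assumptions, not a description of any vendor's implementation. The DEPENDS_ON graph, the Change record, and both functions are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Set

# Hypothetical topology: each node lists what it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "fund-transfer": ["web-01"],
    "web-01": ["app-01"],
    "app-01": ["db-01"],
    "db-01": [],
    "branch-locator": ["web-02"],
    "web-02": [],
}


class Change(NamedTuple):
    """One recorded configuration change (hypothetical record)."""
    component: str
    when: datetime
    description: str


def supporting_components(service: str, topology: Dict[str, List[str]]) -> Set[str]:
    """Walk the dependency graph and collect everything the service relies on."""
    seen: Set[str] = set()
    stack = list(topology.get(service, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology.get(node, []))
    return seen


def suspect_changes(
    service: str,
    topology: Dict[str, List[str]],
    changes: List[Change],
    detected_at: datetime,
    lookback: timedelta,
) -> List[Change]:
    """Changes on the service's components inside the lookback window, newest first."""
    scope = supporting_components(service, topology)
    window_start = detected_at - lookback
    hits = [
        change for change in changes
        if change.component in scope and window_start <= change.when <= detected_at
    ]
    return sorted(hits, key=lambda change: change.when, reverse=True)
```

No correlation rule is written per event type here. The scope of the search comes from the topology itself, so the same function keeps working when the service's infrastructure changes, provided the map is kept current.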
The elusive root cause

  • 1. The Elusive Root Cause Of IT Problems And How To Easily Identify It Noam Biran Director of Product Management
  • 2. Introduction Mr. Biran • Director of Product Management at Neebula • 20 years experience in systems management & BSM • Innovation Product Management at BMC • Co-founder of Appilog (now HP uCMDB & DDMA) About Neebula Neebula provides the first and only automatic service-centric IT management solution allowing IT organizations to improve the service provided to the business by shifting from managing disparate technology silos to managing the services running in the data center. Leveraging unique technology that automatically maps business services to the underlying infrastructure, Neebula enables the IT team to increase availability of the main services they manage and reduce the time to repair of problems.
  • 3. Agenda • Introduction • Root cause analysis defined • The problem resolution process • Problem detection • Root cause analysis methods • Improving root cause analysis processes
  • 4. Root Cause Analysis Definition ITIL V3 An Activity that identifies the Root Cause of an Incident or Problem. Root Cause Analysis typically concentrates on IT Infrastructure failures. Wikipedia Root Cause Analysis is any structured approach to identify the factors that resulted in the harmful consequences of one or more past events
  • 5. The importance of Root Cause Analysis • Root Cause Analysis has a high impact on – IT processes • The efficiency of the overall incident/problem management process • Good RCA discipline requires well established configuration management – Organizational goals • Meeting internal and external SLAs • Financial (budget & revenue) implications • Brand / customer loyalty
  • 7. The Critical Role of Root Cause Analysis • Improper (or lack of) identification of the real root cause may yield: – Repeating problems – Increased downtime – Waste of human resources on “fixing” the wrong issues – Risk to the business
  • 8. The Life of The Operator We expect the operator – To handle 1000’s of cryptic events – Understand impact on 100’s of services – Understand the correlation to customers service complaints – Understand what changed – Orchestrate the resolution And make these decisions within minutes to reduce MTTR Are we giving our operators the tools to succeed?
  • 10. Problem Resolution Process • Events coming in to the NOC • NOC performs some investigation • Root cause analysis is shared between NOC & 2nd/3rd level support (admins) • Low level diagnostics & problem resolution is done by 2nd/3rd level support (admins)
  • 11. Involved Parties & Tools • Tools – Monitoring tools – Configuration management tools • People – Users – NOC – Admins – specialized teams focused on specific area, e.g. system, database, network – Application support / developers
  • 12. The Common Process – Blame Game • No structured process • Lack of overall cross-domain view • Each team has its own terminology and view • Each team is working on its own
  • 14. Potential Problem Symptoms • Lack of certain functionality – A certain transaction does not work • Performance degradation – Fund transfer response time is above 2 sec. • Availability issue – Application doesn’t work • None – Unnoticeable failure due to high availability configuration
  • 15. Problem Detection • Good problem detection methods are key for a structured root cause analysis process • Problem detection tools should provide sufficient data to the root cause analysis process • There are various distinct methods each with its pros and cons • There is no single superior detection method
  • 16. Detection – Users • What it does – Compensates for unknown / unreported problems • What it doesn’t – Supposedly accurate – actually might point in the wrong direction – Usually takes place too late for a quick fix & impact to business
  • 17. Detection – Infrastructure Monitoring • What it does – Monitor each technical element comprising the service – Great way to identify specific availability failures • What it doesn’t – Hard to correlate with real user experience – Too many false positives – Lots of events on symptoms rather on actual problem
  • 18. Detection – End User Experience • What it does – Measure overall response time of user transactions – Synthetic or real user transactions – The ultimate problem detection method • What it doesn’t – No real breakdown to assist in pinpointing the problem or even the domain
  • 19. Detection – Transaction Breakdown • What it does – Discovery of each transaction’s path within the data center – Highlight potential performance problems within the transaction execution • What it doesn’t – No correlation to infrastructure monitoring – Cannot cover the entire data center – domain specific
  • 20. Detection – Domain Specific Tools • What it does – Drill down in a specific application – Great analysis & diagnostics within an application • What it doesn’t – No data center wide view – Lack of insight into the connections between applications
  • 23. Potential Root Cause Types • Configuration change • Version upgrade • Hardware fault • Software bug • Capacity problem • Resource collision
  • 24. Common Ways for Root Cause Analysis • War room scenario • The log file approach • APM tools • Transaction management • Manual event correlation / analysis
  • 25. War Room Scenario • Getting everyone in the same room • Each has its own data and terminology • Blame game • Takes a lot of time
  • 26. The Log File Approach • An admin sits and analyzes log files and other historical data from various sources • A domain specific approach • Certain degree of structured process • Might identify problems that are not the root cause (distractions)
  • 27. APM Tools • An admin sits and analyzes log files and other historical data from various sources • A domain specific approach • Certain degree of structured process • Might identify problems that are not the root cause (distractions)
  • 28. Transaction Management • A great tool to point to the probable area where the root cause resides • Limited to specific domains • Inability to correlate with infrastructure metrics / failures
  • 29. Manual Event Correlation / Analysis • Requires cross-domain expertise • Requires understanding of dependencies between components • Time consuming • Lack of insight into other non-event data
  • 30. Improving Root Cause Analysis Processes
  • 31. Making The Best Of Existing Tools • Choose problem detection methods that assist in the root cause analysis process • Turn the root cause analysis into a structured process – Internal team processes – Inter-team processes • Common language & visibility between teams
  • 32. New Methods: Mapping • Mapping of Business service & applications and the supporting infrastructure • Ties symptoms (user) to problems (technology) • Introduces a common language between teams • Enables a high level cross-domain view
  • 33. New Methods: Structured Process • Define a structured process for problem investigation and root cause analysis • Define how collaboration should occur during root cause analysis between teams
  • 34. New Methods: Tools • Use tools that provide a historical dimension for problem investigation • Use tools that enable the correlation of problems to configuration changes • Use topology based correlation instead of rule based (or manual based) correlation
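
One way to picture the topology based correlation recommended on slide 34 is the minimal sketch below. It is not the presenter's or any vendor's implementation: the `DEPENDS_ON` map, the element names, and both function names are hypothetical, chosen only for illustration. The idea is that, given a map of which element depends on which, an alerting element is a root cause candidate only if nothing it depends on is also alerting; everything else is treated as a symptom.

```python
from typing import Dict, List, Set

# Hypothetical service map: each element lists the elements it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "fund-transfer-service": ["web-server", "app-server"],
    "web-server": ["load-balancer"],
    "app-server": ["database"],
    "database": ["storage-array"],
    "load-balancer": [],
    "storage-array": [],
}


def dependencies_of(element: str, topology: Dict[str, List[str]]) -> Set[str]:
    """Return everything `element` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(topology.get(element, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology.get(node, []))
    return seen


def root_cause_candidates(alerting: Set[str],
                          topology: Dict[str, List[str]]) -> Set[str]:
    """Keep only alerting elements with no alerting dependency beneath them."""
    return {
        element
        for element in alerting
        if not (dependencies_of(element, topology) & alerting)
    }


# Three events arrive, but only one element has no failing dependency.
events = {"fund-transfer-service", "app-server", "database"}
print(root_cause_candidates(events, DEPENDS_ON))
```

With these assumed inputs the service and the app server are filtered out as symptoms of the database event. Unlike rule based correlation, nothing has to be rewritten when the map changes.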

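Slide 34 also asks for tools that add a historical dimension and correlate problems to configuration changes. The sketch below shows one possible reading of that idea. The `Change` record, the `suspect_changes` function, the 24-hour lookback, and all sample data are assumptions made for illustration, not part of the presentation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set


@dataclass
class Change:
    """A hypothetical configuration change record."""
    element: str
    when: datetime
    description: str


def suspect_changes(changes: List[Change],
                    affected: Set[str],
                    problem_start: datetime,
                    lookback: timedelta = timedelta(hours=24)) -> List[Change]:
    """Return changes to affected elements made shortly before the problem.

    Results are ordered most recent first, on the assumption that the
    latest change before the problem is the first one worth checking.
    """
    window_start = problem_start - lookback
    hits = [
        c for c in changes
        if c.element in affected and window_start <= c.when <= problem_start
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Illustrative data only.
history = [
    Change("database", datetime(2024, 1, 10, 9, 0), "index rebuild"),
    Change("database", datetime(2024, 1, 8, 9, 0), "patch applied"),
    Change("mail-server", datetime(2024, 1, 10, 9, 30), "version upgrade"),
    Change("app-server", datetime(2024, 1, 10, 9, 45), "config edit"),
]
suspects = suspect_changes(history, {"database", "app-server"},
                           datetime(2024, 1, 10, 10, 0))
for change in suspects:
    print(change.when, change.element, change.description)
```

The set of affected elements would normally come from the service map, so that changes to unrelated systems (the mail server here) are not considered.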
Editor's notes

  1. Introduction to the subject. Webinar logistics: presentation first, send questions during, answer questions at the end.
  2. RCA is problematic even to define. ITIL definition -> useless. ITIL failed. Wikipedia: structured; factors; consequences; past events (I’ll call them symptoms).
  3. Talk about each bullet
  4. Many data sources (event feeds). All are mixed and funneled into the NOC. The NOC needs to filter them and bring order to them based on: relevance; source / derived. But the NOC doesn’t have the tools or processes to do this. There is no structured way to do this filtering (though the NOC is used to structured processes like run books).
  5. Taking care of the symptoms and not the problems. Associating wrong events -> figuring out the incorrect root cause.
  6. The NOC is used to structured processes (like run books). We don’t give them tools. We don’t give them structured processes (or any processes). They usually don’t possess cross-domain knowledge.
  7. Isolation – diagnostics. The NOC’s investigation may result in forwarding to the wrong team, and therefore in the wrong analysis being done in the wrong context.
  8. Explain each. How do they all tie together? Usually they don’t.
  9. Problem detection begins with the symptoms. The same symptoms may be caused by different problems.
  10. We need a combination of tools. Choose the right mix to assist in the RCA process. Need synergy between the methods.
  11. Cross domain. Cross discipline. Requires deep understanding.
  12. Not a structured approach