Finger Pointing
    Mahendra Kutare
 mahendra@boundary.com
   twitter - @imaxxs
FingerPointing ?

FingerPointing is a way through
which humans communicate
emotions of urgency, surprise, joy,
acknowledgment, achievement,
blame, frustration, fear and more.
FingerPointing ?




Some do it with one..
                    Some need two..
FingerPointing ?




Some do it with one..
                        Some need two..
Systems FingerPointing ?




    Some do it everywhere...
Human Computer FingerPointing ?




        Some do it with....
Systems Control Loop
[Diagram: control loop with stages Monitor, Collect Info, Analysis, Act and Recover; intervals Time to Collect, Time to Detect/Analyze and Time to Recover; scope axis from Local to Global]
Systems Control Loop
[Diagram: the same loop with components Meter, Collector, Engine and Recover; intervals Time to Collect, Time to Detect/Analyze and Time to Recover; scope axis from Local to Global]
Problem Determination

Detection - Identifies violations or
anomalies.
Diagnosis - Analyzes violations or
anomalies.
Remediation - Recovers the
system to normal state
Detection

Threshold
Signature
Anomaly
Detection
Thresholds - Matching single value/predicate.

Signature - Matching faults with known fault
signatures. It can detect a set of known faults.

Anomalies - Learn to recognize the normal
runtime behavior. It can detect previously
unseen faults.
Aniketos
 No use of statistical machine learning.

 Uses computational geometry - convex hull.

 Convex hull - Encompassing shape around a
 group of points.

 Works independently of whether metrics are
 correlated or not.


Stehle, Lynch et al., ICAC 2010
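
A minimal 2D sketch of the convex hull idea, not the Aniketos implementation: it builds a hull around training points with Andrew's monotone chain and classifies a new point as normal if it lies inside or on the hull. The function names and the training points are assumptions made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def is_normal(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: too little training data to judge
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


# Hypothetical training samples of two metrics (A, B).
training = [(5, 10), (6, 18), (9, 12), (10, 20), (7, 15), (8, 11)]
hull = convex_hull(training)
```

With these made-up points, (7, 15) is classified as normal, while (5, 20) is rejected even though each coordinate is within the range seen in training, because the hull follows the shape of the data rather than per-metric ranges.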
Fault Detection
Training Phase

No one knows when enough training data is
collected.

If a system has an extensive test suite that
represents normal behavior, then executing
the test suite will produce a good training
dataset.

Replay request logs of production system on
test system.
Bounded Box Example
Given two metrics A and B, if the safe range of A
is 5 to 10 and B is 10 to 20, the normal behavior of
the system can be represented as a 2D rectangle
with vertices (5,10), (5,20), (10,20) and (10,10).

Any datapoint that falls within that rectangle, for
example (7,15), is classified as normal.

Any datapoint that falls outside of the rectangle,
for example (15,15) is classified as anomalous.
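
The bounded box example can be sketched in a few lines. The ranges are the ones from the slide; the dictionary layout and the classify function are assumptions for illustration.

```python
from typing import Dict, Tuple

# Safe range per metric, taken from the example: A in [5, 10], B in [10, 20].
SAFE_RANGES: Dict[str, Tuple[float, float]] = {"A": (5, 10), "B": (10, 20)}


def classify(sample: Dict[str, float]) -> str:
    """Return 'normal' if every metric is within its safe range."""
    inside = all(lo <= sample[name] <= hi
                 for name, (lo, hi) in SAFE_RANGES.items())
    return "normal" if inside else "anomalous"


print(classify({"A": 7, "B": 15}))   # the (7,15) datapoint
print(classify({"A": 15, "B": 15}))  # the (15,15) datapoint
```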
Detection Phase
Egress/Ingress Data




volume_1s_meter_ip query, 6000 data points
Egress/Ingress Data




volume_1s_meter_ip query, 150,000 data points
Fault Detection Comparison




Maximum fault coverage, traded off against false positives
Diagnosis

Dependency Inference
Correlation Analysis
Peer Analysis
E2EProf
Useful for debugging distributed systems of black boxes.




         Sandeep et al., DSN 2007
Service Paths

Client requests take different “paths” through the
software invoking dynamic dependencies across
distributed systems. Ensemble of paths taken by
client requests - “Service Paths”

Key idea - Convert message traces per service
node to per edge signals and compute cross
correlations of these signals.
Path Discovery
A request path VC1->VS1->VS2->VS4

Collect timestamp and source/dest IP at each VS
node.

Calculate the cross correlation between time
series signals across VS nodes.

If the cross correlation has a spike at a phase
lag equal to the latency between two nodes, a
path/edge exists between those VS nodes.
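
A rough sketch of the spike test, not the E2EProf algorithm itself: it treats each signal as a list of message counts per time bucket, computes a normalized cross correlation at each candidate lag, and reports an edge when the best lag clears a threshold. The function names, the 0.8 threshold and the sample signals are assumptions.

```python
from typing import List, Optional


def cross_correlation(x: List[float], y: List[float], lag: int) -> float:
    """Normalized correlation between x[t] and y[t + lag] over their overlap."""
    n = min(len(x), len(y)) - lag
    if n < 2:
        return 0.0
    xs, ys = x[:n], y[lag:lag + n]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den if den else 0.0


def find_edge(x: List[float], y: List[float], max_lag: int,
              threshold: float = 0.8) -> Optional[int]:
    """Return the lag of the correlation spike, or None if no lag is strong enough."""
    best_lag = max(range(max_lag + 1), key=lambda k: cross_correlation(x, y, k))
    return best_lag if cross_correlation(x, y, best_lag) >= threshold else None


# Hypothetical per-bucket message counts seen at VS1, and the same traffic
# arriving at VS2 two buckets later.
vs1 = [0, 3, 0, 0, 5, 0, 1, 0, 0, 4, 0, 0, 2, 0, 0, 0]
vs2 = [0, 0] + vs1[:-2]

print(find_edge(vs1, vs2, max_lag=5))
```

Here the returned lag is an estimate of the latency between the two nodes, measured in time buckets.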
App Vis




   Network topology view
Augment with “service paths” ??
Remediation
Software Rejuvenation for Software Aging

  Reactive - Reboots, Micro Reboots

  Proactive - Time or load based

Checkpointing and Recovery

Treating bugs as allergies
Software Aging

The Patriot missile system, used during the Gulf
War to destroy Iraq’s Scud missiles, relied on a
computer whose software accumulated errors,
i.e. software aging.

The effect of aging in this case was
misinterpretation of an incoming Scud as not a
missile but just a false alarm, which resulted
in the death of 28 US soldiers.
Software Rejuvenation

Periodic preemptive rollback of continuously running
applications to prevent failures in the future.

Open - Not based on feedback from the system -
Elapsed Time, Cumulative jobs in system

Closed - Based on some notion of system health.
Continuously monitor and analyze the estimated time to
exhaustion of a resource.


    Trivedi et al., Duke University.
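
One simple way to picture the closed approach, offered as a sketch and not as the published method: fit a straight line to recent resource usage samples, extrapolate to the point where the resource runs out, and rejuvenate if that point is within a chosen horizon. The linear model, the function names and the sample numbers are all assumptions.

```python
from typing import List, Optional, Tuple


def time_to_exhaustion(samples: List[Tuple[float, float]],
                       capacity: float) -> Optional[float]:
    """Seconds after the last sample until a least-squares linear trend
    reaches capacity; None if usage is not growing."""
    n = len(samples)
    if n < 2:
        return None
    mt = sum(t for t, _ in samples) / n
    mu = sum(u for _, u in samples) / n
    var = sum((t - mt) ** 2 for t, _ in samples)
    if var == 0:
        return None
    slope = sum((t - mt) * (u - mu) for t, u in samples) / var
    if slope <= 0:
        return None
    intercept = mu - slope * mt
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_rejuvenate(samples: List[Tuple[float, float]],
                      capacity: float, horizon: float) -> bool:
    """Closed-loop decision: rejuvenate if exhaustion is predicted within horizon."""
    eta = time_to_exhaustion(samples, capacity)
    return eta is not None and eta <= horizon


# Hypothetical (time in seconds, memory in MB) samples of a leaking process.
usage = [(0, 100), (60, 110), (120, 120), (180, 130)]
print(time_to_exhaustion(usage, capacity=200))
```

An open policy, by contrast, would ignore the samples and trigger purely on elapsed time or job count.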
Apache Web Server
MaxRequestsPerChild - If this value is set
to a positive value, then the parent
process of Apache kills a child process as
soon as MaxRequestsPerChild requests
have been handled by this child process.

By doing this, Apache limits “the amount
of memory a process can consume by
accidental memory leak” and “helps reduce
the number of processes when server load
reduces.”
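
The effect of recycling a worker after a fixed number of requests can be illustrated with a toy simulation. This is not how Apache is implemented; the Worker class, the leak size and the serve function are hypothetical, and only the convention that a limit of 0 means "never recycle" mirrors the directive.

```python
class Worker:
    """Hypothetical worker that leaks a little memory on every request."""

    LEAK_PER_REQUEST = 1  # arbitrary units, assumed for illustration

    def __init__(self) -> None:
        self.handled = 0
        self.leaked = 0

    def handle(self) -> None:
        self.handled += 1
        self.leaked += self.LEAK_PER_REQUEST


def serve(total_requests: int, max_requests_per_child: int) -> int:
    """Serve requests, replacing the worker once it reaches the limit.

    Returns the peak leaked memory held by any single worker.
    A limit of 0 means the worker is never recycled.
    """
    worker = Worker()
    peak = 0
    for _ in range(total_requests):
        worker.handle()
        peak = max(peak, worker.leaked)
        if 0 < max_requests_per_child <= worker.handled:
            worker = Worker()  # the parent replaces the child process
    return peak


print(serve(1000, 0))    # never recycled: the leak grows with every request
print(serve(1000, 100))  # recycled every 100 requests: the leak is bounded
```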
Treating Bugs as Allergies

 Inspired by allergy treatment in real life. If
 you are allergic to milk, remove dairy
 products from your diet.

 Roll back the program to a recent checkpoint
 when a bug is detected, dynamically change
 the execution environment based on failure
 symptoms, and then re-execute the program
 in the modified environment.

     Qin et al., SOSP 2005
Treating Bugs As Allergies
Examples

Uninitialized reads may be avoided if every
newly allocated buffer is filled with zeros.

Data races can be avoided by changing timing-
related events such as thread scheduling and
asynchronous events.
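
The rollback, change environment, re-execute cycle can be sketched as a retry loop. This is a toy model under stated assumptions, not the Rx system: the checkpoint is a deep copy of a dictionary, the environment is a dictionary of made-up flags, and buggy_task simulates an uninitialized read that only succeeds when buffers are zero-filled.

```python
import copy
from typing import Callable, Dict, List, Optional

Env = Dict[str, bool]


def run_with_rx(task: Callable[[dict, Env], None], state: dict,
                env_changes: List[Env]) -> Optional[Env]:
    """Run task; on failure, roll state back to the checkpoint and retry
    under each environment change. Returns the environment that worked."""
    checkpoint = copy.deepcopy(state)
    for env in [{}] + env_changes:  # first attempt uses the unmodified environment
        try:
            task(state, env)
            return env
        except Exception:
            # Roll back to the checkpoint before trying the next environment.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    return None


def buggy_task(state: dict, env: Env) -> None:
    """Simulated uninitialized read: the buffer holds garbage unless zero-filled."""
    buffer = [0] * 4 if env.get("zero_fill") else [None] * 4
    state["attempts"] = state.get("attempts", 0) + 1
    state["total"] = sum(buffer)  # raises TypeError when the buffer is garbage


state = {"total": None}
# Hypothetical environment changes, tried in order.
survived_with = run_with_rx(buggy_task, state,
                            [{"pad_allocations": True}, {"zero_fill": True}])
print(survived_with, state)
```

Because each failed attempt is rolled back, the final state records a single attempt, as if the failed executions had never happened.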
Environment Changes
Comparison of Rx and Alternative Approaches




For systems where reboot ~5sec is not good enough
   Checkpoint, Replay bounded by reboot ~5sec