Cognitive Behavior Analysis framework for Fault Prediction in Cloud Computing
1. Cognitive Behavior Analysis
framework for Fault Prediction
in Cloud Computing
(NoF’12, Nov 21st-23rd, 2012, Tunis, Tunisia)
Reza FARRAHI MOGHADDAM, Fereydoun FARRAHI MOGHADDAM,
Vahid ASGHARI, Mohamed CHERIET
Synchromedia Lab, ETS, University of Quebec, Montreal, Quebec, Canada
Laboratory for Multimedia
Communication in Telepresence
2. Outline
Motivation for Behavior Analysis (BA) and
Failure Prediction
Proposed BA framework
Probabilistic Behavior Analysis
Simulated Probabilistic Behavior Analysis
Behavior-Time Profile Modeling and Analysis
Scalability of the Proposed BA framework
Conclusions and Future Prospects
11/23/2012 NoF’12 2
3. Why Behavior Analysis (BA)?
Benefits of BA for Failure Prediction
Preventing Service-Layer or System-Level failures
Enabling operation in “unallowable” states to save
energy and cost, and also to reduce footprint
Profiling the Actors
Profiling end users, service providers, and other
actors in a computing business (for example, a
telecom business)
The ensemble of these actors resembles an ecosystem more than
a system
Profiling helps in:
• Smart management of resources
• Building reputations and trust for actors
• Identifying and isolating wrong-acting actors and threats
4. Why Failure Prediction?
A new failure source: Cyclic ElastoPlastic Operation (CEPO)
[Figure: failure sources: hardware factor, software factor, human factor, middleware factor, other factors, and cyclic elastoplastic operation]
5. Cyclic elastoplastic operation (CEPO):
in Civil and Mechanical Engineering
Safe operation in plastic mode
Repeatable transitions between elastic and
plastic modes
Cyclic operation is the key
[Figure: elastic regime, plastic regime, and collapse point]
6. Cyclic elastoplastic operation (CEPO):
its counterparts in Computing Systems
Carbon Enabling Effect and Green Push: Doing more with less
1. PUE of Data centers
Increasing inlet air flow temperature (2-4% energy saving per 1°C increase)
For example: starting from PUE = 1.5, a 20% saving (5°C increase) gives PUE = 1.2
Reducing or eliminating fans
Failure at component level (servers) increases with temperature (ASHRAE TC 9.9, 2011)
Failure Prediction and Behavior Analysis can isolate component-level failures
(even before their occurrence) in order to prevent system-level failures (which
may violate SLO constraints)
Again, cyclic operation is the key to success
2. Can this be applied to bandwidth too?
Uncertainty increases with the length of stay in the plastic mode
[Figure: stress on the system versus inlet temperature, showing the allowable elastic range, the elastic mode, the plastic mode, and the bearable stress level]
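The PUE arithmetic on this slide can be checked with a short sketch. It assumes (this is not stated on the slide) that the percentage saving applies to total facility energy while the IT load stays unchanged, so the PUE scales by the same factor. The function names are hypothetical.

```python
def saving_range(delta_celsius: float, low: float = 0.02, high: float = 0.04):
    """Energy-saving range for an inlet temperature increase.

    Uses the 2-4% saving per 1 degree C quoted on the slide.
    """
    return (low * delta_celsius, high * delta_celsius)


def pue_after_saving(pue: float, saving_fraction: float) -> float:
    """PUE after cutting total facility energy by saving_fraction.

    Assumption: IT energy is unchanged, so PUE = total / IT scales
    with the total. PUE cannot drop below 1.0 by definition.
    """
    return max(1.0, pue * (1.0 - saving_fraction))


if __name__ == "__main__":
    low, high = saving_range(5.0)          # 5 degree C increase
    print(f"saving range: {low:.0%} to {high:.0%}")
    print(f"PUE at upper bound: {pue_after_saving(1.5, high):.2f}")
```

Under these assumptions, the slide's 20% figure corresponds to the upper end of the 2-4% per degree range.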
7. The Proposed BA framework
An Ensemble-of-Experts approach:
The sub-paradigms
• Probabilistic Behavior Analysis
• Simulated Probabilistic Behavior Analysis
• Behavior-Time Profile Modeling and Analysis
Two different pictures:
Systemic picture
Ecosystemic picture
9. BA Framework:
Ecosystemic picture
10. Multiple layers in BA framework
Layers vs (physical and non-physical) location: Toward Location Intelligence in Computing systems
Various layers:
• Hardware (Compute/Network)
• Hardware Drivers/Software
• Middleware/Protocols
• Virtualware
• Virtualware Drivers/Software
• Applications (Software)
11. Sub-paradigm 1:
Probabilistic Behavior Analysis
Each layer of the system is considered as a graph
Sub-graphs constitute super-components of
higher levels (vertical scaling)
The behavior is modeled as the Probability of Availability (PoA)
The PoA is related to the CDF of failure:
The Differential Density Function (DDF):
12. Sub-paradigm 1:
Probabilistic Behavior Analysis
An example of a 2-component system:
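As a minimal sketch of how failure CDFs of two components might combine into the CDF of a higher-level sub-graph, the code below uses textbook series and parallel compositions. It is not necessarily the example shown on the slide. It assumes independent components, exponential failure CDFs chosen purely for illustration, and PoA(t) = 1 - CDF(t); all function names are hypothetical.

```python
import math
from typing import Callable

Cdf = Callable[[float], float]


def exp_cdf(rate: float) -> Cdf:
    """Exponential failure CDF (illustrative assumption only)."""
    return lambda t: 1.0 - math.exp(-rate * t)


def series_cdf(f1: Cdf, f2: Cdf) -> Cdf:
    """Sub-graph fails as soon as either component fails."""
    return lambda t: 1.0 - (1.0 - f1(t)) * (1.0 - f2(t))


def parallel_cdf(f1: Cdf, f2: Cdf) -> Cdf:
    """Sub-graph fails only when both (redundant) components have failed."""
    return lambda t: f1(t) * f2(t)


def poa(cdf: Cdf) -> Cdf:
    """Probability of Availability, assumed here to be 1 - CDF."""
    return lambda t: 1.0 - cdf(t)


if __name__ == "__main__":
    a, b = exp_cdf(0.1), exp_cdf(0.2)
    for t in (1.0, 5.0, 10.0):
        print(t, poa(series_cdf(a, b))(t), poa(parallel_cdf(a, b))(t))
```

The same two functions can be nested to build super-components from sub-graphs, which mirrors the vertical scaling described for this sub-paradigm.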
15. Sub-paradigm 2:
Simulated Probabilistic Behavior Analysis
For highly complex system topologies, the CDFs of
high-level sub-graphs and components are estimated
using simulation based on the CDFs of basic components
It can also be used to validate the calculations of the
first sub-paradigm
A Monte Carlo strategy is used
In each run, the fault time of each basic component is
drawn randomly from its CDF
The cumulative behavior of all runs of the high-level
sub-graph is used to estimate its CDF
1000-run simulations have been used
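The Monte Carlo procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: component fault times are drawn from exponential CDFs by inverse-transform sampling, components are independent, and the sub-graph's fault time is a simple function of its components' fault times (`min` for a series arrangement, `max` for a redundant one). All names are hypothetical.

```python
import bisect
import math
import random


def sample_fault_time(rate: float, rng: random.Random) -> float:
    """Draw a fault time from an exponential CDF by inverse transform."""
    return -math.log(1.0 - rng.random()) / rate


def simulate_subgraph(rates, combine=min, runs=1000, seed=0):
    """Return the sorted sub-graph fault times over all runs.

    combine=min models a series sub-graph (first component fault is fatal);
    combine=max models a fully redundant one.
    """
    rng = random.Random(seed)
    times = [combine(sample_fault_time(r, rng) for r in rates)
             for _ in range(runs)]
    return sorted(times)


def empirical_cdf(sorted_times, t: float) -> float:
    """Fraction of runs in which the sub-graph had failed by time t."""
    return bisect.bisect_right(sorted_times, t) / len(sorted_times)


if __name__ == "__main__":
    times = simulate_subgraph([1.0, 2.0], combine=min, runs=1000)
    for t in (0.1, 0.3, 1.0):
        exact = 1.0 - math.exp(-3.0 * t)   # closed form for this series case
        print(t, empirical_cdf(times, t), exact)
```

Comparing the empirical CDF against a closed form, where one exists, is the validation role the slide assigns to this sub-paradigm.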
16. Sub-paradigm 2:
Simulated Probabilistic Behavior Analysis
[Figure: MC simulation results for G_1,1 and G_2,1]
17. Sub-paradigm 2:
Simulated Probabilistic Behavior Analysis
[Figure: MC simulation results: CDFs and DDFs]
18. Sub-paradigm 3:
Behavior-Time Profile Modeling and Analysis
Time-profile of component characteristics collected
by opportunistic agents across the system (or
ecosystem)
Time-profile of state transitions in components and
also higher level sub-graphs at various layers
collected or injected by BSU
Machine learning methods are used to match the
state transitions with the characteristics
Support Vector Machine (SVM)
Bayesian networks
Agent-based data mining
Fuzzy logic
···
19. Sub-paradigm 3:
Behavior-Time Profile Modeling and Analysis
Four motivations for behavior-time profile
analysis:
Spontaneous faults have decreased significantly relative to
cause-and-effect faults
• Fewer purely hardware-caused faults compared to interaction-caused faults
Patterns and cycles in fault occurrence and, more
generally, in behavior
Handling of faulty systems that do not have any
faulty components
• Context-sensitive diagnosis [Lamperti2011]
Handling of gradual events
20. Sub-paradigm 3:
Behavior-Time Profile Modeling and Analysis
A simple example:
21. SLA and Service Grading
Even without considering the elastoplastic use case, BA can help in
upgrading a service (for example, to the telco grade)
Probability of Availability (PoA): Lease-based business models
Predicting, isolating and resolving failure events at component or sub-
system levels before they get to the Service Layer.
Probability of Completion (PoC): Task-based business models
Countermeasure options:
Take high-risk components out of service (maintenance tickets)
Temporal redundancy
But all this depends on the ability to predict high risk or failure
An example:
No BA: major fault mode with MTBF = 10 weeks, MTTR = 10 minutes
gives 52:09 minutes of downtime a year, which is below 52:33, so four nines
With BA: 90% of faults are detected 15 minutes before system failure,
giving 5:13 minutes of downtime a year, which is below 5:15, so five nines
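The downtime figures in this example can be reproduced with a short calculation. The sketch assumes a 365-day year and that every fault predicted by BA is resolved before it causes any downtime; both assumptions are inferred from the numbers, not stated on the slide. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
WEEKS_PER_YEAR = 365 / 7


def downtime_minutes(mtbf_weeks: float, mttr_minutes: float,
                     detected_fraction: float = 0.0) -> float:
    """Expected yearly downtime, counting only undetected faults."""
    failures_per_year = WEEKS_PER_YEAR / mtbf_weeks
    return failures_per_year * mttr_minutes * (1.0 - detected_fraction)


def nines_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability of `nines` nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def as_mmss(minutes: float) -> str:
    """Format minutes as mm:ss, rounded to the nearest second."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


if __name__ == "__main__":
    no_ba = downtime_minutes(10, 10)
    with_ba = downtime_minutes(10, 10, detected_fraction=0.9)
    print(as_mmss(no_ba), no_ba < nines_budget_minutes(4))      # 52:09 True
    print(as_mmss(with_ba), with_ba < nines_budget_minutes(5))  # 5:13 True
```

Both cases sit just inside their budgets, so the grade gained depends directly on the detection rate: a rate below roughly 90% would leave this fault mode short of five nines.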
22. Countermeasures and
cost savings
Two alternative modes to save both energy (cost) and life
expectancy of components
An example: Full system
24. Conclusions and Future
Prospects
A multi-paradigm, multi-layer, multi-level cognitive behavior analysis
framework is introduced
Three sub-paradigms (cross-cover):
Statistical inference
Statistical inference by means of simulation
Time-profile modeling and analysis
Multiple granularity analysis and scalability:
Horizontal, vertical and hierarchical scaling
Including other layers in the analysis: virtualware and middleware
Estimation of PoA to improve system dependability and its service grade
A new distribution is introduced: Tanh distribution
validated on a real database: lanl05 database
Future Prospects:
Large-scale operation of each sub-paradigm
Cognitive Response: Multi-Expert Decision Making, Cognitive Models
Integration of the framework with real computing systems:
• OpenStack, Open GSN
Machine learning techniques for the time-profile modeling sub-paradigm
Development of more sophisticated distributions
25. Thank you! Any questions?
BATG
Reza FARRAHI MOGHADDAM, Eng., Ph.D., MIEEE, Research Associate: imriss@ieee.org, rfarrahi@synchromedia.ca
Fereydoun FARRAHI MOGHADDAM, Eng., M.Sc., MIEEE, PhD Student: farrahi@ieee.org, ffarrahi@synchromedia.ca
Vahid ASGHARI, Eng., Ph.D., MIEEE, Postdoctoral Fellow: vahid@emt.inrs.ca
Mohamed CHERIET, Eng., Ph.D., SMIEEE, Director of Synchromedia Lab: mohamed.cheriet@etsmtl.ca
http://www.synchromedia.ca/
NSERC