Cognitive Behavior Analysis framework for Fault Prediction in Cloud Computing

(NoF’12, Nov 21st-23rd, 2012, Tunis, Tunisia)

Reza FARRAHI MOGHADDAM, Fereydoun FARRAHI MOGHADDAM,
Vahid ASGHARI, Mohamed CHERIET

Synchromedia Lab, ETS, University of Quebec, Montreal, Quebec, Canada

Laboratory for Multimedia
Communication in Telepresence

Outline

 Motivation for Behavior Analysis (BA) and
Failure Prediction
 Proposed BA framework
 Probabilistic Behavior Analysis
 Simulated Probabilistic Behavior Analysis
 Behavior-Time Profile Modeling and Analysis

 Scalability of the Proposed BA framework
 Conclusions and Future Prospects

11/23/2012 NoF’12 2

Why Behavior Analysis (BA)?
 Benefits of BA for Failure Prediction
 Preventing Service-Layer or System-Level failures
 Enabling operation in “unallowable” states to save
energy and cost, and also to reduce footprint
 Profiling the Actors
 Profiling end users, service providers, and other
actors in a computing business (for example, a
telecom business)
 The ensemble of these actors resembles an
ecosystem more than a system
 Profiling helps in:
• Smart management of resources
• Building reputation and trust for actors
• Identifying and isolating misbehaving actors and threats

Why Failure Prediction?
A new failure source: Cyclic ElastoPlastic Operation (CEPO)

[Figure: sources of failure. Cyclic elastoplastic operation is shown alongside the hardware, software, human, middleware, and other factors.]


Cyclic elastoplastic operation (CEPO):
in Civil and Mechanical Engineering

 Safe operation in plastic mode
 Repeatable transitions between elastic and
plastic modes
 Cyclic operation is the key
[Figure: elastic regime, plastic regime, and collapse point]


Cyclic elastoplastic operation (CEPO):
its counterparts in Computing Systems

Carbon Enabling Effect and Green Push: Doing more with less
1. PUE of data centers
Increasing inlet air flow temperature (2-4% energy saving per 1°C increase)
For example: PUE = 1.5, 20% saving (5°C) → PUE = 1.2
Reducing or eliminating fans
Failure at component level (servers) increases with temperature (ASHRAE TC 9.9, 2011)
Failure Prediction and Behavior Analysis can isolate component-level failures
(even before they occur) in order to prevent system-level failures (which
may violate SLO constraints)
Again, cyclic operation is the key to success
2. Could the same approach be applied to bandwidth?
[Figure: stress on system versus inlet temperature, marking the allowable elastic range, the elastic mode, the plastic mode, and the bearable stress level. Annotation: uncertainty increases with the length of stay in the plastic mode.]
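The PUE example above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that the quoted saving applies to total facility energy, that the IT load stays constant, and that the per-degree saving adds up linearly; the function name `pue_after_saving` is hypothetical.

```python
def pue_after_saving(pue, saving_per_degree, delta_t_celsius):
    """Estimate the PUE after raising the inlet temperature.

    Assumptions (not stated on the slide): the saving is a fraction of
    total facility energy, IT energy is unchanged, and the per-degree
    savings add linearly.
    """
    total_saving = saving_per_degree * delta_t_celsius
    return pue * (1.0 - total_saving)


# Upper end of the quoted 2-4% range, with a 5 degree C increase (20% saving)
new_pue = pue_after_saving(1.5, 0.04, 5)
print(round(new_pue, 2))
```

With the lower end of the range (2% per degree), the same 5°C increase would give a 10% saving instead, so the 20% figure corresponds to the optimistic case.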

The Proposed BA framework

 An Ensemble-of-Experts approach:
 The sub-paradigms
• Probabilistic Behavior Analysis
• Simulated Probabilistic Behavior Analysis
• Behavior-Time Profile Modeling and Analysis
 Two different pictures:
 Systemic picture
 Ecosystemic picture


BA Framework:
Systemic picture


BA Framework:
Ecosystemic picture


Multiple layers in
BA framework

Layers vs (physical and non-physical) location: Toward Location
Intelligence in Computing systems

Various layers
 Hardware (Compute/Network)
 Hardware Drivers/Software
 Middleware/Protocols
 Virtualware
 Virtualware Drivers/Software
 Applications (Software)


Sub-paradigm 1:
Probabilistic Behavior Analysis
 Each layer of the system is considered as a graph
 Sub-graphs constitute super-components of
higher levels (vertical scaling)
 The behavior is modeled as the Probability of Availability (PoA)

 The PoA is related to CDF of failure:

 The Differential Density Function (DDF):
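A plausible reading of the two relations named above, assuming that the PoA is the complement of the failure CDF and that the DDF is the time derivative of that CDF (both are assumptions, not confirmed by the source), is:

```latex
\mathrm{PoA}(t) = 1 - \mathrm{CDF}(t),
\qquad
\mathrm{DDF}(t) = \frac{d\,\mathrm{CDF}(t)}{dt} = -\frac{d\,\mathrm{PoA}(t)}{dt}
```

Under this reading, the PoA starts at 1 and decreases toward 0 as the CDF of failure grows.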


Sub-paradigm 1:
 An example of a 2-component system:
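A minimal sketch of how the failure CDFs of two components might be combined at a given time, assuming independent failures and the standard series and parallel reliability rules. This is not necessarily the topology used in the original example, and the function names are hypothetical.

```python
def series_cdf(f1, f2):
    """Failure CDF of two components in series.

    The system fails as soon as either component fails, so the
    availabilities (1 - CDF) multiply. Assumes independent failures.
    """
    return 1.0 - (1.0 - f1) * (1.0 - f2)


def parallel_cdf(f1, f2):
    """Failure CDF of two redundant components.

    The system fails only when both components have failed, so the
    failure probabilities multiply. Assumes independent failures.
    """
    return f1 * f2


# Illustrative failure probabilities of the two components at some time t
f1, f2 = 0.10, 0.20
print(series_cdf(f1, f2), parallel_cdf(f1, f2))
```

The series value is always at least as large as either component's CDF, and the parallel value is never larger than either, which is why redundancy raises the PoA of the super-component.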


Sub-paradigm 1:
Tanh distribution
[Figure: two panels, Tanh CDFs and Tanh DDFs]


Sub-paradigm 1:

Lanl05 database
 Duration: 9 years
 Retrieved from FTA (Failure Trace Archive)

Lanl05 database statistics
 Availability statistics:
 19874 records
 mean = 1777.99 (hrs)
 std = 3462.33
 Skewness = 3.09
 GoF p-value (Tanh) = 0.500
 GoF p-value (Weibull) = 0.416
 Unavailability statistics:
 mean = 5.88 (hrs)
 std = 78.39
 Skewness = 43.96

Sub-paradigm 2:
Simulated Probabilistic Behavior Analysis

 For highly complex system topologies, the CDFs of
high-level sub-graphs and components are estimated
by simulation, based on the CDFs of basic components
 It can also be used to validate the calculations of the
first sub-paradigm
 A Monte Carlo strategy is used
 In each run, the fault time of each basic component is
drawn at random according to its CDF
 The cumulative behavior of the high-level sub-graph
over all runs is used to estimate its CDF
 1000-run simulations have been used
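The Monte Carlo procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two basic components, their Weibull parameters, and the names `simulate_cdf` and `components` are hypothetical, and a two-component sub-graph is combined with `min` (series) or `max` (parallel).

```python
import random


def simulate_cdf(samplers, combine, t_grid, runs=1000, seed=0):
    """Estimate the failure CDF of a sub-graph by Monte Carlo simulation.

    samplers: one function per basic component; each takes a random
              generator and returns a random fault time for that component.
    combine:  maps the list of component fault times to the fault time
              of the sub-graph (for example, min for a series topology).
    """
    rng = random.Random(seed)
    fault_times = []
    for _ in range(runs):
        times = [draw(rng) for draw in samplers]
        fault_times.append(combine(times))
    # Empirical CDF: fraction of runs in which the sub-graph failed by time t
    return [sum(ft <= t for ft in fault_times) / runs for t in t_grid]


# Two hypothetical basic components with Weibull fault times
# (scale in hours, shape); these values are illustrative only.
components = [
    lambda rng: rng.weibullvariate(1000.0, 1.5),
    lambda rng: rng.weibullvariate(2000.0, 0.8),
]
t_grid = [100.0, 500.0, 1000.0, 5000.0]

series = simulate_cdf(components, min, t_grid)    # fails at the first fault
parallel = simulate_cdf(components, max, t_grid)  # fails when both have failed
print(series)
print(parallel)
```

Because both calls use the same seed, they share the same component fault times, so the series estimate is never below the parallel estimate at any time point. For simple topologies such as these, the estimates can be compared against the closed-form combinations of the first sub-paradigm.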


Sub-paradigm 2:

[Figure: two panels, MC simulation of G_1,1 and MC simulation of G_2,1]


Sub-paradigm 2:

[Figure: two panels, MC simulation CDFs and MC simulation DDFs]


Sub-paradigm 3:
Behavior-Time Profile Modeling and Analysis

 Time profiles of component characteristics are collected
by opportunistic agents across the system (or
ecosystem)
 Time profiles of state transitions in components, and
also in higher-level sub-graphs at various layers, are
collected or injected by the BSU
 Machine learning methods are used to match the
state transitions with the characteristics
 Support Vector Machine (SVM)
 Bayesian networks
 Agent-based data mining
 Fuzzy logic
 ···

Sub-paradigm 3:

 Four motivations for behavior-time profile
analysis:
 Spontaneous faults have decreased significantly
relative to cause-and-effect faults
• Fewer purely hardware-caused faults compared to interaction-caused faults
 Patterns and cycles in fault occurrence and, more
generally, in behavior
 Handling of faulty systems that do not have any
faulty components
• Context-sensitive diagnosis [Lamperti2011]
 Handling of gradual events


Sub-paradigm 3:

A simple example:


SLA and Service Grading
 Even without considering the elastoplastic use case, BA can help in
upgrading a service (for example, to telco grade)
 Probability of Availability (PoA): Lease-based business models
 Predicting, isolating, and resolving failure events at the component or
sub-system level before they reach the Service Layer
 Probability of Completion (PoC): Task-based business models
 Countermeasure options:
 Take high-risk components out of service (maintenance tickets)
 Temporal redundancy
 However, all of this depends on the ability to predict high risk or failure

 An example:
 No BA: Major fault mode with MTBF = 10 weeks, MTTR = 10
minutes → 52:09 minutes of downtime a year < 52:33 → four nines
 With BA: 90% of faults are detected 15 minutes before system
failure → 5:13 minutes of downtime a year < 5:15 → five nines
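The downtime figures in this example can be reproduced with a short calculation. This is a sketch under assumptions that are not stated on the slide: a 365-day year, a fault rate of one fault per MTBF, and no downtime at all for faults that are detected in advance. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def yearly_downtime_minutes(mtbf_minutes, mttr_minutes, detected_fraction=0.0):
    """Expected downtime per year, in minutes.

    Assumption: a fault that is detected in advance is handled before it
    reaches the service layer and therefore causes no downtime.
    """
    faults_per_year = MINUTES_PER_YEAR / mtbf_minutes
    return faults_per_year * (1.0 - detected_fraction) * mttr_minutes


def as_min_sec(minutes):
    """Format a duration in minutes as m:ss, rounded to the nearest second."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


mtbf = 10 * 7 * 24 * 60  # 10 weeks, in minutes
no_ba = yearly_downtime_minutes(mtbf, 10)
with_ba = yearly_downtime_minutes(mtbf, 10, detected_fraction=0.9)

four_nines_budget = MINUTES_PER_YEAR * 1e-4  # about 52.56 minutes a year
five_nines_budget = MINUTES_PER_YEAR * 1e-5  # about 5.26 minutes a year

print(as_min_sec(no_ba), no_ba < four_nines_budget)      # 52:09 True
print(as_min_sec(with_ba), with_ba < five_nines_budget)  # 5:13 True
```

The margins are narrow in both cases (roughly 25 seconds a year without BA and 2 to 3 seconds with BA), so the grade reached is sensitive to the assumed detection rate.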


Countermeasures and
cost savings

An example: Full system
Two alternative modes to save both energy (cost) and life
expectancy of components


Scalability

[Figure: two panels, horizontal and vertical scaling, and federated scaling]


Conclusions and Future
Prospects
 A multi-paradigm, multi-layer, multi-level cognitive behavior analysis
framework is introduced
 Three sub-paradigms (cross-cover):
 Statistical inference
 Statistical inference by means of simulation
 Time-profile modeling and analysis
 Multiple granularity analysis and scalability:
 Horizontal, vertical and hierarchical scaling
 Including other layers in the analysis: virtualware and middleware
 Estimation of PoA to improve system dependability and its service grade
 A new distribution is introduced: the Tanh distribution
 Validated on a real database: the lanl05 database
 Future Prospects:
 Large-scale operation of each sub-paradigm
 Cognitive Response: Multi-Expert Decision Making, Cognitive Models
 Integration of the framework with real computing systems:
• OpenStack, Open GSN
 Machine learning techniques for the time-profile modeling sub-paradigm
 Development of more sophisticated distributions


Thank you. Any questions?
BATG

Reza FARRAHI MOGHADDAM, Eng., Ph.D., MIEEE
imriss@ieee.org, rfarrahi@synchromedia.ca
Research Associate

Fereydoun FARRAHI MOGHADDAM, Eng., M.Sc., MIEEE
farrahi@ieee.org, ffarrahi@synchromedia.ca
PhD Student

Vahid ASGHARI, Eng., Ph.D., MIEEE
vahid@emt.inrs.ca
Postdoctoral Fellow

Mohamed CHERIET, Eng., Ph.D., SMIEEE
mohamed.cheriet@etsmtl.ca
Director of Synchromedia Lab

http://www.synchromedia.ca/
NSERC
