NICTA Copyright 2012 From imagination to impact
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications
Dr. Liming Zhu
Liming.Zhu@nicta.com.au
Principal Researcher, NICTA/UNSW
April, 2014 at Berkeley AMPLab
Outline
• Dependable Cloud Operation
• Approach: Process-Oriented Dependability (POD)
– POD-Diagnosis
– Undo/Recovery Planning using AI Planning
– Modeling and Analysis using DTMC
• Connections with AMPLab BDAS
Dependable Cloud Operation: Motivation
• Sporadic operations cause most outages
– Deployment, reconfiguration, (rolling) upgrade, rollback…
• as opposed to normal operations
– DevOps-related: continuous integration/deploy/delivery
• Etsy.com: 25 full deployments per day at 10 commits per deploy
– Other drivers: resource sharing, micro services/partition migration, backup/recovery, auto-mitigation itself…
• Limited control & visibility during sporadic operation
– Heavy reliance on Cloud APIs
– Limited visibility and exception handling capabilities
Dependable Cloud Operation: Challenges
• Our Context
– Large-scale web/enterprise operation in Cloud
– Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors occurring during a sporadic cloud operation
• Challenges
1. Anomaly detection during sporadic operations
2. Undo/Recovery planning
3. Modelling and analysis of sporadic operation
Sporadic Operation Example: Rolling Upgrade
[Figure: rolling upgrade flowchart. Steps: Start; Update Auto-Scaling Group (ASG); Sort Instances; Remove & Deregister Old Instances from ELB; Terminate Old Instances; Wait for ASG to Start New Instances; Register New Instances with ELB; Stop]
- 100 servers in the cloud running version 1 software
- Upgrade 10 servers at a time to version 2 software
- No downtime or redundancy cost
- May take a long time to complete, with errors occurring during the operation and interference from other concurrent operations
Challenge 1: Anomaly Detection
• Traditional anomaly-based error detection is designed for “normal operation”
– significant false positives OR all monitoring disabled during sporadic operation
• Continuous changes to the production systems
– From changes every few months during scheduled downtime to changes every few hours at any time
– Multiple operations at the same time
• Quality of automation scripts + human
– fully testing the operation (scripts + human) in an uncertain cloud environment is very difficult
Our Approach: Use Process Context
• Offline: treat an operation as a process
– Process discovered automatically from logs/scripts
• Clustering of log lines and process mining
– Intermediary step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states…
– Errors detected by examining logs and monitoring data
• Assertion evaluation integrated with monitoring facilities
• Compliance checking against expected processes using logs
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L. Zhu, et al., "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications," 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
Example: Rolling Upgrade Using Asgard
[Figure: offline and online views of the approach. Components: Operator, Asgard, Log data, Process Mining Service, Discovered Model (Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance, Evaluate AMI). Edge labels: Controls, Generates, Read by, Outputs]
Process Mining Service: how it works
• Process Mining: Discovery
1. Collect the logs (using Logstash)
2. Filter the logs
3. Calculate the string distance (Levenshtein distance) between each pair of log lines
4. Cluster the log lines
5. Inspect the dendrogram to decide on a threshold
6. Name and combine clusters
7. Derive regular expressions for the clusters
8. Classify the log lines using the regular expressions and cluster names
9. Import the altered log into process mining tools
10. Apply different process discovery algorithms
11. If anything requires changes, go back to the respective step and redo from there
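The distance and clustering steps (3 to 5) can be sketched as follows. This is a minimal illustration, not the service's implementation: the log lines are invented, the distance is normalised by the longer line's length (an assumption), and a fixed threshold with single-linkage merging stands in for reading the cut-off from a dendrogram.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cluster_log_lines(lines, threshold):
    """Single-linkage clustering: lines end up together when a chain of
    pairwise normalised distances <= threshold connects them."""
    parent = list(range(len(lines)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(lines)), 2):
        longest = max(len(lines[i]), len(lines[j]), 1)
        if levenshtein(lines[i], lines[j]) / longest <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for idx, line in enumerate(lines):
        clusters.setdefault(find(idx), []).append(line)
    return list(clusters.values())


# Hypothetical log lines, invented for illustration only.
logs = [
    "Remove instance i-0001 from ASG",
    "Remove instance i-0002 from ASG",
    "Terminate instance i-0001",
    "Terminate instance i-0002",
]
clusters = cluster_log_lines(logs, threshold=0.2)
```

Each resulting cluster would then be named and turned into a regular expression (steps 6 and 7), which this sketch leaves out.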
POD-Detection: Error Detection
The Error Detection Service has two methods for detecting errors:
• Assertion Checking
• Conformance Checking
Assertion Checking: how it works
Each log line triggers the assertions for the step it reports:
• Log line "Remove ...": i has been de-registered from ELB; i has been removed from ASG; there is one fewer instance of v1
• Log line "Terminate ...": i successfully terminated
• Log line "Wait ...": next log line should appear within 17m35s (95th percentile)
• Log line "New instance ...": i' successfully launched
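A minimal sketch of how log-triggered assertions might be evaluated. The `CloudState` class, the log line format, and the instance ids are all hypothetical stand-ins for real cloud API queries. For simplicity the 17m35s bound is applied to every gap between log lines, whereas the slide attaches it to the "Wait" step only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CloudState:
    """Hypothetical snapshot of monitored resources (stand-in for cloud API calls)."""
    elb_instances: set = field(default_factory=set)
    asg_instances: set = field(default_factory=set)
    terminated: set = field(default_factory=set)


# Per step: a regex on the log line and (description, predicate) pairs to check.
ASSERTIONS = [
    (re.compile(r"^Remove (?P<i>\S+)"), [
        ("de-registered from ELB", lambda s, i: i not in s.elb_instances),
        ("removed from ASG", lambda s, i: i not in s.asg_instances),
    ]),
    (re.compile(r"^Terminate (?P<i>\S+)"), [
        ("successfully terminated", lambda s, i: i in s.terminated),
    ]),
]

MAX_GAP_SECONDS = 17 * 60 + 35  # the 17m35s bound quoted on the slide


def check_log_line(line, state, seconds_since_previous=0.0):
    """Return the list of violated assertions for one log line."""
    failures = []
    if seconds_since_previous > MAX_GAP_SECONDS:
        failures.append("log line arrived later than expected")
    for pattern, checks in ASSERTIONS:
        match = pattern.match(line)
        if match:
            instance = match.group("i")
            failures += [f"{instance}: not {desc}"
                         for desc, ok in checks if not ok(state, instance)]
    return failures


# i-1 has left the ELB but is still in the ASG and has not been terminated.
state = CloudState(elb_instances={"i-2"}, asg_instances={"i-1", "i-2"})
remove_failures = check_log_line("Remove i-1", state)
terminate_failures = check_log_line("Terminate i-1", state,
                                    seconds_since_previous=1200.0)
```

Here `remove_failures` reports only the ASG assertion, while `terminate_failures` reports both the late arrival and the failed termination.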
Conformance Checking: how it works
Log lines are checked against the expected process as they arrive:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ...??? (unexpected at this point, so it is flagged as non-conforming)
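The replay shown above can be sketched with a table of allowed successors. The `MODEL` below is hand-written for illustration and is an assumption; in the approach described, the model would come from process discovery and would typically be richer than a successor table.

```python
# Allowed successors per activity: a hand-written stand-in for a discovered model.
MODEL = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}


def first_deviation(activities, model=MODEL):
    """Replay activities on the model.

    Returns (index, activity) for the first move the model does not allow,
    or None when the whole trace conforms.
    """
    current = "START"
    for index, activity in enumerate(activities):
        if activity not in model.get(current, set()):
            return index, activity
        current = activity
    return None


trace = ["Remove", "Terminate", "Wait", "Terminate"]
deviation = first_deviation(trace)
```

For the trace above, the deviation is reported at the fourth log line, the second "Terminate".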
POD-Diagnosis: how it works
• Fault trees are built as a knowledge base
• Process context is used for fault tree pruning
• On-demand diagnosis tests locate the (root) causes
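A minimal sketch of pruning a fault tree with process context and running diagnostic tests only on the remaining leaves. The node names, the `steps` annotation, and the test callbacks are hypothetical; the actual fault trees and tests are not shown on the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    steps: Set[str]                            # process steps where this fault can occur
    test: Optional[Callable[[], bool]] = None  # on-demand test; True means fault confirmed
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node, step):
    """Return names of confirmed leaf causes, skipping subtrees that are
    irrelevant to the current process step."""
    if step not in node.steps:
        return []  # pruned using process context, so its tests never run
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes = []
    for child in node.children:
        causes += diagnose(child, step)
    return causes


calls = []  # records which diagnostic tests were actually executed


def make_test(label, result):
    def run():
        calls.append(label)
        return result
    return run


# Hypothetical tree for the error "new instance not launched".
tree = FaultNode("new instance not launched", {"Wait", "Register"}, children=[
    FaultNode("AMI unavailable", {"Wait"}, make_test("ami", True)),
    FaultNode("key pair missing", {"Wait"}, make_test("key", False)),
    FaultNode("ELB health check misconfigured", {"Register"}, make_test("elb", True)),
])
causes = diagnose(tree, "Wait")
```

Because the error was detected during the "Wait" step, the ELB branch is pruned and its test is never executed.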
Evaluation: POD-Detection/Diagnosis
• Experiments
– Rolling upgrade of 100+ node cluster in AWS
• Fault injection + confounding processes: random kill, scaling-in…
• Detected errors
– Assertion checking: known errors and global errors
• Examples: key management, launch configuration, images…
– Compliance checking: unknown errors
• skipping activities or undone activities
• Time and precision
– Compared with Asgard/Monitoring internal mechanisms
• Detected more errors earlier
– Diagnosis: limited to known causes in the fault tree
• 95th percentile less than 4 s; accuracy ranges from 80% to 100%
Other Related Research
Challenges
1. Anomaly detection during sporadic operations
2. Undo/Recovery planning
3. Modelling and analysis of sporadic operation
Challenge 2: Undo/Recovery Planning
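[Figure: state diagram in which a certain step leads from state S1 toward S2 or to an error state Serr, with previous states S0, S-i, … Recovery options shown: Reparation, Compensation Undo, Parameterizable Redo, Alternative, Checkpoint-based Undo]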
Undo/Undoability Approach in a Nutshell
• Goal: undo support for “indirect control” setting
– Problem 1: some actions are irreversible, e.g., delete
– Problem 2: undo ≠ copy back previous state of memory
• Have to call the right actions on the right resources in the right order
– Problem 3: partly irreversible operations, e.g. on Amazon WS:
• Stopping a machine disassociates an elastic IP address (if any), and releases internal IP / public DNS
• Starting the machine isn't undo: elastic IP is dangling, internal IP / public DNS / timestamps are different
• Solution components:
– Replace “do” with “pseudo-do”
– Undo System based on AI Planning
• Outcome: sequence of undo actions
– Undoability Checking:
• Is the operation I'm about to execute undoable?
• Learn which aspects can be fully undone for each operation (whole domain)
• If not, can we abstract / change so that undoability is given?
– Projection (of a domain)
Ingo Weber et al., "Supporting undoability in systems operations," in USENIX LISA'13: Large Installation System Administration Conference, Washington, DC, USA, November 2013.
Undoability Checking Approach
[Figure: undoability checking workflow]
• Tool provider defines: full domain model (e.g., AWS)
• Tool user (e.g., sys admin) defines: operation(s) to execute (e.g., script, command); resources and properties required to be undoable
• Undoability Checker:
1. Generate projection specification
2. Apply projection to generate the projected domain model
3. Per operation: generate pre- and post-states
4. Check undoability per pre-post state pair (for each pair: call AI Planner)
• Result, fed back to the tool user: undoability (yes/no); list of causes if not undoable
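The per-pair check can be sketched on a toy domain. A breadth-first search stands in for the AI planner, and the three-field state and the action definitions are invented simplifications of the stop/start behaviour described under Problem 3, not a real AWS model. The `relevant` parameter plays the role of the projection: fields left out of it are not required to be restored.

```python
from collections import deque

# Toy state: (running, elastic_ip_attached, private_ip),
# where private_ip is "original", "new" or None.


def stop(state):
    running, eip, ip = state
    return (False, False, None) if running else None  # drops elastic IP and private IP


def start(state):
    running, eip, ip = state
    return (True, eip, "new") if not running else None  # a fresh private IP is assigned


def associate_eip(state):
    running, eip, ip = state
    return (True, True, ip) if running and not eip else None


ACTIONS = {"stop": stop, "start": start, "associate_eip": associate_eip}


def find_undo_plan(post_state, pre_state, relevant=(0, 1, 2)):
    """Search for actions leading from post_state back to pre_state,
    comparing only the state fields listed in `relevant`.
    Returns a list of action names, or None if no undo plan exists."""
    def matches(state):
        return all(state[i] == pre_state[i] for i in relevant)

    queue = deque([(post_state, [])])
    seen = {post_state}
    while queue:
        state, plan = queue.popleft()
        if matches(state):
            return plan
        for name, action in ACTIONS.items():
            nxt = action(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None


pre = (True, True, "original")
post = stop(pre)
full_plan = find_undo_plan(post, pre)                         # private IP must be restored
projected_plan = find_undo_plan(post, pre, relevant=(0, 1))   # private IP projected away
```

In the full domain no plan exists, because the original private IP can never be restored. Once that field is projected away, stopping becomes undoable by starting the machine and re-associating the elastic IP.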
Challenge 3: Modeling and Analysis
• Approach: Model as stochastic processes
– Discrete-time/continuous-time Markov chain (DTMC/CTMC)
• Forward states: net successful operations
• Backward states: failure or deliberate rollback/undo
• A family of g-k chains with different parameters
– g: rolling-upgrade wave granularity; k: number of failures/rollbacks per wave
Daniel Sun, L. Zhu, et al., "Understanding Rolling Upgrade," 33rd International Symposium on Reliable Distributed Systems (SRDS), 2014 (submitted)
Model used for
Predictions
- e.g. completion time, failure rate impact
Optimization and Decision Problems
- e.g. when to activate new versions to guarantee 99.99% success
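As an illustration of the completion-time prediction, the sketch below computes the expected number of wave attempts in a deliberately simplified DTMC. It is not the g-k chain from the paper: the state is the number of completed waves, each attempt moves one wave forward with probability `p_success` and otherwise one wave back (staying put at zero), and the parameter k is not modelled. The function and parameter names are assumptions.

```python
import math


def expected_steps(n_instances, g, p_success):
    """Expected number of wave attempts until every wave has completed.

    Uses the hitting-time recurrence d_i = (1 + q * d_{i-1}) / p with d_{-1} = 0,
    where d_i is the expected number of attempts to get from i to i + 1
    completed waves and q = 1 - p.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    waves = math.ceil(n_instances / g)  # g instances are upgraded per wave
    q = 1.0 - p_success
    total = 0.0
    d = 0.0
    for _ in range(waves):
        d = (1.0 + q * d) / p_success
        total += d
    return total


no_failures = expected_steps(100, 10, 1.0)  # 10 waves, each succeeds first time
coin_flip = expected_steps(30, 10, 0.5)     # 3 waves with 50% rollback chance
```

With no failures the result equals the number of waves. With a 50% chance of rolling back, three waves take 12 attempts on average, which shows how quickly failures inflate completion time.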
Connection with AMPLab BDAS
Projects Related to BDAS (1/2)
1. Log/Metrics analysis in POD-Diagnosis
– Currently using Spark/MLBase
– Voluminous log/events into Spark Streaming
2. Dependable deployment/operation of BDAS
– POD applied to Hadoop before, maybe BDAS?
3. Multi-level granularity access for data analytics
– Australian Urban Research Infrastructure Network (AURIN)
• Portal to provide transport-related data to international researchers
• Cluster sharing for in-portal pre-processing and analytics
• de-anonymization concerns and different views for the same data
– Evaluating how BDAS can support this
Projects Related to BDAS (2/2)
Redacted
4. Data scientist workflow and local exploration
5. Distributed machine learning
Team Acknowledgement
• Researchers
– Len Bass
– Alan Fekete
– Anna Liu
– Daniel Sun
– Hiroshi Wada
– Ingo Weber
– Sherry Xu
– Liming Zhu
• Engineers
– Adnene Guabtni
– Chao Li
• Students
– Amer Abdalamer
– Ahmed Alqahtani
– Mostafa Farshchi
– Min Fu
– Jin Li
– Matthew Sladescu
– Donna Xu
– DongYao Wu

POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications

  • 1. NICTA Copyright 2012 From imagination to impact POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications Dr. Liming Zhu Liming.Zhu@nicta.com.au Principal Researcher, NICTA/UNSW April, 2014 at Berkeley AMPLab
  • 2. NICTA Copyright 2012 From imagination to impact Outline • Dependable Cloud Operation • Approach: Process-Oriented Dependability (POD) – POD-Diagnosis – Undo/Recovery Planning using AI Planning – Modeling and Analysis using DTMC • Connections with AMPLab BDAS 2
  • 3. NICTA Copyright 2012 From imagination to impact Dependable Cloud Operation: Motivation • Sporadic operations cause most outages – Deployment, reconfiguration, (rolling) upgrade, rollback… • as opposed to normal operations – DevOps-related: continuous integration/deploy/delivery • Etsy.com: 25 full deployments per day at 10 commits per deploy – Other drivers: resource sharing, micro services/partition migration, backup/recovery, auto-mitigation itself… • Limited control & visibility during sporadic operation – Heavy reliance on Cloud APIs – Limited visibility and exception handling capabilities 3
  • 4. NICTA Copyright 2012 From imagination to impact Dependable Cloud Operation: Challenges • Our Context – Large-scale web/enterprise operation in Cloud – Distributed data analytics in Cloud (Hadoop/Spark) • Goal: detect, diagnose and react to errors occurring during a sporadic cloud operation • Challenges 1. Anomaly detection during sporadic operations 2. Undo/Recovery planning for recovery 3. Modelling and analysis of sporadic operation 4
  • 5. NICTA Copyright 2012 From imagination to impact Sporadic Operation Example: Rolling Upgrade Update Auto-Scaling Group (ASG) Remove & Deregister Old Instances from ELB Wait for ASG to Start New Instances Terminate Old Instances Register New Instances with ELB Sort Instances Stop Start - Have 100 servers in cloud with version 1 software - Upgrade 10 servers at a time to version 2 software - No downtime or redundancy cost - Potentially take a long time to complete with errors during the operation with other interfering operations 5
  • 6. NICTA Copyright 2012 From imagination to impact Challenge 1: Anomaly Detection • Traditional anomaly-based error detection is designed for “normal operation” – significant false positives OR disable all monitoring during sporadic operation • Continuous changes to the production systems – From months at scheduled downtime to hours at all times – Multiple operations at the same time • Quality of automation scripts + human – fully testing the operation (scripts + human) in uncertain cloud environment is very difficult 6
  • 7. NICTA Copyright 2012 From imagination to impact Our Approach: Use Process Context • Offline: treat an operation as a process – Process discovered automatically from logs/scripts • Clustering of log lines and process mining – Intermediary step outcomes specified as assertions • Online: use process context – Process context: process/instance/step ids, expected states… – Errors detected by examining logs and monitoring data • Assertions evaluations integration with monitoring facilities • Compliance checking against expected processes using logs – Detected errors are further diagnosed for (root) causes • Examining a fault tree to locate potential root causes • Performing more diagnostic tests and on-demand assertions X. Xu, L Zhu, et. al. "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014. 7
  • 8. NICTA Copyright 2012 From imagination to impact Example: Rolling Upgrade Using Asgard Read by Operator Process Mining Service Controls Outputs Create SnapshotCheck AZs Create instance from snapshot Create AMI from instance Evaluate AMI Discovered Model Asgard Log dataLog dataGenerates Offline Online 8
  • 9. NICTA Copyright 2012 From imagination to impact Process Mining Service: how it works • Process Mining: Discovery 1. Collect the logs (using Logstash) 2. Filter the logs 3. Calculating string distance (Levenshtein distance) between each pair of log lines 4. Cluster the log lines 5. Look at the dendrogram to decide on threshold 6. Name & combine clusters 7. Derive regular expressions for the clusters 8. Classify the log lines using the regular expressions and cluster names 9. Import altered log into process mining tools 10. Apply different process discovery algorithms 11. If anything requires changes, go back to the respective steps and redo from there 9
  • 10. NICTA Copyright 2012 From imagination to impact POD-Detection: Error Detection Error Detection Service has two methods for detecting errors: • Assertion Checking • Conformance Checking 10
  • 11. Assertion Checking: how it works Log line: Assertions:
  • 12. Assertion Checking: how it works Log line: • Remove ... Assertions: • i has been de-registered from ELB • i has been removed from ASG • there is one fewer instance of v1
  • 13. Assertion Checking: how it works Log line: • Remove ... • Terminate ... Assertions: • i successfully terminated
  • 14. Assertion Checking: how it works Log line: • Remove ... • Terminate ... • Wait ... Assertions: • Next log line should appear within 17m35s (95th percentile)
  • 15. Assertion Checking: how it works Log line: • Remove ... • Terminate ... • Wait ... • New instance ... Assertions: • i' successfully launched
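The assertion-checking walkthrough above can be sketched as a table of log-line patterns, each tied to assertions that are evaluated against monitored state. This is only a sketch under stated assumptions: the log-line formats, the `CloudState` class, and the rule table are hypothetical stand-ins (a real service would query cloud APIs or monitoring facilities); only the 17m35s bound comes from the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CloudState:
    """Hypothetical snapshot of monitored state (stand-in for cloud API calls)."""
    elb_instances: set = field(default_factory=set)
    asg_instances: set = field(default_factory=set)
    terminated: set = field(default_factory=set)


# Each rule: regex over a log line -> list of (description, predicate).
# The log-line formats are assumptions made for this example.
RULES = [
    (re.compile(r"^Remove instance (?P<i>\S+)"), [
        ("de-registered from ELB", lambda s, i: i not in s.elb_instances),
        ("removed from ASG", lambda s, i: i not in s.asg_instances),
    ]),
    (re.compile(r"^Terminate instance (?P<i>\S+)"), [
        ("terminated", lambda s, i: i in s.terminated),
    ]),
]

# 17m35s, the 95th-percentile bound quoted on the slide.
WAIT_LIMIT_SECONDS = 17 * 60 + 35


def check_assertions(log_line, state):
    """Return a description of every assertion that fails for this log line."""
    failures = []
    for pattern, assertions in RULES:
        match = pattern.match(log_line)
        if match:
            instance = match.group("i")
            for description, predicate in assertions:
                if not predicate(state, instance):
                    failures.append(f"{instance} not {description}")
    return failures


def wait_exceeded(elapsed_seconds, limit=WAIT_LIMIT_SECONDS):
    """Timer assertion: has the next log line taken longer than the bound?"""
    return elapsed_seconds > limit


state = CloudState(elb_instances={"i-0b2"}, asg_instances={"i-0a1", "i-0b2"})
failures = check_assertions("Remove instance i-0a1 from ELB", state)
```

In this example the instance has left the ELB but is still in the ASG, so exactly one assertion fails and would be handed on for diagnosis.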
  • 16. Conformance Checking: how it works Log lines:
  • 17. Conformance Checking: how it works Log lines: • Remove ...
  • 18. Conformance Checking: how it works Log lines: • Remove ... • Terminate ...
  • 19. Conformance Checking: how it works Log lines: • Remove ... • Terminate ... • Wait ...
  • 20. Conformance Checking: how it works Log lines: • Remove ... • Terminate ... • Wait ... • Terminate ...???
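The conformance-checking walkthrough above amounts to replaying classified log lines against the discovered process model and flagging the first step the model does not allow. The sketch below is a deliberately simplified illustration: the model is a hypothetical "allowed next activity" table loosely based on the rolling-upgrade example, whereas process mining tools typically replay logs against richer models such as Petri nets.

```python
# Hypothetical process model: activity -> activities allowed to follow it.
MODEL = {
    "Start": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "Stop"},
}


def first_deviation(activities, model=MODEL, start="Start"):
    """Replay the activities in order; return (index, activity) for the
    first step the model does not allow, or None if the trace conforms."""
    current = start
    for index, activity in enumerate(activities):
        if activity not in model.get(current, set()):
            return index, activity
        current = activity
    return None


# The trace from the slides: a second "Terminate" arrives right after "Wait".
deviation = first_deviation(["Remove", "Terminate", "Wait", "Terminate"])
```

Because the check needs no knowledge of specific failure modes, it can flag errors nobody wrote an assertion for, such as skipped or repeated activities.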
  • 21. POD-Diagnosis: how it works • Fault trees are built as a knowledge base • Process context is used for fault tree pruning • On-demand diagnosis tests locate the (root) causes
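A rough sketch of the pruning idea: each fault-tree node records the process steps in which it can occur, so subtrees irrelevant to the current step are skipped and their diagnostic tests never run. The node layout, fault names, step names, and test outcomes below are all hypothetical; the real knowledge base and tests are not described in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    # Process steps in which this fault can occur; None means any step.
    steps: Optional[Set[str]] = None
    # On-demand diagnostic test for leaves; True means the fault is confirmed.
    test: Optional[Callable[[], bool]] = None
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node, current_step):
    """Depth-first walk that prunes subtrees irrelevant to the current
    process step and returns the names of confirmed leaf causes."""
    if node.steps is not None and current_step not in node.steps:
        return []  # pruned: this fault cannot occur in the current step
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes = []
    for child in node.children:
        causes.extend(diagnose(child, current_step))
    return causes


tests_run = []  # records which diagnostic tests were actually executed


def make_test(name, result):
    def run():
        tests_run.append(name)
        return result
    return run


# Hypothetical fault tree for a failed instance launch.
tree = FaultNode("instance launch failed", children=[
    FaultNode("AMI unavailable", steps={"Wait"}, test=make_test("ami", True)),
    FaultNode("key pair missing", steps={"Wait"}, test=make_test("key", False)),
    FaultNode("ELB misconfigured", steps={"Register"}, test=make_test("elb", True)),
])
causes = diagnose(tree, current_step="Wait")
```

Here the "ELB misconfigured" branch is pruned because the error was detected during the "Wait" step, so its test is never executed.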
  • 22. Evaluation: POD-Detection/Diagnosis • Experiments – Rolling upgrade of a 100+ node cluster in AWS • Fault injection + confounding processes: random kill, scaling-in… • Detected errors – Assertion checking: known errors and global errors • Examples: key management, launch configuration, images… – Compliance checking: unknown errors • Skipped activities or undone activities • Time and precision – Compared with Asgard/monitoring internal mechanisms • Detected more errors, and earlier – Diagnosis: limited to known causes in the fault tree • 95th percentile time under 4 s; accuracy ranges from 80% to 100%
  • 23. Evaluation: POD-Detection/Diagnosis
  • 24. Other Related Research Challenges 1. Anomaly detection during sporadic operations 2. Undo/Recovery planning 3. Modelling and analysis of sporadic operation
  • 25. Challenge 2: Undo/Recovery Planning [Diagram labels: states S-i, S0, S1, S2, Serr; previous states; a certain step; recovery options: Reparation, Compensation, Undo, Parameterizable Redo, Alternative, Checkpoint-based Undo]
  • 26. Undo/Undoability Approach in a Nutshell • Goal: undo support for an “indirect control” setting – Problem 1: some actions are irreversible, e.g., delete – Problem 2: undo ≠ copying back the previous state of memory • Have to call the right actions on the right resources in the right order – Problem 3: partly irreversible operations, e.g., on Amazon Web Services: • Stopping a machine disassociates an elastic IP address (if any) and releases the internal IP / public DNS • Starting the machine isn't an undo: the elastic IP is dangling, and the internal IP / public DNS / timestamps are different • Solution components: – Replace “do” with “pseudo-do” – Undo System based on AI Planning • Outcome: sequence of undo actions – Undoability Checking: • Is the operation I'm about to execute undoable? • Learn which aspects can be fully undone for each operation (whole domain) • If not, can we abstract / change so that undoability is given? – Projection (of a domain). Reference: Ingo Weber et al. Supporting undoability in systems operations. In USENIX LISA'13: Large Installation System Administration Conference, Washington, DC, USA, November 2013.
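To make "undo as planning" concrete, the sketch below searches for a sequence of actions that leads from the current state back to a checkpointed state, and reports that no undo exists when the search fails. It is a toy illustration only: breadth-first search stands in for a real AI planner, and the facts and actions (`stop`, `start`, `associate_eip`) are a hypothetical three-action domain modeled on the elastic IP example, not the AWS domain model used in the cited work.

```python
from collections import deque
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts the action makes true
    delete: FrozenSet[str]   # facts the action makes false


def plan_undo(current, checkpoint, actions):
    """Breadth-first search from the current state to the checkpointed
    state. Returns a list of action names, or None if no undo exists."""
    start, target = frozenset(current), frozenset(checkpoint)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for action in actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action.name]))
    return None


# Hypothetical toy domain: stopping a machine also drops its elastic IP.
ACTIONS = [
    Action("stop", frozenset({"running"}), frozenset({"stopped"}),
           frozenset({"running", "eip_attached"})),
    Action("start", frozenset({"stopped"}), frozenset({"running"}),
           frozenset({"stopped"})),
    Action("associate_eip", frozenset({"running"}),
           frozenset({"eip_attached"}), frozenset()),
]

checkpoint = {"running", "eip_attached"}
undo_plan = plan_undo({"stopped"}, checkpoint, ACTIONS)
no_plan = plan_undo({"deleted"}, checkpoint, ACTIONS)
```

The first call shows why starting the machine alone is not an undo: the plan also has to re-associate the elastic IP. The second call returns `None` because nothing in this toy domain can recreate a deleted machine, which is the kind of answer an undoability check would report before the operation is executed.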
  • 27. Undoability Checking Approach [Diagram labels: actors: Tool user (e.g., sys admin), Tool provider; inputs: Operation(s) to execute (e.g., script, command), Resources and properties required to be undoable, Full domain model (e.g., AWS), Projection Specification; Undoability Checker: Apply Projection, Generate Projected domain model, per operation: generate pre- and post-states, check undoability per pre-post state pair (for each pair: call AI Planner); Result/Feedback: Undoability (yes/no), list of causes if not undoable]
  • 28. Challenge 3: Modeling and Analysis • Approach: Model as stochastic processes – Discrete-/Continuous-Time Markov Chain (DTMC/CTMC) • Forward states: net successful operations • Backward states: failure or deliberate rollback/undo • A family of g-k chains with different parameters – g: rolling-upgrade wave granularity; k: number of failures/rollbacks per wave. Reference: Daniel Sun, L. Zhu, et al. "Understanding Rolling Upgrade," 33rd International Symposium on Reliable Distributed Systems (SRDS), 2014 (submitted).
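As a rough illustration of the forward/backward-state idea, the sketch below computes the expected number of attempts for a much simpler chain than the g-k family: each attempt upgrades one more wave with probability `p_success`, and otherwise rolls back one wave (or stays put when nothing has been upgraded yet). This birth-death chain and its parameters are assumptions made for the example; they are not the model from the cited paper.

```python
def expected_attempts(waves: int, p_success: float) -> float:
    """Expected number of attempts to finish `waves` upgrade waves.

    Let h_i be the expected attempts to get from i completed waves to i + 1.
    A failure at state i > 0 falls back to i - 1, so
        h_i = 1 + q * (h_{i-1} + h_i)  =>  h_i = (1 + q * h_{i-1}) / p,
    and at state 0 a failure stays at 0, which gives h_0 = 1 / p
    (the same formula with h_{-1} = 0).
    """
    if waves < 0:
        raise ValueError("waves must be non-negative")
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    q = 1.0 - p_success
    total = 0.0
    h_prev = 0.0
    for _ in range(waves):
        h = (1.0 + q * h_prev) / p_success
        total += h
        h_prev = h
    return total


# Ten waves (e.g., 100 servers upgraded 10 at a time), assumed success rates.
no_failures = expected_attempts(10, 1.0)
with_failures = expected_attempts(10, 0.9)
```

Even this toy chain shows the kind of question such a model can answer: how quickly the expected completion effort grows as the per-wave failure probability rises.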
  • 29. Model used for • Predictions, e.g., completion time, failure rate impact • Optimization and Decision Problems, e.g., when to activate new versions to guarantee 99.99% success
  • 30. Connection with AMPLab BDAS
  • 31. Projects Related to BDAS (1/2) 1. Log/Metrics analysis in POD-Diagnosis – Currently using Spark/MLBase – Voluminous log/events into Spark Streaming 2. Dependable deployment/operation of BDAS – POD applied to Hadoop before, maybe BDAS? 3. Multi-level granularity access for data analytics – Australian Urban Research Infrastructure Network (AURIN) • Portal to provide transport-related data to international researchers • Cluster sharing for in-portal pre-processing and analytics • De-anonymization concerns and different views for the same data – Evaluating how BDAS can support this
  • 32. Projects Related to BDAS (2/2) Redacted 4. Data scientist workflow and local exploration 5. Distributed machine learning
  • 33. Team Acknowledgement • Researchers – Len Bass – Alan Fekete – Anna Liu – Daniel Sun – Hiroshi Wada – Ingo Weber – Sherry Xu – Liming Zhu • Engineers – Adnene Guabtni – Chao Li • Students – Amer Abdalamer – Ahmed Alqahtani – Mostafa Farshchi – Min Fu – Jin Li – Matthew Sladescu – Donna Xu – DongYao Wu