



ENFORCING USER-DEFINED MANAGEMENT LOGIC IN LARGE SCALE SYSTEMS
  Srinath Perera
  Indiana University, Bloomington
2
                    Outline
   Motivation & the Problem
   Related Work
   Proposed Architecture
   Scalability Results
   Robustness
   Contributions
3
    Motivation: Large Scale systems
•   IT is becoming a part of our everyday life
    • Increases size of potential user bases of systems
      (Google, Facebook, Amazon …).
    • Information Avalanche.
    • National, Global scale data collection
    • Success in this setting is decided by our ability to make
      sense of this data – scale matters (Google!).
•   Technological advances
    • Connectivity and SOA make complex systems possible.
    • Computing power everywhere (multicore, smart phones).
    • Cloud lowers the barrier to scale.

     We have the need and means to build large
                  scale systems
4
       Building them is Feasible, but
        Keeping them Running ??
   Changes are a norm rather than an exception – “10,000 servers,
    each having MTTF of thousand days => 10 failures/day” [Jeff Dean].
   High Operational Cost - When a system scales up, complexity
    increases.
     ◦ More than 75% TCO (Total Cost of Ownership) based on Patterson et al.
       data. (Dominated by salaries.)
     ◦ 50% IT budget spent on recovering from failures [Ganek et al.]
   Unreliable Middleware - Grid reliability across all operations is 55%-
    80% [Khalili et al.]. Even at 80% per operation, a service or workflow
    that needs 6 grid operations succeeds with probability 0.8^6 ≈ 0.26 !!!
   Efforts to avoid failures have been unsuccessful – “Not a problem to
    be solved, but a fact to cope with” [Patterson]
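
The two figures quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes servers fail independently with the same MTTF and that grid operations succeed independently of each other.

```java
/** Hypothetical helper; reproduces the slide's arithmetic under independence assumptions. */
final class FailureArithmetic {
    /** Expected failures per day for a fleet of identical, independent servers. */
    static double expectedFailuresPerDay(int servers, double mttfDays) {
        return servers / mttfDays;
    }

    /** Probability that every operation in a chain of independent operations succeeds. */
    static double chainSuccessRate(double perOperationSuccess, int operations) {
        return Math.pow(perOperationSuccess, operations);
    }

    public static void main(String[] args) {
        // 10,000 servers with an MTTF of 1,000 days each.
        System.out.printf("failures/day: %.1f%n", expectedFailuresPerDay(10_000, 1_000));
        // Six grid operations at the best and worst reliability quoted on the slide.
        System.out.printf("6 ops at 80%%: %.3f%n", chainSuccessRate(0.80, 6));
        System.out.printf("6 ops at 55%%: %.3f%n", chainSuccessRate(0.55, 6));
    }
}
```

The 0.26 on the slide corresponds to the optimistic end of the 55%-80% range; at 55% per operation the chain succeeds far less often.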

    System Management is a Potential Solution to
                 this Problem!!
5
6
     Management Framework for
     Large Scale Systems should
   Support user-defined Management Logic
    ◦ Management usecases differ from system to system
    ◦ => only big organizations can afford to build specific frameworks
    ◦ => need user-defined management logic.
    ◦ Ease of authoring management logic is important.
   Scalable
   Robust – changes are a norm rather than an exception!
   Dynamic - resources often join and leave.
Need a dynamic and robust management framework that
        supports user-defined management logic.
The Problem                                     7
    Large scale systems need many managers
     ◦ A single manager neither scales nor is robust
    Each manager has a Partial view of the system
     ◦ a subset of resources are assigned to each manager
    But a Global view is Preferred (ease of authoring logic)
     ◦ Logic that works only on local data must rely on emergent properties, which
       are hard for users to author.
     ◦ We all think in terms of global properties.
    Example : “If the system does not have 5 message brokers,
     create new brokers and connect them to the broker
     network.” : detect <5 brokers, find the best place to create
     new one, create new one, and connect it to existing brokers.
    Problem: How can user-defined management logic
        (that depends on a global view) be enforced on large-scale
      systems, and how can such a framework be applied to
                      manage systems?
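
To make the broker example concrete, here is a minimal sketch of how such logic reads when it is written against a global view. Everything in it is hypothetical: the class name, the string-based actions, and in particular the placement policy (pick the host running the fewest services), which the slide leaves open as "find the best place".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical planner for the "keep 5 message brokers" example. */
final class BrokerRuleSketch {
    /** Returns the actions needed to bring the broker network up to the required size. */
    static List<String> plan(List<String> liveBrokers, Map<String, Integer> servicesPerHost, int required) {
        List<String> actions = new ArrayList<>();
        List<String> network = new ArrayList<>(liveBrokers);
        // TreeMap keeps iteration order stable, so ties are broken by host name.
        Map<String, Integer> load = new TreeMap<>(servicesPerHost);
        while (network.size() < required && !load.isEmpty()) {
            // Assumed placement policy: the host currently running the fewest services.
            String host = null;
            for (Map.Entry<String, Integer> e : load.entrySet()) {
                if (host == null || e.getValue() < load.get(host)) {
                    host = e.getKey();
                }
            }
            String broker = "broker@" + host + "#" + network.size();
            actions.add("create " + broker);
            if (!network.isEmpty()) {
                // Connect the new broker to an existing member of the broker network.
                actions.add("connect " + broker + " -> " + network.get(0));
            }
            network.add(broker);
            load.merge(host, 1, Integer::sum);
        }
        return actions;
    }
}
```

The point of the sketch is that the whole decision (detect, place, create, connect) fits in one function only because the full list of brokers and hosts is visible in one place.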
8
                 Related Work
   Systems without Global Control
    ◦ Centralized management systems (e.g. Rainbow)
    ◦ Managers that act independently (e.g. Extreme (Kx),
      DREAM), and manual coordination (e.g. IBM Tivoli).
   Systems with Global Control
    ◦ Decentralized control - DMonA and Deugo et al.
    ◦ Monitor and run a State Machine of the system - Dubey
      et al.
    ◦ Consistent Shared View - Georgiadis et al., component
      Managers collaborate via total ordered multicast to
      maintain a system according to architectural constraints.
9
            Related Work(Contd.)
   Systems with Global Control (Contd.)
    ◦ Management Hierarchy
       Management hierarchy where the topmost layer is replicated
        (E.g. Monalisa ,Gadgil et al.).
       Typically Aggregation is used at each level.
       Aggregation hides information about a single resource.
    ◦ Hierarchy with Policies
       WildCat - agent-group-based hierarchy that communicates via
        whiteboards and uses policies to control agents. Its authors
        raise concerns about the scalability of whiteboards.
    ◦ Cooperating Managers - No Global control loop
       Schoenwaelder - a group of cooperating agents and a master
        agent (IP multi-cast)
       ANDREA - create dynamic Hierarchies, delegate tasks to
        other managers via delegate statements in the management
        logic.
10
Approach                          Scalable  Robust    Ease of writing    Problems
                                                      management logic
Decentralized control             Highly    Yes       Hard               Hard for users to write rules to
(e.g. DMonA and Deugo et al.)                                            achieve emergent behavior
Complex event processing (DREAM)  Yes       Possible  Not easy           Event model has limited memory
Consistent view across managers   No        Yes       Yes                Needs ordered reliable multicast,
(e.g. Georgiadis et al.)                                                 which does not scale
Hierarchical control with         Highly    Possible  Not easy           Loses the identity of a single
aggregation (Monalisa)                                                   resource due to aggregation
Hierarchy with policies at each   Yes       Possible  Possible           Policies are not as explicit as
level (e.g. WildCat)                                                     rules
State machine (Dubey et al.)      Yes       Possible  Not easy           Users have to construct this state
                                                                         machine, which
11
       Outline of the Evidence
   Solution: Hasthi Architecture
   Useful
    ◦ Application to a Large-Scale E-Science Project (LEAD)
   Sound
    ◦ Scalable (Empirical results)
    ◦ Robust and Dynamic (Proof + Empirical results)
   Main Contribution
      “Proposing, implementing, and analyzing a
      dynamic and robust management architecture,
      which can manage large-scale systems by
      enforcing user-defined management logic that
      depend on a global view of the managed system
      state, and application of the management logic to
      manage systems.”
12
                 Big Picture (Hasthi)




   Hasthi Has three Parts
        Manager Cloud – distributed architecture that binds managers
         and resources in the system as one cohesive unit.
        Meta-Model that represents the system state.
        Decision Framework.
Manager Cloud                               13




   Managers form a P2P network (Pastry), which is used for
    Initialization and Recovery (Elections).
   Normal Operations use SOAP over HTTP
14
Meta-Model                                       15




   Meta-model represents the monitoring data collected from the system.
    Summarized meta-model provides a global view.
   Delta-consistency – changes are reflected within a bounded time (a
    concept borrowed from shared memory multiprocessors [see Singla et
    al.]).
Decision Framework                                         16




   Users define management logic as rules: Local and Global.
   Manager control loops evaluate partial meta-models using local rules.
   The coordinator control loop evaluates the summarized meta-models
    using global rules (Global view).
   Actions triggered by rules analyze the meta-model and decide on solutions.
Management Rules                                             17
      Rules (Drools) evaluate meta-objects (which represent resources) and
       execute actions, which analyze meta-objects and decide on solutions.
rule "RestartFailedServices"
when
    // A managed service that has crashed...
    service : ManagedService(state == "CrashedState");
    // ...on a host that is itself still alive.
    host : Host(state != "CrashedState", service.host == name);
then
    // Try a restart; if it fails, mark the service and ask the user for help.
    system.invoke(new RestartAction(service),
        new ActionCallback() {
            public void actionSucessful(ManagementAction action) { ..... }
            public void actionFailed(ManagementAction action, Throwable e) {
                service.setState("UnRepairableState");
                system.invoke(
                    new UserInteractionAction(system, service, action, e));
            }
        });
end

      When the condition given using the object query language is met,
       actions in the then-clause are carried out.
      Drools uses the Rete algorithm to evaluate meta-objects and execute corrective
       actions; Rete trades memory for evaluation speed.
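
The space-for-time tradeoff can be illustrated without a rule engine. The sketch below is not a Rete network and is not how Drools is implemented; it is a hypothetical, single-condition matcher that remembers earlier results (space) so that only resources whose state changed are re-evaluated (time).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical incremental matcher for the condition state == "CrashedState". */
final class IncrementalMatcher {
    private final Map<String, String> lastState = new HashMap<>(); // remembered facts
    private final Set<String> crashed = new HashSet<>();           // remembered matches
    private int evaluations = 0;

    /** Records a heartbeat; returns true only when the resource newly matches the condition. */
    boolean update(String resource, String state) {
        String previous = lastState.put(resource, state);
        if (state.equals(previous)) {
            return false; // nothing changed, so the stored result is still valid
        }
        evaluations++;
        if ("CrashedState".equals(state)) {
            return crashed.add(resource);
        }
        crashed.remove(resource);
        return false;
    }

    Set<String> matches() {
        return crashed;
    }

    int evaluations() {
        return evaluations;
    }
}
```

With mostly unchanged heartbeats, the evaluation count tracks the number of changes rather than the number of resources, at the cost of keeping one remembered state per resource.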
Management Actions                                      18

       Action Types
    1. Create a New service
    2. Restart a running service or recover a failed service
    3. Relocate a service
    4. Tune and configure a resource – change the configuration
       of a resource or change the structure of the system.
    5. User Interaction Action
       Actions implementation:
    ◦    Use shell scripts (e.g. service start or stop) and execute
         them using a Host Agent running in each host.
    ◦    Use Hasthi Agent integrated with each resource.
       Hasthi provides default management actions, but
        users can write their own.
Management Complexities                                       19

Even with a Global view, management can go wrong in many
ways. The following are some complexities and proposed
remedies (see Chapter 7 for details).
1. Failed Management Actions: Hasthi uses the resource
   lifecycle, which sets the resource state to “Unrecoverable” if an
   action fails and asks the user for help.
2. Lost system structure (broken links): services can use the
   “dependency-discovery” operation to find other services.
3. Lost state: Hasthi does not preserve state but helps
   resources locate their storage locations (a resource exposes
   the location as a property, and Hasthi passes it as an argument
   when it recovers the service).
4. Lost messages: retry + session-level checkpoints.
5. False positives (custom failure detectors) and network partitions.
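
As a rough illustration of the first remedy, the sketch below runs a recovery action and moves the resource to an unrecoverable state when it does not succeed. The names are hypothetical, and retrying the action a bounded number of times before giving up is an assumption of this sketch; the slide mentions retries only for lost messages.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical lifecycle step: try to recover, otherwise hand over to the user. */
final class ActionRetrySketch {
    enum State { CRASHED, RUNNING, UNRECOVERABLE }

    /**
     * Runs the action up to maxAttempts times (assumed policy).
     * The action returns true when the resource came back.
     */
    static State recover(BooleanSupplier action, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (action.getAsBoolean()) {
                return State.RUNNING;
            }
        }
        // At this point a framework would raise a user-interaction action.
        return State.UNRECOVERABLE;
    }
}
```

A caller would pass the restart logic as the supplier, for example `ActionRetrySketch.recover(() -> restart(service), 3)`, where `restart` is whatever the deployment uses to start the service.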
20
           Application of Hasthi
  Find the 10% of errors that happen 90% of the time
  Figure out how to preserve state across changes
21
LEAD Use Case




   LEAD services are stateless or have a persistent state. Data storage
    is best effort. We can recover by restarting services.
   Recover from Host & Service Failures – restart the failed services
   Recover workflows - Detect when the system has failed and
    recovered and resurrect any failed Workflows.
Scalability: Test Setup                                     22




                                                                        Q?


Main Test Setup
   Large scale deployment of LEAD.
   Multiple replicas of the complete LEAD stack.
   Each service simulates a management workload using a randomized algorithm.
   Set of rules to manage the system, and each test ran for 1 hour with a 30-second epoch time.
Coordinator Test Setup
   Test-Manager that simulates all messages generated by a normal manager managing a set of resources.
   We simulated a large-scale system using Test-Managers.
   The coordinator does not see a difference.
23
Measurements (Metrics)
One Manager Overhead (Resource Heartbeat Latency,
     Manager Loop Overhead, Manager Heartbeat Latency) 24




Managers Overhead (Coordinator Loop, Manager Heartbeat )




   One manager scales to 5000-8000 resources; Hasthi scales further with
    added managers. More tests are needed to find the limits.
Coordinator Limit: (Manager Heartbeat Latency, 25
Coordinator Loop Overhead) vs. Resource count




   Overhead is close to linear; the coordinator scales to 100,000
    resources and 1000 managers, and the number of managers does not
    make much difference.
   Why? (1) Summarization, (2) Only transfer Changes, (3) Rete
    Algorithm, which only evaluates changes (tradeoff between speed vs.
    memory).
Manager Independence: (Resource heartbeat, Manager
Loop vs. Manager Heartbeat) vs. resources per Manager 26




   We measured the limit of a manager and the limit of the coordinator.
   Hypothesis: a manager's overhead depends only on the resources assigned
    to that manager, not on other managers or resources in the system
       => we can scale up Hasthi (e.g. 100 managers, 1000 resources each).
   Verify Hypothesis:
       A scatter plot: overhead vs. number of resources per manager.
       Points with the same X value are reasonably close to each other.
       The hypothesis holds up to at least 2000 resources.
   Why? Managers do not usually interact with other managers, but talk
    with the coordinator.
27
         Scalability: Summary
1.   One manager scales to 5000-8000 resources.
2.   Managers only depend on resources assigned to
     them (at least till 2000 resources) and are not
     affected by other Managers in the system.
3.   Coordinator scales to 100,000 resources and 1000
     managers (100-1000 resources per manager < 2000
     limit in #2).

                                                  Q?


          System scales to 100,000 resources.
Robustness: Correctness Proof                                            28

    Self Stabilization = the system reaches a safe state regardless of the initial
                      state and continues to be at that state.

   We proved (in Chapter 5) that, given a system managed with Hasthi, there
    exists a constant h for that system such that Hasthi self-stabilizes if
    managers do not join or leave and communication failures do not happen
    for a continuous time interval of length h.




   Proof Outline: We took all states and proved that for any state there is a
    forced sequence that recovers the system within a bounded time.
29
               Availability of Hasthi
   Availability = MTTF / (MTTF + MTTR)    (1)
    The proof provides the recovery time. Let us use that to calculate
    availability as a function of the MTTF of a single manager.
   Let us assume a system managed with n independent managers,
    each manager having an MTTF (Mean Time To Failure) of Ѳ.
   Then
    ◦ Managers are independent => we can use an exponential distribution
      to model their failures (Srinivasan [143]).
    ◦ p, the probability that no failures happen within a unit of time (one second),
      follows from that distribution (Srinivasan [143]); the formula itself is not
      preserved in this transcript.    (2)
    ◦ MTTF of Hasthi is Ѳ/n (according to Baumann [108]).    (3)
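
For readers who want the missing step, the following is a hedged reconstruction from the exponential model alone, not a copy of the original Equation (2). It writes Θ for the slide's Ѳ, measures it in seconds, and assumes that p refers to all n managers rather than to a single one; that reading is consistent with (3), but the transcript does not confirm it.

```latex
% Assumption: each manager fails independently, with an exponentially
% distributed lifetime of mean \Theta (in seconds).
\begin{align*}
  \Pr[\text{one manager survives 1 s}] &= e^{-1/\Theta} \\
  p = \Pr[\text{none of the } n \text{ managers fails within 1 s}]
    &= \left(e^{-1/\Theta}\right)^{n} = e^{-n/\Theta} \\
  \mathrm{MTTF}_{\text{Hasthi}} &= \frac{1}{n/\Theta} = \frac{\Theta}{n}
\end{align*}
```

The last line follows because the first failure among n independent exponential lifetimes is itself exponential with rate n/Θ, which matches (3).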
30




   Definition: NF(r) = time elapsed until the first continuous
    time interval of length r with no failures has occurred.
   Then h_c = E[NF(r)].
   E[NF(r)] is the same as the expected number of tosses for r
    consecutive HEADS to occur with a biased coin that shows
    a HEAD with probability p.
   It has been shown that E[NF(r)] = (1 - p^r) / ((1 - p) p^r).    (4)

   Using (2) and (4), we can calculate h_c = E[NF(r)].
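
The coin-tossing result is standard and can be derived in two lines. The derivation below is a sketch that treats time as discrete seconds, each of which is failure-free (a HEAD) with probability p; the symbol E_k is introduced here only for the derivation.

```latex
% E_k = expected number of tosses until k consecutive HEADS, with Pr[HEAD] = p.
% After reaching k-1 HEADS, one more toss either finishes (probability p)
% or forces a restart from zero (probability 1-p).
\begin{align*}
  E_0 &= 0, \qquad
  E_k = E_{k-1} + 1 + (1-p)\,E_k
  \;\Longrightarrow\; E_k = \frac{E_{k-1} + 1}{p} \\
  E[NF(r)] = E_r &= \sum_{k=1}^{r} p^{-k}
    = \frac{p^{-r} - 1}{1 - p}
    = \frac{1 - p^{r}}{(1-p)\,p^{r}}
\end{align*}
```

As a sanity check, a fair coin (p = 1/2) with r = 2 gives 6 tosses, the familiar answer for two heads in a row.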
31




   A similar result holds for recovery from manager failures: h_m =
    E[NF(m)].
   We have 1 coordinator and n-1 managers; combining h_c and h_m
    accordingly gives the MTTR (the formula itself is not preserved
    in this transcript).    (5)

   We know both MTTR (by Equation 5) and MTTF (by
    Equation 3); therefore, we know availability = MTTF / (MTTF
    + MTTR) as a function of Ѳ (MTTF of one Manager).
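
The chain from Ѳ to availability can be sketched numerically. The code below is an illustration under stated assumptions, not the dissertation's calculation: all names are hypothetical, the per-second no-failure probability is taken as p = e^(-n/Ѳ), and the MTTR weighting (coordinator with probability 1/n, an ordinary manager otherwise) is a guess standing in for Equation (5), which the transcript does not preserve.

```java
/** Hypothetical sketch of availability as a function of a manager's MTTF. */
final class AvailabilitySketch {
    /**
     * Expected seconds until r consecutive failure-free seconds, for failure rate lambda
     * per second. Equals (p^-r - 1) / (1 - p) with p = e^-lambda; expm1 keeps the result
     * accurate when lambda is tiny.
     */
    static double expectedQuietWait(double lambda, int r) {
        return Math.expm1(r * lambda) / -Math.expm1(-lambda);
    }

    /** ASSUMPTION in place of Equation (5): weight h_c and h_m by which node failed. */
    static double mttr(double hc, double hm, int n) {
        return hc / n + hm * (n - 1) / n;
    }

    /** Equation (1). */
    static double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    /** Lower-bound availability for n managers, each with MTTF thetaSeconds. */
    static double hasthiAvailability(double thetaSeconds, int n, int r, int m) {
        double lambda = n / thetaSeconds;          // system-wide failure rate per second
        double hc = expectedQuietWait(lambda, r);  // coordinator recovery
        double hm = expectedQuietWait(lambda, m);  // manager recovery
        return availability(thetaSeconds / n, mttr(hc, hm, n)); // MTTF from Equation (3)
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 30-day manager MTTF, 100 managers, r = 120 s, m = 60 s.
        double theta = 30 * 24 * 3600.0;
        System.out.printf("availability: %.6f%n", hasthiAvailability(theta, 100, 120, 60));
    }
}
```

The values of r and m in `main` are placeholders; the real intervals are the ones defined by the proof.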
32




   Parameters
        Ѳ = MTTF of a manager
        r, m = continuous time intervals defined by the proof
        n = the number of managers in the system
   Since our proof provides an upper bound for the
    recovery time, the result is a lower bound for
    availability.
Availability vs. Manager MTTF           33




   Availability classes defined by Gray et al.:
       Managed systems (83 hours downtime/year)
       Well-managed systems (9 hours downtime/year)
       Fault-tolerant systems (1 hour downtime/year)
34
      Robustness: Empirical Results




   Instrument Hasthi to generate events about status, add a new manager,
    kill the current coordinator, and measure the time to detect, to recover
    Hasthi, and to build the meta-model.
   We repeated the test 100 times. Detection time decreases (O(1/n)), election time
    increases (O(log(n))), recovery time increases, and overall time decreases!!
    Recovery time is about 80 seconds.
35
Availability of the Managed System




   With LEAD, recovery took about 2 minutes (60 + 20 + 30 sec).
   By our calculation, the availability of LEAD with Hasthi is
    0.995 - 0.999, which is roughly 44 to 9 hours of downtime per year.
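
The conversion from availability to yearly downtime is a one-liner, shown here as a small sketch with hypothetical names and a 365-day year.

```java
/** Hypothetical helper: converts an availability fraction to downtime per year. */
final class DowntimeSketch {
    static final double HOURS_PER_YEAR = 365 * 24; // 8760, ignoring leap years

    static double downtimeHoursPerYear(double availability) {
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double a : new double[] {0.995, 0.999}) {
            System.out.printf("availability %.3f -> %.1f hours/year%n", a, downtimeHoursPerYear(a));
        }
    }
}
```

An availability of 0.995 corresponds to 43.8 hours per year and 0.999 to about 8.8 hours per year, which places LEAD with Hasthi between the managed and well-managed classes of Gray et al.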
36
     Implications Of Our Results
   With a Global view of the system, users can author
    management logic the same way they reason about
    the system (easy and intuitive).
   There is a tradeoff between scalability and explicit
    management logic, but Hasthi covers most usecases
    while supporting explicit user defined management
    logic.
    When building generic management frameworks, it
     is possible to enforce user-defined global and local
       management logic in most real world usecases.
Contributions                                  37

Problem: How can user-defined management logic (that
depends on a global view of the managed system) be enforced on
large-scale systems, and how can such a
framework be applied to manage systems?
  Proposed an architecture to solve this problem (“Manager-Cloud
   Algorithm” + monitoring information as a meta-model of the
   system that exhibits delta-consistency + Decision Framework).
 Proved its robustness analytically and verified it empirically.
 Implemented the architecture and empirically demonstrated that
   it can scale to manage most real world usecases.
   A demonstration that, despite its dependency on a global view, a
      Management Framework can scale to manage most real world
      usecases.
 Analyzed applications of user-defined management logic to
   manage systems, proposed solutions to the management
   complexities that arise from these applications, and applied them to
   manage a large-scale e-science project.
38
Questions
39
                   Future Work
   Graphical Composition of Management Logic to
    simplify management logic authoring.
   Building a Distributed Service Container on top of
    Hasthi.
   Making the Coordinator Lightweight, thus try to
    increase the scalability limit of Hasthi.
   Further explore the Application of Management
    Frameworks.
40




Backup Slides
Sensitivity: Rules                                41




   To find the sensitivity to rules, we used 7 rule sets, each having more
    rules than the one before, with 40,000 resources.
   Overhead is almost linear and seems to be stable. We also
    verified this by running 100,000 resources against the most
    complex rule set.
Sensitivity: Epoch Time                           42




   Epoch times are time periods between heartbeats and control
    loop evaluations etc, and they decide how fast Hasthi reacts
    to failures.
   Why does overhead reduce with a smaller epoch? The Rete algorithm
    remembers old results and only evaluates changes. A small
    epoch means fewer changes per evaluation, which means less overhead!!
43
              Sensitivity: Workload




   Increase failures in the system (increase workload on
    Hasthi) and measure with 40,000 resources.
   Hasthi is stable. Why? Hasthi uses a job queue to execute
    actions asynchronously. Therefore, it can withstand higher
    workloads and surges.
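
The job-queue idea can be sketched with the standard Java executor framework. This is an illustration of the pattern, not Hasthi's implementation: the class name, the fixed-size thread pool, and the completion counter are all assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical asynchronous action queue backed by a fixed thread pool. */
final class ActionQueueSketch {
    private final ExecutorService workers;
    private final AtomicInteger finished = new AtomicInteger();

    ActionQueueSketch(int threads) {
        this.workers = Executors.newFixedThreadPool(threads);
    }

    /** Enqueues an action and returns immediately, so the control loop is never blocked. */
    void submit(Runnable action) {
        workers.execute(() -> {
            try {
                action.run();
            } finally {
                finished.incrementAndGet(); // counts finished actions, successful or not
            }
        });
    }

    /** Stops accepting work, waits for queued actions, and returns how many finished. */
    int drain(long timeoutSeconds) {
        workers.shutdown();
        try {
            workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished.get();
    }

    public static void main(String[] args) {
        ActionQueueSketch queue = new ActionQueueSketch(4);
        for (int i = 0; i < 1000; i++) {
            queue.submit(() -> { /* a management action would run here */ });
        }
        System.out.println("finished actions: " + queue.drain(10));
    }
}
```

A surge of failures only lengthens the queue; the rule evaluation that submits the actions keeps its pace, which is one way to read the stability seen in this test.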
Useful: LEAD Integration                                   44




   Integrate Hasthi with LEAD. Hasthi recovers LEAD from
    service and host failures and recovers failed workflows.
   We (A) killed a service and (B) killed a host, and measured the time
    to detect, trigger actions, have new resources join, and detect
    healthy conditions. It takes about 2 minutes to recover the
    system and to know it is healthy.
45
    Comparison With Gadgil et al.




   CGLM evaluates each resource in parallel, whereas Hasthi does
    it as a batch.
   Hasthi creates an HTTP connection every time, whereas
    CGLM uses a pool of connections.
Comparison With Gadgil et al. (Contd.)   46
47
Resource LifeCycle
48
Types of Management Agents
49
In Memory Agent Implementation
Management Action Implementation   50
Overhead on a Host in a Test Setup   51




   Even with 200 services, the host transferred 0.04 MB/s
    out of possible 1Gb/s bandwidth (< 1%) and had 0.02
    load average out of 2.0 (< 2%).

 

Plus de Srinath Perera (20)

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-Making
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the Enterprise
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance Professionals
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & Challenges
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future Integrations
 
Future of Serverless
Future of ServerlessFuture of Serverless
Future of Serverless
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going?
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of Blockchain
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New Technologies
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata Era
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and Risks
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology Landscape
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies Timeline
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming Applications
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the Ugly
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through Analytics
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration Technology
 

Dernier

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Dernier (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Dissertation Defence: Enforcing User-Defined Management Logic in Large Scale Systems

  • 1. ENFORCING USER-DEFINED MANAGEMENT LOGIC IN LARGE SCALE SYSTEMS. Srinath Perera, Indiana University, Bloomington
  • 2. Outline
      • Motivation & the Problem
      • Related Work
      • Proposed Architecture
      • Scalability Results
      • Robustness
      • Contributions
  • 3. Motivation: Large Scale Systems
      • IT is becoming a part of our everyday life
          ◦ This increases the size of the potential user bases of systems (Google, Facebook, Amazon …).
          ◦ Information avalanche.
          ◦ National and global scale data collection.
          ◦ Success in this setting is decided by our ability to make sense of this data – scale matters (Google!).
      • Technological advances
          ◦ Connectivity and SOA make complex systems possible.
          ◦ Computing power everywhere (multicore, smart phones).
          ◦ Cloud: lowers the barrier for scale.
      • We have the need and the means to build large scale systems.
  • 4. Building them is Feasible, but Keeping them Running?
      • Changes are a norm rather than an exception – “10,000 servers, each having MTTF of thousand days => 10 failures/day” [Jeff Dean].
      • High operational cost: when a system scales up, complexity increases.
          ◦ More than 75% of TCO (Total Cost of Ownership), based on Patterson et al. data (dominated by salaries).
          ◦ 50% of the IT budget is spent on recovering from failures [Ganek et al.].
      • Unreliable middleware: Grid reliability across all operations is 55%-80% [Khalili et al.]. Even at 80%, the success rate of a service or a workflow that has 6 grid operations is only about 0.26!
      • Efforts to avoid failures have been unsuccessful – “Not a problem to be solved, but a fact to cope with” [Patterson].
      • System management is a potential solution to this problem!
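The two figures on slide 4 can be checked with a few lines of arithmetic. The sketch below assumes independent failures at a constant rate of 1/MTTF and independent grid operations; the function names are illustrative and not from the talk.

```python
# Back-of-the-envelope checks for the figures quoted on slide 4.
# Assumption: failures and operations are independent of each other.

def expected_failures_per_day(servers: int, mttf_days: float) -> float:
    """Expected failures per day if each server fails at rate 1/MTTF."""
    return servers / mttf_days


def workflow_success_rate(per_operation_success: float, operations: int) -> float:
    """Probability that every operation of a workflow succeeds."""
    return per_operation_success ** operations


failures = expected_failures_per_day(10_000, 1_000)
best_case = workflow_success_rate(0.80, 6)   # upper end of the 55%-80% range
worst_case = workflow_success_rate(0.55, 6)  # lower end of the range

print(f"failures/day: {failures:.0f}")
print(f"6-operation workflow success: {worst_case:.3f} to {best_case:.2f}")
```

The 0.26 figure corresponds to the optimistic end of the reliability range; at 55% per operation the success rate drops below 0.03.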
  • 5.
  • 6. Management Framework for Large Scale Systems should
      • Support user-defined management logic
          ◦ Management use cases differ from system to system
          ◦ => only big organizations can afford to build system-specific frameworks
          ◦ => need user-defined management logic.
          ◦ Ease of authoring management logic is important.
      • Be scalable
      • Be robust: changes are a norm rather than an exception!
      • Be dynamic: resources often join and leave.
      • Need: a dynamic and robust management framework that supports user-defined management logic.
  • 7. The Problem
      • Large scale systems need many managers
          ◦ A single manager neither scales nor is robust.
      • Each manager has a partial view of the system
          ◦ A subset of the resources is assigned to each manager.
      • But a global view is preferred (ease of authoring logic)
          ◦ Logic that works only on local data has to rely on emergent properties, which are hard for users to author.
          ◦ We all think in terms of global properties.
      • Example: “If the system does not have 5 message brokers, create new brokers and connect them to the broker network.” Steps: detect that there are fewer than 5 brokers, find the best place to create a new one, create it, and connect it to the existing brokers.
      • Problem: How can user-defined management logic that depends on a global view be enforced on large-scale systems? And how can such a framework be applied to manage systems?
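The broker example on slide 7 can be written as a single function over a global view. This is a minimal sketch and not Hasthi's rule language: `SystemView`, `ensure_min_brokers`, the "least-loaded host" placement policy, and the generated broker names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SystemView:
    """Hypothetical global view of the managed system."""
    brokers: dict = field(default_factory=dict)    # broker name -> host name
    host_load: dict = field(default_factory=dict)  # host name -> load


MIN_BROKERS = 5


def ensure_min_brokers(view: SystemView) -> list:
    """Plan actions that bring the broker count up to MIN_BROKERS.

    Placement policy (an assumption): put each new broker on the currently
    least-loaded host. Returns planned actions instead of executing them.
    """
    actions = []
    brokers = dict(view.brokers)
    load = dict(view.host_load)
    while len(brokers) < MIN_BROKERS and load:
        host = min(load, key=load.get)
        index = 1
        while f"broker-{index}" in brokers:  # pick an unused name
            index += 1
        name = f"broker-{index}"
        actions.append(("create", name, host))
        # Connect the new broker to every broker that already exists.
        actions.extend(("connect", name, other) for other in sorted(brokers))
        brokers[name] = host
        load[host] += 1.0
    return actions


view = SystemView(
    brokers={"b1": "h1", "b2": "h2", "b3": "h1"},
    host_load={"h1": 2.0, "h2": 1.0},
)
plan = ensure_min_brokers(view)
```

With only a partial view, no single manager could evaluate the `len(brokers) < MIN_BROKERS` condition, which is why the rule needs the global view.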
  • 8. Related Work
      • Systems without Global Control
          ◦ Centralized management systems (e.g. Rainbow)
          ◦ Managers that act independently (e.g. Extreme (Kx), DREAM), and manual coordination (e.g. IBM Tivoli).
      • Systems with Global Control
          ◦ Decentralized control: DMonA, and Deugo et al.
          ◦ Monitor and run a state machine of the system: Dubey et al.
          ◦ Consistent shared view: Georgiadis et al.; component managers collaborate via totally ordered multicast to maintain a system according to architectural constraints.
  • 9. Related Work (Contd.)
      • Systems with Global Control (Contd.)
          ◦ Management hierarchy
              - A management hierarchy where the topmost layer is replicated (e.g. Monalisa, Gadgil et al.).
              - Typically, aggregation is used at each level.
              - Aggregation hides information about a single resource.
          ◦ Hierarchy with policies
              - WildCat: an agent-group-based hierarchy that communicates via whiteboards and uses policies to control agents. The authors are concerned about the scalability of whiteboards.
          ◦ Cooperating managers (no global control loop)
              - Schoenwaelder: a group of cooperating agents and a master agent (IP multicast).
              - ANDREA: creates dynamic hierarchies and delegates tasks to other managers via delegate statements in the management logic.
  • 10.
      Approach | Scalable | Robust | Ease of writing management logic | Problems
      Decentralized control (e.g. DMonA, and Deugo et al.) | Highly | Yes | Hard | Hard for users to write rules to achieve emergent behavior
      Complex Event Processing (DREAM) | Yes | Possible | Not easy | Event model has limited memory
      Consistent view across managers (e.g. Georgiadis et al.) | No | Yes | Yes | Needs ordered reliable multicast – does not scale
      Hierarchical control with aggregation (Monalisa) | Highly | Possible | Not easy | Loses the identity of a single resource due to aggregation
      Hierarchy with policies at each level (e.g. WildCat) | Yes | Possible | Possible | Policies are not as explicit as rules.
      State machine (Dubey et al.) | Yes | Possible | Not easy | Users have to construct this state machine, which …
  • 11. Outline of the Evidence
      • Solution: Hasthi Architecture
      • Useful
          ◦ Application to a Large-Scale E-Science Project (LEAD)
      • Sound
          ◦ Scalable (empirical results)
          ◦ Robust and dynamic (proof + empirical results)
      • Main Contribution: “Proposing, implementing, and analyzing a dynamic and robust management architecture, which can manage large-scale systems by enforcing user-defined management logic that depends on a global view of the managed system state, and application of the management logic to manage systems.”
  • 12. Big Picture (Hasthi)
      • Hasthi has three parts:
          ◦ Manager Cloud: a distributed architecture that binds managers and resources in the system as one cohesive unit.
          ◦ Meta-Model that represents the system state.
          ◦ Decision Framework.
  • 13. Manager Cloud
      • Managers form a P2P network (Pastry), which is used for initialization and recovery (elections).
      • Normal operations use SOAP over HTTP.
  • 14.
  • 15. Meta-Model
      • The meta-model represents the monitoring data collected from the system. The summarized meta-model provides a global view.
      • Delta-consistency: changes are reflected within a bounded time (a concept borrowed from shared memory multiprocessors [see Singla et al.]).
  • 16. Decision Framework
      • Users define management logic as rules: local and global.
      • Manager control loops evaluate partial meta-models using local rules.
      • The coordinator control loop evaluates the summarized meta-models using global rules (global view).
      • Actions triggered by rules analyze the meta-model and decide on solutions.
  • 17. Management Rules
      • Rules (Drools) evaluate meta-objects (which represent resources) and execute actions, which analyze meta-objects and decide on solutions.

          rule "RestartFailedServices"
          when
              service: ManagedService(state == "CrashedState");
              host: Host(state != "CrashedState", service.host == name);
          then
              system.invoke(new RestartAction(service), new ActionCallback() {
                  public void actionSucessful(ManagementAction action) { ..... }
                  public void actionFailed(ManagementAction action, Throwable e) {
                      service.setState("UnRepairableState");
                      system.invoke(new UserInteractionAction(system, service, action, e));
                  }
              });
          end

      • When the condition, expressed in the object query language, is met, the actions in the then-clause are carried out.
      • The Rete algorithm is used to evaluate meta-objects and execute corrective actions; it trades memory for speed.
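For readers unfamiliar with Drools, the when/then semantics of the `RestartFailedServices` rule can be mirrored in plain Python. This is an illustrative sketch only: it scans all objects on every call instead of using Rete, and the `restart` callable and the `"RunningState"` state name are assumptions that do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    state: str


@dataclass
class ManagedService:
    name: str
    host: str
    state: str


def restart_failed_services(services, hosts, restart):
    """For every crashed service whose host is not crashed, call restart().

    If restart() raises, mark the service unrepairable and report it so a
    user can step in (the role of UserInteractionAction in the rule).
    """
    healthy_hosts = {h.name for h in hosts if h.state != "CrashedState"}
    needs_user = []
    for service in services:
        if service.state == "CrashedState" and service.host in healthy_hosts:
            try:
                restart(service)
            except Exception as error:
                service.state = "UnRepairableState"
                needs_user.append((service.name, str(error)))
    return needs_user


def demo_restart(service):
    """Hypothetical restart action: fails for service 'c', succeeds otherwise."""
    if service.name == "c":
        raise RuntimeError("boom")
    service.state = "RunningState"


hosts = [Host("h1", "RunningState"), Host("h2", "CrashedState")]
services = [
    ManagedService("a", "h1", "CrashedState"),
    ManagedService("b", "h2", "CrashedState"),  # host is down: rule does not fire
    ManagedService("c", "h1", "CrashedState"),
    ManagedService("d", "h1", "RunningState"),
]
escalated = restart_failed_services(services, hosts, demo_restart)
```

The join between service and host (`service.host == name` in the rule) becomes the `service.host in healthy_hosts` check here.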
  • 18. Management Actions
      • Action types
          1. Create a new service
          2. Restart a running service or recover a failed service
          3. Relocate a service
          4. Tune and configure a resource: change the configuration of a resource or change the structure of the system.
          5. User interaction action
      • Action implementation:
          ◦ Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running on each host.
          ◦ Use a Hasthi Agent integrated with each resource.
      • Hasthi provides default management actions, but users can write their own.
  • 19. Management Complexities
      • Even with a global view, management can go wrong in many ways. The following are some complexities and proposed remedies (see Chapter 7 for details).
          1. Failed management actions: Hasthi uses the resource lifecycle, which sets the resource state to “Unrecoverable” if an action fails and asks for user help.
          2. Lost system structure (broken links): services can use the “dependency-discovery” operation to find other services.
          3. Lost state: Hasthi does not preserve state but helps resources to locate their storage locations (a resource exposes the location as a property, and Hasthi passes it as an argument when it recovers the service).
          4. Lost messages: retry + session-level checkpoints.
          5. False positives (custom failure detectors) and network partitions.
  • 20. Application of Hasthi
      • Find the 10% of errors that happen 90% of the time.
      • Figure out how to preserve state across changes.
  • 21. LEAD Usecase
      • LEAD services are stateless or have a persistent state. Data storage is best effort. We can recover by restarting services.
      • Recover from host and service failures: restart the failed services.
      • Recover workflows: detect when the system has failed and recovered, and resurrect any failed workflows.
  • 22. Scalability: Test Setup
      • Main test setup:
          ◦ Large scale deployment of LEAD.
          ◦ Multiple replicas of the complete LEAD stack.
          ◦ Each service simulates a management workload using a randomized algorithm.
          ◦ A set of rules manages the system, and each test ran for 1 hour with a 30-second epoch time.
      • Coordinator test setup:
          ◦ A Test-Manager simulates all messages generated by a normal manager managing a set of resources.
          ◦ We simulated a large-scale system using Test-Managers.
          ◦ The coordinator does not see a difference.
  • 24. One Manager Overhead (Resource Heartbeat Latency, Manager Loop Overhead, Manager Heartbeat Latency); Managers Overhead (Coordinator Loop, Manager Heartbeat)
      • One manager scales to 5000-8000 resources; Hasthi scales further with added managers. More tests are needed to find the limits.
  • 25. Coordinator Limit: (Manager Heartbeat Latency, Coordinator Loop Overhead) vs. Resource count
      • The overhead is close to linear; the coordinator scales to 100,000 resources and 1000 managers, and the number of managers does not make much difference.
      • Why? (1) Summarization, (2) only changes are transferred, (3) the Rete algorithm, which only evaluates changes (a tradeoff between speed and memory).
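The first two reasons given on slide 25 (summarization and transferring only changes) can be sketched as follows. This is not Hasthi's wire format: the dictionaries, the choice of `state` as the only summarized field, and the use of `None` to signal a removed resource are assumptions for illustration.

```python
def summarize(resources: dict) -> dict:
    """Manager side: keep only the fields global rules need (here, the state)."""
    return {name: record["state"] for name, record in resources.items()}


def delta(previous: dict, current: dict) -> dict:
    """Manager side: entries added or changed since the last heartbeat.

    Removed resources are reported as name -> None.
    """
    changes = {k: v for k, v in current.items() if previous.get(k) != v}
    changes.update({k: None for k in previous if k not in current})
    return changes


def apply_delta(view: dict, changes: dict) -> dict:
    """Coordinator side: fold a manager's delta into the summarized view."""
    merged = dict(view)
    for name, state in changes.items():
        if state is None:
            merged.pop(name, None)
        else:
            merged[name] = state
    return merged


before = summarize({
    "s1": {"state": "Running", "cpu": 0.2},
    "s2": {"state": "Running", "cpu": 0.9},
})
after = summarize({
    "s1": {"state": "Running", "cpu": 0.7},  # cpu changed, summary did not
    "s2": {"state": "Crashed", "cpu": 0.0},
    "s3": {"state": "Running", "cpu": 0.1},
})
heartbeat_payload = delta(before, after)
```

Note that the CPU change on `s1` never reaches the coordinator: summarization drops it, so the payload grows with the number of state changes instead of the number of resources.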
  • 26. Manager Independence: (Resource Heartbeat, Manager Loop vs. Manager Heartbeat) vs. resources per Manager
      • We measured the limit of a manager and the limit of the coordinator.
      • Hypothesis: a manager's overhead depends only on the resources assigned to that manager, not on other managers or resources in the system. If it holds, we can scale up Hasthi (e.g. 100 managers, 1000 resources each).
      • Verifying the hypothesis:
          ◦ A scatter plot: overhead vs. number of resources per manager.
          ◦ Points with the same X value are reasonably close to each other.
          ◦ The hypothesis is valid up to at least 2000 resources.
      • Why? Managers do not usually interact with other managers; they talk with the coordinator.
  • 27. Scalability: Summary
      1. One manager scales to 5000-8000 resources.
      2. Managers depend only on the resources assigned to them (up to at least 2000 resources) and are not affected by other managers in the system.
      3. The coordinator scales to 100,000 resources and 1000 managers (100-1000 resources per manager, below the 2000 limit in #2).
      • The system scales to 100,000 resources.
  • 28. Robustness: Correctness Proof
      • Self-stabilization = the system reaches a safe state regardless of the initial state and continues to be in that state.
      • We proved (in Chapter 5) that, given a system managed with Hasthi, there exists a constant h for that system such that Hasthi self-stabilizes if managers do not join or leave and communication failures do not happen for a continuous time interval of length h.
      • Proof outline: we took all states and proved that, for any state, there is a forced sequence that recovers the system within a bounded time.
  • 29. Availability of Hasthi
      • Availability = MTTF / (MTTF + MTTR) (1). The proof provides the recovery time; we use it to calculate availability as a function of the MTTF of a single manager.
      • Assume a system managed by n independent managers, each with an MTTF (Mean Time To Failure) of Ѳ.
      • Then
          ◦ Managers are independent => we can use an exponential distribution to model their failures (Srinivasan [143]).
          ◦ p, the probability that no failure happens within a unit of time (one second), then follows from this model, by Srinivasan [143] (2).
          ◦ The MTTF of Hasthi is Ѳ/n (according to Baumann [108]) (3).
  • 30.
      • Definition: NF(r) = the time that elapses until the first continuous interval of length r with no failures.
      • Then h_c = E[NF(r)]. E[NF(r)] is the same as the expected number of tosses for r consecutive HEADS to occur with a biased coin that shows a HEAD with probability p.
      • A closed form for this expectation is known (4).
      • Using (2) and (4), we can calculate h_c = E[NF(r)].
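The formulas for Equations (2) and (4) are not captured in this transcript. As a hedged reconstruction, and not necessarily in the form used on the slides, the standard result for the expected number of tosses until r consecutive heads is given below. The expression for p is an assumption: it is what the exponential model yields if Ѳ (written θ here) is measured in seconds and any of the n managers failing counts as a failure.

```latex
% Assumed form of Equation (2): probability of no failure in one second,
% with n independent managers, each failing at rate 1/\theta.
p = e^{-n/\theta}

% Standard runs-of-heads result, a candidate for Equation (4):
E[NF(r)] \;=\; \sum_{k=1}^{r} p^{-k}
         \;=\; \frac{1 - p^{r}}{p^{r}\,(1 - p)}, \qquad 0 < p < 1 .
```

As a sanity check, r = 1 gives E[NF(1)] = 1/p, the mean of a geometric distribution, and r = 2 gives 1/p + 1/p^2.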
  • 31.
      • A similar result holds for recovery from manager failures: h_m = E[NF(m)].
      • We have 1 coordinator and n-1 managers, which gives the MTTR (5).
      • Hence h_m and h_c give us the MTTR.
      • We know both MTTR (by Equation 5) and MTTF (by Equation 3); therefore, we know availability = MTTF / (MTTF + MTTR) as a function of Ѳ (the MTTF of one manager).
  • 32.
      • Parameters
          ◦ Ѳ = MTTF of a manager
          ◦ r, m = continuous time intervals defined by the proof
          ◦ n = the number of managers in the system
      • Since our proof provides an upper bound for the recovery time, the result is a lower bound for availability.
  • 33. Availability vs. Manager MTTF
      • Availability classes defined by Gray et al.: Managed Systems (83 hours downtime/year), Well Managed Systems (9 hours downtime/year), Fault Tolerant Systems (1 hour downtime/year).
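The availability calculation on slides 29-32 can be sketched numerically. The transcript does not include Equations (2), (4), and (5), so three pieces below are assumptions: `p = exp(-n/theta)` with `theta` in seconds, the runs-of-heads closed form for E[NF(r)], and an MTTR that weights `h_c` and `h_m` by the chance that the failed node was the coordinator (1/n) or a manager ((n-1)/n). The values of `r` and `m` are placeholders, since the proof's actual constants are not given here.

```python
import math


def expected_quiet_wait(p: float, r: int) -> float:
    """Expected number of unit steps until r consecutive failure-free steps,
    when each step is failure-free with probability p (runs-of-heads result)."""
    return (1.0 - p ** r) / ((p ** r) * (1.0 - p))


def availability(theta: float, n: int, r: int, m: int) -> float:
    """Availability bound as a function of the per-manager MTTF theta (seconds).

    ASSUMPTIONS (not from the slides): the form of p and the MTTR weighting.
    """
    p = math.exp(-n / theta)          # assumed Equation (2)
    h_c = expected_quiet_wait(p, r)   # coordinator recovery bound
    h_m = expected_quiet_wait(p, m)   # manager recovery bound
    mttr = (h_c + (n - 1) * h_m) / n  # assumed Equation (5)
    mttf = theta / n                  # Equation (3)
    return mttf / (mttf + mttr)       # Equation (1)


# Placeholder parameters: 10 managers, r = 60 s, m = 30 s.
for theta in (1e5, 1e6, 1e7):
    print(f"theta = {theta:.0e} s -> availability >= {availability(theta, 10, 60, 30):.6f}")
```

Under these assumptions availability rises with Ѳ, which matches the direction of the Availability vs. Manager MTTF plot on slide 33; the absolute numbers depend entirely on the assumed constants.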
  • 34. Robustness: Empirical Results
      • Instrument Hasthi to generate events about its status, add a new manager, kill the current coordinator, and measure the time to detect the failure, to recover Hasthi, and to build the meta-model.
      • We ran the test 100 times. Detection time decreases (O(1/n)), election time increases (O(log(n))), recovery time increases, and overall time decreases! Recovery time is about 80 seconds.
  • 35. Availability of the Managed System
      • With LEAD, recovery took about 2 minutes (60 + 20 + 30 sec).
      • By our calculation, the availability of LEAD with Hasthi is 0.995-0.999, which is about 40 to 10 hours of downtime per year.
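The downtime figures on slide 35 follow directly from the availability values. A small sketch of the conversion, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


low = downtime_hours_per_year(0.995)
high = downtime_hours_per_year(0.999)
print(f"0.995 -> {low:.1f} h/year, 0.999 -> {high:.2f} h/year")
```

This gives roughly 44 and 9 hours per year, in line with the "about 40 to 10 hours" quoted on the slide.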
  • 36. Implications of Our Results
      • With a global view of the system, users can author management logic the same way they reason about the system (easy and intuitive).
      • There is a tradeoff between scalability and explicit management logic, but Hasthi covers most use cases while supporting explicit user-defined management logic.
      • When building generic management frameworks, it is possible to enforce user-defined global and local management logic in most real world use cases.
  • 37. Contributions
      • Problem: How can user-defined management logic that depends on a global view of the managed system be enforced on large-scale systems? And how can such a framework be applied to manage systems?
      • Proposed an architecture to solve this problem (“Manager-Cloud Algorithm” + monitoring information as a meta-model of the system that exhibits delta-consistency + Decision Framework).
      • Proved its robustness analytically and verified it empirically.
      • Implemented the architecture and empirically demonstrated that it can scale to manage most real world use cases.
      • Demonstrated that, despite its dependency on a global view, a management framework can scale to manage most real world use cases.
      • Analyzed applications of user-defined management logic to manage systems, proposed solutions to the management complexities that arise from these applications, and applied it to manage a large-scale e-science project.
  • 39. Future Work
      • Graphical composition of management logic, to simplify management logic authoring.
      • Building a distributed service container on top of Hasthi.
      • Making the coordinator lightweight, to try to increase the scalability limit of Hasthi.
      • Further exploring the application of management frameworks.
  • 41. Sensitivity: Rules
      • To find the sensitivity to rules, we used 7 rule sets, each having more rules than the one before, with 40,000 resources.
      • The overhead is almost linear and seems to be stable. We also verified this by running 100,000 resources against the most complex rule set.
  • 42. Sensitivity: Epoch Time
      • Epoch times are the time periods between heartbeats, control loop evaluations, etc., and they decide how fast Hasthi reacts to failures.
      • Why does the overhead reduce with a smaller epoch? The Rete algorithm remembers old results and only evaluates new results. A small epoch means fewer changes per evaluation, which means less overhead!
  • 43. Sensitivity: Workload
      • Increase failures in the system (which increases the workload on Hasthi) and measure with 40,000 resources.
      • Hasthi is stable. Why? Hasthi uses a job queue to execute actions asynchronously. Therefore, it can withstand higher workloads and surges.
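The job-queue idea on slide 43 can be illustrated with Python's standard library. This is a generic sketch of asynchronous action execution, not Hasthi's implementation; the `ActionExecutor` class and its method names are hypothetical.

```python
import queue
import threading


class ActionExecutor:
    """Runs management actions on worker threads, so the control loop only
    enqueues work and never blocks on a slow or failing action."""

    def __init__(self, workers: int = 2):
        self._jobs = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, action, *args):
        """Called from the control loop; returns immediately."""
        self._jobs.put((action, args))

    def _run(self):
        while True:
            action, args = self._jobs.get()
            try:
                action(*args)
            except Exception:
                pass  # a real system would record the failed action
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until every submitted action has finished."""
        self._jobs.join()


executor = ActionExecutor(workers=3)
completed = []
executor.submit(lambda: 1 / 0)  # a failing action must not kill its worker
for i in range(50):             # a surge of actions is simply queued
    executor.submit(completed.append, i)
executor.wait()
```

A burst of failures lengthens the queue instead of stalling rule evaluation, which is one plausible reading of why the measured overhead stays stable under higher workloads.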
  • 44. Useful: LEAD Integration. We integrated Hasthi with LEAD. Hasthi recovers LEAD from service and host failures and recovers failed workflows. We (A) killed a service and (B) killed a host, then measured the time to detect the failure, trigger actions, have new resources join, and detect healthy conditions. It takes about 2 minutes to recover the system and to know that it is healthy.
  • 45. Comparison With Gadgil et al. CGLM evaluates each resource in parallel, whereas Hasthi evaluates them as a batch. Hasthi creates an HTTP connection every time, whereas CGLM uses a pool of connections.
  • 46. Comparison With Gadgil et al. (Contd.)
  • 49. In-Memory Agent Implementation
  • 50. Management Action Implementation
  • 51. Overhead on a Host in a Test Setup. Even with 200 services, the host transferred 0.04 MB/s out of a possible 1 Gb/s of bandwidth (< 1%) and had a load average of 0.02 out of 2.0 (< 2%).
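
Slide 43 attributes Hasthi's stability under load to a job queue that executes actions asynchronously. The slides do not show that implementation, so the following is only a minimal sketch of the general technique, assuming a single worker thread and an unbounded queue; `ActionExecutor`, `submit`, and `wait` are hypothetical names, not Hasthi's API.

```python
import queue
import threading


class ActionExecutor:
    """Hypothetical sketch: runs management actions on a worker thread,
    decoupled from whatever loop decides that an action is needed."""

    def __init__(self):
        # Unbounded queue (an assumption): submit() never blocks the caller.
        self._jobs = queue.Queue()
        self.completed = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, name, action):
        """Enqueue an action and return immediately."""
        self._jobs.put((name, action))

    def _run(self):
        # Drain the queue one job at a time; a failing action is recorded
        # and does not stop the worker.
        while True:
            name, action = self._jobs.get()
            try:
                action()
                self.completed.append((name, "ok"))
            except Exception as exc:
                self.completed.append((name, f"failed: {exc}"))
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until every submitted action has been processed."""
        self._jobs.join()


# A surge of 100 simulated failures, each needing a "restart" action.
executor = ActionExecutor()
restarted = []
for i in range(100):
    executor.submit(f"restart-service-{i}", lambda i=i: restarted.append(i))
executor.wait()
```

In this sketch the caller only pays the cost of an enqueue during a surge; the actions themselves are absorbed by the queue and executed at the worker's pace.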

Editor's notes

  1. IT has become an indispensable part of our lives. Millions of users => larger user bases => a big deal to Google or Amazon. Success depends on the ability to make sense of data => the killer app of our time is search, “Large Scale Search”.
  2. I said it is possible to build large-scale systems; keeping them running is a different story! Most of us have come into contact with large-scale systems and know what it takes. Changes -> operational cost -> unreliable middleware. Many solutions -> quote Patterson -> system management as a potential solution.
  3. The role of system management is more like that of a human manager: watch over and control the system. We call the system being managed a “managed system”. I focus on Monitor -> Decide -> Execute, NOT on specifications, but rather on how to use exposed data.
  4. Management use cases differ from system to system. Google can afford to build its own management framework, but the medium and small organizations that we said are going to have large-scale systems can't. Just as we choose a WS middleware, they need to pick a management framework and configure it to manage their system.
  5. Centralized managers avoid the problem
  6. ANDREA – create a dynamic hierarchy to resolve issues
  7. This is the outline
  8. Resources, Managers, a special manager (the Coordinator) elected among them, and bootstrap nodes (the entry point, not shown in the figure). Heartbeats: Resources => Manager, Manager => Coordinator. Coordinator failed? Manager failed? Resource failed?
  9. k.
  10. “Convergence Stairs” proo
  11. 221 sec proof value!
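
Note 8 describes heartbeats flowing from resources to managers and from managers to the Coordinator, and slide 42 ties the heartbeat period to the epoch time. The notes do not say how a missed heartbeat is turned into a failure decision, so the sketch below is an assumption: a node is treated as failed once it has been silent for more than a configurable number of epochs. `HeartbeatMonitor` and `missed_epochs` are hypothetical names used only for illustration.

```python
class HeartbeatMonitor:
    """Hypothetical sketch of timeout-based failure detection.

    The same logic could run on a manager (watching resources) or on the
    coordinator (watching managers); timestamps are passed in explicitly
    so the behavior is deterministic.
    """

    def __init__(self, epoch_seconds, missed_epochs=2):
        # Assumption: a node is considered failed after being silent for
        # more than `missed_epochs` consecutive epochs.
        self.timeout = epoch_seconds * missed_epochs
        self.last_seen = {}

    def heartbeat(self, node_id, now):
        """Record that `node_id` reported in at time `now` (seconds)."""
        self.last_seen[node_id] = now

    def failed_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(epoch_seconds=30)
monitor.heartbeat("resource-1", now=0)
monitor.heartbeat("resource-2", now=0)
monitor.heartbeat("resource-1", now=60)   # resource-2 stays silent
suspected = monitor.failed_nodes(now=70)
```

Under this assumption a shorter epoch shortens the detection delay, which matches the slide's remark that epoch times decide how fast failures are noticed.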