Hasthi is a dynamic and robust management architecture that can enforce user-defined management logic on large-scale systems. It consists of a manager cloud, a meta-model representing system state, and a decision framework. The manager cloud forms a P2P network and uses the meta-model and rules to evaluate the system state and trigger corrective actions. Empirical results show that Hasthi scales to 100,000 resources, and it is proven to self-stabilize within a bounded time. Availability calculations also show that Hasthi achieves high availability even as components fail.
2. Outline
Motivation & the Problem
Related Work
Proposed Architecture
Scalability Results
Robustness
Contributions
3. Motivation: Large-Scale Systems
• IT is becoming a part of our everyday life
• This increases the size of the potential user bases of systems
(Google, Facebook, Amazon, …).
• Information avalanche.
• National- and global-scale data collection.
• Success in this setting is decided by our ability to make
sense of this data – scale matters (Google!).
• Technological advances
• Connectivity and SOA make complex systems possible.
• Computing power everywhere (multicore, smart phones).
• Cloud – lowers the barrier to scale.
We have both the need and the means to build large-scale systems.
4. Building Them Is Feasible, but Keeping Them Running??
Changes are the norm rather than the exception – “10,000 servers,
each having an MTTF of a thousand days => 10 failures/day” [Jeff Dean].
High operational cost – when a system scales up, complexity
increases.
◦ More than 75% of TCO (Total Cost of Ownership), based on Patterson et al.'s
data (dominated by salaries).
◦ 50% of the IT budget is spent on recovering from failures [Ganek et al.].
Unreliable middleware – Grid reliability across all operations is 55%–80%
[Khalili et al.]. Then the success rate of a service or a workflow
that has 6 grid operations is 0.8^6 ≈ 0.26!!!
Efforts to avoid failures have been unsuccessful – “Not a problem to
be solved, but a fact to cope with” [Patterson].
System management is a potential solution to this problem!!
6. A Management Framework for Large-Scale Systems Should
Support user-defined management logic
◦ Management use cases differ from system to system
◦ => only big organizations can afford to build system-specific frameworks
◦ => a generic framework needs user-defined management logic.
◦ Ease of authoring management logic is important.
Be scalable.
Be robust – changes are the norm rather than the exception!
Be dynamic – resources often join and leave.
We need a dynamic and robust management framework that
supports user-defined management logic.
7. The Problem
Large-scale systems need many managers
◦ One manager is neither scalable nor robust.
Each manager has a partial view of the system
◦ A subset of resources is assigned to each manager.
But a global view is preferred (ease of authoring logic)
◦ Logic that works only on local data needs emergent properties,
which are hard for users to author.
◦ We all think in terms of global properties.
Example: “If the system does not have 5 message brokers,
create new brokers and connect them to the broker
network.”: detect < 5 brokers, find the best place to create a
new one, create it, and connect it to the existing brokers.
Problem: How do we enforce user-defined management logic
(that depends on a global view) on large-scale
systems? And how do we apply such a framework to
manage systems?
8. Related Work
Systems without global control
◦ Centralized management systems (e.g. Rainbow)
◦ Managers that act independently (e.g. Extreme (Kx),
DREAM), and manual coordination (e.g. IBM Tivoli).
Systems with global control
◦ Decentralized control – DMonA, and Deugo et al.
◦ Monitor and run a state machine of the system – Dubey et al.
◦ Consistent shared view – Georgiadis et al.: component
managers collaborate via totally ordered multicast to
maintain a system according to architectural constraints.
9. Related Work (Contd.)
Systems with global control (contd.)
◦ Management hierarchy
Management hierarchy where the topmost layer is replicated
(e.g. Monalisa, Gadgil et al.).
Typically, aggregation is used at each level.
Aggregation hides information about individual resources.
◦ Hierarchy with policies
WildCat – an agent-group-based hierarchy that communicates via
whiteboards and uses policies to control agents. The authors raise
concerns about the scalability of the whiteboards.
◦ Cooperating managers – no global control loop
Schoenwaelder – a group of cooperating agents and a master
agent (IP multicast).
ANDREA – creates dynamic hierarchies and delegates tasks to
other managers via delegate statements in the management
logic.
10. Comparison of Approaches

Approach | Scalable | Robust | Ease of writing management logic | Problems
Decentralized control (e.g. DMonA, Deugo et al.) | Highly | Yes | Hard | Hard for users to write rules to achieve emergent behavior
Complex event processing (DREAM) | Yes | Possible | Not easy | Event model has limited memory
Consistent view across managers (e.g. Georgiadis et al.) | No | Yes | Yes | Needs ordered reliable multicast – does not scale
Hierarchical control with aggregation (Monalisa) | Highly | Possible | Not easy | Loses the identity of a single resource due to aggregation
Hierarchy with policies at each level (e.g. WildCat) | Yes | Possible | Possible | Policies are not as explicit as rules
State machine (Dubey et al.) | Yes | Possible | Not easy | Users have to construct the state machine, which is difficult
11. Outline of the Evidence
Solution: the Hasthi architecture
Useful
◦ Application to a large-scale e-Science project (LEAD)
Sound
◦ Scalable (empirical results)
◦ Robust and dynamic (proof + empirical results)
Main contribution:
“Proposing, implementing, and analyzing a
dynamic and robust management architecture,
which can manage large-scale systems by
enforcing user-defined management logic that
depends on a global view of the managed system
state, and applying the management logic to
manage systems.”
12. Big Picture (Hasthi)
Hasthi has three parts:
Manager Cloud – a distributed architecture that binds the managers
and resources in the system into one cohesive unit.
Meta-Model – represents the system state.
Decision Framework.
13. Manager Cloud
Managers form a P2P network (Pastry), which is used for
initialization and recovery (elections).
Normal operations use SOAP over HTTP.
15. Meta-Model
The meta-model represents the monitoring data collected from the system.
A summarized meta-model provides a global view.
Delta-consistency – changes are reflected within a bounded time (a
concept borrowed from shared-memory multiprocessors [see Singla et
al.]).
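To make the delta-consistency idea concrete, here is a minimal Java sketch of how a manager's partial meta-model might push only changed meta-objects to the coordinator each epoch, so a change reaches the global view within one epoch; all class and method names are illustrative assumptions, not Hasthi's actual API.

    // Hypothetical sketch of delta-consistent meta-model propagation.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class MetaObject {
        final String resourceId;
        final Map<String, Object> properties = new ConcurrentHashMap<>();
        volatile boolean dirty; // true if changed since the last epoch

        MetaObject(String resourceId) { this.resourceId = resourceId; }

        void update(String key, Object value) {
            properties.put(key, value);
            dirty = true; // mark so only changes are pushed upward
        }
    }

    class PartialMetaModel {
        private final Map<String, MetaObject> objects = new ConcurrentHashMap<>();

        void onHeartbeat(String resourceId, Map<String, Object> reported) {
            MetaObject mo = objects.computeIfAbsent(resourceId, MetaObject::new);
            reported.forEach(mo::update);
        }

        // Called once per epoch: ship only dirty objects to the coordinator,
        // so changes appear in the global view within a bounded time.
        void pushDeltas(Coordinator coordinator) {
            for (MetaObject mo : objects.values()) {
                if (mo.dirty) {
                    coordinator.applyDelta(mo.resourceId, mo.properties);
                    mo.dirty = false;
                }
            }
        }
    }

    interface Coordinator {
        void applyDelta(String resourceId, Map<String, Object> properties);
    }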
16. Decision Framework
Users define management logic as rules: local and global.
Manager control loops evaluate partial meta-models using local rules.
The coordinator control loop evaluates the summarized meta-model
using global rules (the global view).
Actions triggered by rules analyze the meta-model and decide on solutions.
17. Management Rules
Rules (Drools) evaluate meta-objects (which represent resources) and
execute actions, which analyze meta-objects and decide on solutions.

rule "RestartFailedServices"
when
    // A crashed service whose host is still alive
    service : ManagedService( state == "CrashedState" )
    host : Host( state != "CrashedState", name == service.host )
then
    // Try a restart; escalate to the user if the action fails
    system.invoke(new RestartAction(service),
        new ActionCallback() {
            public void actionSucessful(ManagementAction action) { ..... }
            public void actionFailed(ManagementAction action, Throwable e) {
                service.setState("UnRepairableState");
                system.invoke(
                    new UserInteractionAction(system, service, action, e));
            }
        });
end

When the condition, given in the object query language, is met, the
actions in the then-clause are carried out.
Hasthi uses the Rete algorithm to evaluate meta-objects and execute
corrective actions – a tradeoff between space and time.
18. Management Actions
Action types:
1. Create a new service.
2. Restart a running service or recover a failed service.
3. Relocate a service.
4. Tune and configure a resource – change the configuration
of a resource or change the structure of the system.
5. User-interaction action.
Action implementations:
◦ Use shell scripts (e.g. service start or stop) and execute
them using a Host Agent running on each host.
◦ Use a Hasthi Agent integrated with each resource.
Hasthi provides default management actions, but
users can write their own, e.g. as sketched below.
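A hedged sketch of what a user-defined action could look like in Java. The ManagementAction and HostAgent interfaces below are assumptions for illustration; the deck only states that actions run shell scripts via a Host Agent on each host.

    // Illustrative only – not Hasthi's exact API.
    interface ManagementAction {
        void execute() throws Exception;
    }

    interface HostAgent {
        // Runs a shell script on the agent's host and returns its exit code.
        int runScript(String script, String... args) throws Exception;
    }

    class RestartServiceAction implements ManagementAction {
        private final HostAgent agent;
        private final String serviceName;

        RestartServiceAction(HostAgent agent, String serviceName) {
            this.agent = agent;
            this.serviceName = serviceName;
        }

        @Override
        public void execute() throws Exception {
            // Delegate the actual restart to a script run by the Host Agent.
            int exit = agent.runScript("restart-service.sh", serviceName);
            if (exit != 0) {
                throw new Exception("Restart of " + serviceName
                        + " failed with exit code " + exit);
            }
        }
    }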
19. Management Complexities
Even with a global view, management can go wrong in many
ways. The following are some complexities and proposed
remedies (see Chapter 7 for details).
1. Failed management actions – Hasthi uses the resource
lifecycle, which sets a resource's state to “Unrecoverable” if an
action fails, and asks for user help.
2. Lost system structure (broken links) – services can use the
“dependency-discovery” operation to find other services.
3. Lost state – Hasthi does not preserve state but helps
resources locate their storage locations (a resource exposes
the location as a property, and Hasthi passes it as an argument
when it recovers the service).
4. Lost messages – retry + session-level checkpoints (see the
sketch after this list).
5. False positives (custom failure detectors) & network partitions.
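A minimal sketch of the "retry + session-level checkpoint" remedy for lost messages (item 4). All names are hypothetical; the slide only names the pattern. Each message carries a session ID and sequence number, and the receiver checkpoints the last sequence processed per session, so retried duplicates are detected and ignored.

    // Illustrative sketch, assuming per-session sequence numbers.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class CheckpointingReceiver {
        // Last processed sequence per session (the "session-level checkpoint").
        private final Map<String, Long> checkpoints = new ConcurrentHashMap<>();

        /** Returns true if newly processed, false if a duplicate retry. */
        boolean deliver(String sessionId, long seq, Runnable handler) {
            long last = checkpoints.getOrDefault(sessionId, -1L);
            if (seq <= last) {
                return false; // duplicate caused by a retry; already processed
            }
            handler.run();                   // process the message
            checkpoints.put(sessionId, seq); // advance the checkpoint
            return true;
        }
    }

    class RetryingSender {
        /** Retries delivery up to maxAttempts times on (simulated) loss. */
        static void send(CheckpointingReceiver receiver, String sessionId,
                         long seq, Runnable handler, int maxAttempts) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    receiver.deliver(sessionId, seq, handler); // idempotent on retry
                    return;
                } catch (RuntimeException lost) {
                    if (attempt == maxAttempts) throw lost;
                }
            }
        }
    }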
20. Application of Hasthi
Find the 10% of errors that happen 90% of the time.
Figure out how to preserve state across changes.
21. LEAD Usecase
LEAD services are stateless or have persistent state, and data storage
is best-effort, so we can recover by restarting services.
Recover from host & service failures – restart the failed services.
Recover workflows – detect when the system has failed and
recovered, and resurrect any failed workflows (e.g. with a rule like the
sketch below).
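A hypothetical rule in the style of the Drools example on slide 17 could express the workflow-recovery use case; the Workflow meta-object and RerunWorkflowAction below are assumptions for illustration, not part of LEAD or Hasthi as presented.

    rule "ResurrectFailedWorkflows"
    when
        // A workflow that failed earlier...
        workflow : Workflow( state == "FailedState" )
        // ...and no service is still crashed, i.e. the system has recovered
        not ManagedService( state == "CrashedState" )
    then
        system.invoke(new RerunWorkflowAction(workflow));
    end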
22. Scalability: Test Setup
Main test setup:
A large-scale deployment of LEAD, with multiple replicas of the
complete LEAD stack. Each service simulates a management
workload using a randomized algorithm. A set of rules manages the
system, and each test ran for 1 hour with a 30-second epoch time.
Coordinator test setup:
A Test-Manager simulates all the messages generated by a normal
manager managing a set of resources. We simulated a large-scale
system using Test-Managers; the coordinator does not see a
difference.
24. One-Manager Overhead (Resource Heartbeat Latency,
Manager Loop Overhead, Manager Heartbeat Latency);
Managers' Overhead (Coordinator Loop, Manager Heartbeat)
One manager scales to 5000–8000 resources, and Hasthi scales further
with added managers. More tests are needed to find the limits.
25. Coordinator Limit: (Manager Heartbeat Latency,
Coordinator Loop Overhead) vs. Resource Count
The overhead is close to linear; the coordinator scales to 100,000
resources and 1000 managers, and the number of managers does not
make much difference.
Why? (1) Summarization, (2) only changes are transferred, (3) the Rete
algorithm, which only evaluates changes (a tradeoff between speed and
memory).
26. Manager Independence: (Resource Heartbeat, Manager
Loop, Manager Heartbeat) vs. Resources per Manager
We measured the limit of a manager and the limit of the coordinator.
Hypothesis: a manager's overhead depends only on the resources
assigned to it, not on other managers or resources in the system
=> we can scale up Hasthi (e.g. 100 managers, 1000 resources each).
Verifying the hypothesis:
A scatter plot of overhead vs. the number of resources per manager:
points with the same X values are reasonably close to each other.
The hypothesis is valid up to at least 2000 resources.
Why? Managers do not usually interact with other managers; they talk
to the coordinator.
27. Scalability: Summary
1. One manager scales to 5000–8000 resources.
2. Managers depend only on the resources assigned to
them (at least up to 2000 resources) and are not
affected by other managers in the system.
3. The coordinator scales to 100,000 resources and 1000
managers (100–1000 resources per manager, below the 2000
limit in #2).
The system scales to 100,000 resources.
28. Robustness: Correctness Proof
Self-stabilization = the system reaches a safe state regardless of the initial
state and continues to be in that state.
We proved (in Chapter 5) that, given a system managed with Hasthi, there
exists a constant h for that system such that Hasthi self-stabilizes if
managers do not join or leave and communication failures do not happen
for a continuous time interval of length h.
Proof outline: we took all states and proved that for any state there is a
forced sequence that recovers the system within a bounded time.
29. Availability of Hasthi
Availability = MTTF / (MTTF + MTTR). (1)
The proof provides the recovery time. Let us use it to calculate
availability as a function of the MTTF of a single manager.
Assume a system managed with n independent managers, each
manager having an MTTF (Mean Time To Failure) of θ.
Then:
◦ Managers are independent => we can use an exponential distribution
to model their failures (Srinivasan [143]).
◦ Then p, the probability that no failure happens within one unit of
time (a second), is p = e^(−n/θ) (Srinivasan [143]). (2)
◦ The MTTF of Hasthi is θ/n (according to Baumann [108]). (3)
30.
Definition: NF(r) = the time elapsed until the first continuous
time interval of length r with no failures.
Then h_c = E[NF(r)].
E[NF(r)] is the same as the expected number of tosses for r
consecutive HEADS to occur with a biased coin
with probability p of a HEAD.
It has been shown that E[NF(r)] = (1 − p^r) / ((1 − p) p^r). (4)
Using (2) and (4), we can calculate h_c = E[NF(r)].
31.
A similar result gives the time to recover from manager failures:
h_m = E[NF(m)].
We have 1 coordinator and n−1 managers; therefore
MTTR = (h_c + (n − 1) h_m) / n. (5)
Therefore, using h_m and h_c, we can find MTTR.
We know both MTTR (by Equation 5) and MTTF (by
Equation 3); therefore, we know availability = MTTF / (MTTF
+ MTTR) as a function of θ (the MTTF of one manager).
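Putting the pieces together, a worked combination of Equations (1)–(5) in LaTeX; note that the weighted-average form of (5) is my reading of “1 coordinator and n−1 managers”, not a formula stated on the slides.

    \begin{align*}
    p &= e^{-n/\theta} && \text{(2): no failure in one second} \\
    \mathrm{MTTF} &= \theta / n && \text{(3)} \\
    E[NF(r)] &= \frac{1 - p^{r}}{(1 - p)\, p^{r}} && \text{(4)} \\
    \mathrm{MTTR} &= \frac{h_c + (n-1)\, h_m}{n},
        \quad h_c = E[NF(r)],\; h_m = E[NF(m)] && \text{(5)} \\
    \mathrm{Availability} &\ge \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
        = \frac{\theta/n}{\theta/n + \mathrm{MTTR}} && \text{(1)}
    \end{align*}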
32. Parameters
θ = the MTTF of a manager.
r, m = the continuous time intervals defined by the proof.
n = the number of managers in the system.
Since our proof provides an upper bound on the
recovery time, the result is a lower bound on availability.
33. Availability vs. Manager MTTF
[Figure: availability as a function of manager MTTF, annotated with the
availability classes defined by Gray et al.: Managed Systems (83 hours of
downtime/year), Well-Managed Systems (9 hours of downtime/year), and
Fault-Tolerant Systems (1 hour of downtime/year).]
34. Robustness: Empirical Results
We instrumented Hasthi to generate events about its status, added a new
manager, killed the current coordinator, and measured the time to detect
the failure, to recover Hasthi, and to rebuild the meta-model.
We ran the test 100 times. Detection time decreases (O(1/n)), election time
increases (O(log n)), recovery time increases, and the overall time decreases!!
Recovery time is about 80 seconds.
35. Availability of the Managed System
With LEAD, recovery took about 2 minutes (60 + 20 + 30 seconds).
When we calculated it, the availability of LEAD with Hasthi is
0.995–0.999, which is about 40–10 hours of downtime per year.
36. Implications of Our Results
With a global view of the system, users can author
management logic the same way they reason about
the system (easy and intuitive).
There is a tradeoff between scalability and explicit
management logic, but Hasthi covers most use cases
while supporting explicit user-defined management
logic.
When building generic management frameworks, it
is possible to enforce user-defined global and local
management logic in most real-world use cases.
37. Contributions
Problem: How do we enforce user-defined management logic (that
depends on a global view of the managed system) on
large-scale systems? And how do we apply such a
framework to manage systems?
Proposed an architecture to solve this problem (the “Manager-Cloud
algorithm” + monitoring information as a meta-model of the
system that exhibits delta-consistency + the decision framework).
Proved its robustness analytically and verified it empirically.
Implemented the architecture and empirically demonstrated that
it can scale to manage most real-world use cases.
A demonstration that, despite its dependency on a global view, a
management framework can scale to manage most real-world
use cases.
Analyzed applications of user-defined management logic to
manage systems, proposed solutions to the management
complexities that arise from these applications, and applied it to
manage a large-scale e-science project.
39. Future Work
Graphical composition of management logic, to
simplify management-logic authoring.
Building a distributed service container on top of
Hasthi.
Making the coordinator lightweight, to try to
increase the scalability limit of Hasthi.
Further exploring applications of management
frameworks.
41. Sensitivity: Rules
To find the sensitivity to rules, we used 7 rule sets, each having more
rules than the one before, with 40,000 resources.
The overhead is almost linear and appears stable. We also
verified this by running 100,000 resources against the most
complex rule set.
42. Sensitivity: Epoch Time
Epoch times are the time periods between heartbeats, control-loop
evaluations, etc., and they decide how fast Hasthi reacts
to failures.
Why does the overhead decrease with a smaller epoch? The Rete
algorithm remembers old results and only evaluates changes. A smaller
epoch means fewer changes per evaluation, which means less overhead!!
43. Sensitivity: Workload
We increased the failures in the system (increasing the workload on
Hasthi) and measured with 40,000 resources.
Hasthi is stable. Why? Hasthi uses a job queue to execute
actions asynchronously (sketched below), so it can withstand higher
workloads and surges.
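A minimal sketch of the asynchronous action queue described above, using a bounded queue and a worker pool; the class name, queue size, and pool size are illustrative assumptions, not Hasthi's actual implementation.

    // Illustrative sketch of an asynchronous action queue.
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ActionQueue {
        private final ArrayBlockingQueue<Runnable> pending =
                new ArrayBlockingQueue<>(10_000); // absorbs surges of actions
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        ActionQueue() {
            // A single dispatcher drains the queue so the control loop never
            // blocks on slow actions; workers execute actions in parallel.
            Thread dispatcher = new Thread(() -> {
                try {
                    while (true) {
                        workers.submit(pending.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            dispatcher.setDaemon(true);
            dispatcher.start();
        }

        // Called by the rule engine: enqueue and return immediately.
        boolean submit(Runnable action) {
            return pending.offer(action); // false if saturated, never blocks
        }
    }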
44. Useful: LEAD Integration
We integrated Hasthi with LEAD. Hasthi recovers LEAD from
service and host failures and recovers failed workflows.
We (a) killed a service and (b) killed a host, and measured the time
to detect the failure, trigger actions, have new resources join, and
detect healthy conditions. It takes about 2 minutes to recover the
system and to know it is healthy.
45. Comparison with Gadgil et al.
CGLM evaluates each resource in parallel; Hasthi does
it as a batch.
Hasthi creates an HTTP connection every time, whereas
CGLM uses a pool of connections.
51. Overhead on a Host in a Test Setup
Even with 200 services, the host transferred 0.04 MB/s
out of a possible 1 Gb/s of bandwidth (< 1%) and had a 0.02
load average out of 2.0 (< 2%).
Editor's Notes
IT has become an indispensable part of our life. Millions of users => growing user bases => a big deal for Google or Amazon. Success depends on the ability to make sense of data => the killer app of our time is search, “large-scale search”.
I said it is possible to build large-scale systems; keeping them running is a different story!! Most of us have come into contact with large-scale systems and know what it takes. Changes -> operational cost -> unreliable middleware. Many solutions -> quote Patterson -> system management as a potential solution.
The role of system management is more like a human manager: watch over and control the system. We call the system being managed a “managed system”. I focus on Monitor -> Decide -> Execute, NOT on specifications, but rather on how to use the exposed data.
Management use cases differ from system to system. Google can afford to build its own management framework, but medium and small organizations, which we said are going to have large-scale systems, can't. Just as we choose a WS middleware, they need to go and pick a management framework and configure it to manage their system.
Centralized managers avoid the problem.
ANDREA – creates a dynamic hierarchy to resolve issues.
This is the outline.
Resources, managers, a special manager (the Coordinator) elected among them, and bootstrap nodes (the entry point, not shown in the figure). Heartbeats: resources => manager, managers => coordinator. Coordinator failed? Manager failed? Resource failed?