This document discusses challenges in deploying distributed applications on cloud infrastructure and proposes an autonomic framework called Scarce to address them. It notes that applications' components may be unevenly distributed across virtual machines with varying performance. Scarce uses autonomous agents that monitor components and make placement decisions. It employs an economic model where servers charge components rent based on resource usage, and components aim to maximize their balance of earnings and costs through actions like replication and migration. The framework also propagates service level agreements from parent to child components and automatically provisions resources to ensure performance guarantees are met under varying load. Evaluation results demonstrate its ability to adapt to changing loads and failures while maintaining scalability.
Autonomic SLA-driven Provisioning for Cloud Applications
1. Autonomic SLA-driven Provisioning for Cloud Applications
Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer
CCGRID 2011, May 23-26 2011, Newport Beach, CA, USA
nicolas.bonvin@epfl.ch
LSIR - EPFL
2. Cloud Apps – Issue #1 : Placement
● A distributed, component-based application running on an elastic infrastructure
[Figure: application components C1, C2, C3 and C4]
2 EPFL – LSIR - Nicolas Bonvin
5. Cloud Apps – Issue #1 : Placement
● A distributed, component-based application running on an elastic infrastructure
● Performance of C1, C2 and C3 is probably lower than that of C4
● No information about other VMs colocated on the same server!
[Figure: components C1 to C4 deployed on VM1, VM2 and VM3, which run on Server 1 and Server 2]
No control over placement
6. Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
[Figure: one C1 replica on each VM; response times VM1: 100 ms, VM2: 100 ms, VM3: 100 ms, VM4: 100 ms]
7. Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
– VM performance can vary by up to a factor of 4! [Dej2009]
● Physical server, Hypervisor, Storage, ...
[Figure: one C1 replica on each VM; response times VM1: 100 ms, VM2: 140 ms, VM3: 100 ms, VM4: 100 ms]
8. Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
– VM performance can vary by up to a factor of 4! [Dej2009]
● Physical server, Hypervisor, Storage, ...
● Component overloaded
[Figure: one C1 replica on each VM; response times VM1: 130 ms, VM2: 140 ms, VM3: 100 ms, VM4: 100 ms]
9. Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
– VM performance can vary by up to a factor of 4! [Dej2009]
● Physical server, Hypervisor, Storage, ...
● Component overloaded
● Component bug, crash, deadlock, ...
[Figure: one C1 replica on each VM; response times VM1: 130 ms, VM2: 140 ms, VM3: 100 ms, VM4: infinity]
11. Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
– VM performance can vary by up to a factor of 4! [Dej2009]
● Physical server, Hypervisor, Storage, ...
● Component overloaded
● Component bug, crash, deadlock, ...
● Failure of C1 on VM4 -> load is rebalanced
[Figure: one C1 replica on each VM; response times VM1: 140 ms, VM2: 150 ms, VM3: 130 ms, VM4: infinity]
The application should react early!
12. Cloud Apps – Overview
● Build for failure
– Do not trust the underlying infrastructure
– Do not trust your components either!
● Components should adapt to changing conditions
– Quickly
– Automatically
– e.g. by replacing a wonky VM with a new one
14. Architecture Overview
● An agent on each server / VM
– Starts, stops and monitors the components
– Makes decisions on behalf of the components
● Each agent communicates with the other agents
– Routing table
– Status of the server (resource usage)
[Figure: each server runs an agent next to its components (A, B, E); agents exchange information by gossiping and broadcast]
16. An economic approach
● Time is split into epochs (no synchronization between servers)
● Servers charge a virtual rent for hosting a component according to
– Current resource usage (I/O, CPU, ...) of the server
– Technical factors (HW, connectivity, ...)
– Non-technical factors (country stability, ...)
● Components
– Pay virtual rent at each epoch
– Gain virtual money by processing requests
– Make decisions based on their balance (= gain – rent)
● Replicate, migrate, suicide, stay
● Virtual rents are updated by gossiping (no centralized board)
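The per-epoch behaviour of a component can be sketched as follows. This is only an illustration of the balance rule (gain minus rent) and the four actions named on the slide: the thresholds `replicate_above` and `min_replicas`, the `Component` class and the exact decision order are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REPLICATE = "replicate"
    MIGRATE = "migrate"
    SUICIDE = "suicide"
    STAY = "stay"


@dataclass
class Component:
    name: str
    replicas: int          # number of live replicas of this component
    balance: float = 0.0   # accumulated virtual money


def end_of_epoch(component, gain, rent, cheapest_rent_elsewhere,
                 replicate_above=10.0, min_replicas=2):
    """Update the balance with (gain - rent) and pick one of the four actions.

    The thresholds and the ordering of the rules are illustrative assumptions.
    """
    component.balance += gain - rent
    if component.balance >= replicate_above:
        return Action.REPLICATE      # earning well: demand may justify a new replica
    if component.balance < 0:
        if cheapest_rent_elsewhere < rent:
            return Action.MIGRATE    # losing money here and a cheaper server exists
        if component.replicas > min_replicas:
            return Action.SUICIDE    # losing money and still redundant
    return Action.STAY
```

Because each agent evaluates this rule locally at the end of its own epoch, no synchronization between servers is needed.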
17. Economic model (i)
● The rent of a server is different for each component!
18. Economic model (ii)
[Figure: VM1 uses CPU 70%, I/O 20%; VM2 uses CPU 25%, I/O 65%; component C1, which uses CPU 30%, I/O 5%, must choose between them]
● VM1 and VM2 have an « identical » resource usage: 45%
● Server rent = the server's resource usage weighted by the component's weights
– Rent for C1 @ VM1 > rent for C1 @ VM2
Multiplexing of server resources
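The numbers on this slide can be worked through with a small sketch. It assumes that the component's weights are its own resource usage normalized to sum to 1; the slide only says the rent is the server's usage "with component's weights", so this normalization and the `rent` function are illustrative assumptions.

```python
def rent(server_usage, component_usage):
    """Server resource usage weighted by the component's resource profile.

    Weights are the component's usage normalized to sum to 1 (an assumption).
    """
    total = sum(component_usage.values())
    weights = {res: use / total for res, use in component_usage.items()}
    return sum(weights[res] * server_usage[res] for res in weights)


vm1 = {"cpu": 70.0, "io": 20.0}   # average usage: 45%
vm2 = {"cpu": 25.0, "io": 65.0}   # average usage: 45%
c1 = {"cpu": 30.0, "io": 5.0}     # C1 is mostly CPU-bound

rent_vm1 = rent(vm1, c1)  # about 62.9
rent_vm2 = rent(vm2, c1)  # about 30.7
```

Under this assumption the CPU-bound C1 sees the CPU-loaded VM1 as roughly twice as expensive as VM2, even though both VMs have the same average usage.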
19. Economic model (iii)
● Choosing a candidate server j during replication/migration of a component i
– net benefit maximization
● 2 optimization goals:
– high availability through geographical diversity of replicas
– low latency by grouping related components
● gj : weight related to the proximity of the server location to the geographical distribution of the client requests to the component
● Si : the set of servers hosting a replica of component i
20. SLA Performance Guarantees (i)
● Each component has its own SLA constraints
● SLA derived directly from entry components
[Figure: dependency graph of components C1 to C5; the entry component carries an SLA of 500 ms]
● Resp. Time = Service Time + max (Resp. Time of Dependencies)
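The response-time rule above is recursive, which a short sketch makes explicit. The dependency edges and the service times below are made-up values for illustration; the slide's figure does not preserve which component depends on which.

```python
def response_time(component, service_time, dependencies):
    """Resp. time = own service time + slowest dependency's resp. time."""
    deps = dependencies.get(component, [])
    slowest = max(
        (response_time(d, service_time, dependencies) for d in deps),
        default=0.0,  # a leaf component has no dependencies
    )
    return service_time[component] + slowest


# Hypothetical graph: C1 calls C2 and C3, C2 calls C4, C3 calls C5
dependencies = {"C1": ["C2", "C3"], "C2": ["C4"], "C3": ["C5"]}
# Hypothetical service times in milliseconds
service_time = {"C1": 50.0, "C2": 80.0, "C3": 120.0, "C4": 150.0, "C5": 90.0}

total = response_time("C1", service_time, dependencies)  # 280.0
meets_sla = total <= 500.0                               # True
```

The `max` reflects the assumption built into the slide's formula that dependencies are called in parallel, so only the slowest branch contributes.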
21. SLA Performance Guarantees (ii)
● SLA propagation from parents to children
● Parent j sends its performance constraints (e.g. response time upper bound) to its dependencies D(j) :
● Child i computes its own performance constraints :
● : group of constraints sent by the replicas of the parent g
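The formulas on this slide are not preserved in the text, so the following is only a plausible sketch of parent-to-child propagation, not the authors' definition. It assumes that a parent passes on its own response-time bound minus its service time (which follows from Resp. Time = Service Time + max of the dependencies), and that a child keeps the tightest constraint it receives from the parent's replicas. Both function names are hypothetical.

```python
def constraint_for_children(parent_bound_ms, parent_service_ms):
    """Budget left for each dependency once the parent's own service time is spent."""
    return parent_bound_ms - parent_service_ms


def child_bound(received_constraints_ms):
    """Assumed rule: the child adopts the tightest constraint it received."""
    return min(received_constraints_ms)


# Two replicas of the same parent, both bound to 500 ms,
# with service times of 50 ms and 80 ms (made-up values)
sent = [
    constraint_for_children(500.0, 50.0),
    constraint_for_children(500.0, 80.0),
]
bound = child_bound(sent)  # 420.0
```

Taking the minimum is the conservative choice: a child that satisfies the tightest constraint satisfies every replica of its parent.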
22. SLA Performance Guarantees (iii)
● SLA propagation from parents to children
23. Automatic Provisioning
● Usage of allocated resources is maximized:
– autonomic migration / replication / suicide of components
– but this alone is not enough to guarantee end-to-end response time
● Cloud resources are managed by the framework via the cloud API
● Each individual component has to satisfy its own SLA
– SLA easily met -> decrease resources (scale down)
– SLA not met -> increase resources (scale up, scale out)
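The scale-down / scale-up rule can be sketched as a simple threshold check. The `headroom` parameter, which defines what "easily met" means, is a hypothetical value chosen for illustration, and the function only returns a decision; the actual calls to a cloud API are left out.

```python
def provisioning_decision(observed_ms, sla_ms, headroom=0.5):
    """Compare a component's observed response time with its SLA bound.

    "Easily met" is taken to mean below headroom * sla_ms (an assumption).
    """
    if observed_ms > sla_ms:
        return "scale_up_or_out"   # SLA not met: increase resources
    if observed_ms < headroom * sla_ms:
        return "scale_down"        # SLA easily met: decrease resources
    return "keep"                  # SLA met without much slack


decisions = [
    provisioning_decision(600.0, 500.0),  # "scale_up_or_out"
    provisioning_decision(100.0, 500.0),  # "scale_down"
    provisioning_decision(400.0, 500.0),  # "keep"
]
```

Leaving a band between the two thresholds is one way to avoid oscillating between scaling up and scaling down under a stable load.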
24. Adaptivity to slow servers
● Each component keeps statistics about its children
– e.g. 95th percentile response time
● A routing coefficient is computed for each child at each epoch
– Send more requests to better-performing children
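One simple way to turn per-child statistics into routing coefficients is to weight each child by the inverse of its 95th percentile response time. The slide does not give the actual formula, so the inverse-proportional rule and the `routing_coefficients` function are assumptions made for illustration.

```python
def routing_coefficients(p95_ms):
    """Coefficients inversely proportional to each child's p95 response time.

    Expects strictly positive response times; the coefficients sum to 1.
    """
    inverse = {child: 1.0 / t for child, t in p95_ms.items()}
    total = sum(inverse.values())
    return {child: value / total for child, value in inverse.items()}


# Made-up p95 response times in milliseconds
coeffs = routing_coefficients({"VM1": 100.0, "VM2": 200.0, "VM3": 100.0})
# VM1 and VM3 each receive about 40% of the requests, VM2 about 20%

# A child that never answers (infinite response time) gets coefficient 0
with_failure = routing_coefficients({"VM1": 100.0, "VM4": float("inf")})
```

Recomputing the coefficients at every epoch lets a parent shift traffic away from a slow or failed child without waiting for it to be replaced.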