Exploration of CloudStack's distributed process management requirements and the challenges they present in the context of CAP theorem. These challenges will be addressed through a distributed process model that emphasizes efficiency, fault tolerance, and operational transparency.
How to Run from a Zombie: CloudStack Distributed Process Management
1. HOW TO RUN FROM A ZOMBIE: CLOUDSTACK DISTRIBUTED PROCESS MANAGEMENT
John Burwell
(jburwell@apache.org | jburwell@basho.com
@john_burwell)
Tuesday, June 25, 13
2. I Am Not A Zombie
• Apache CloudStack PMC Member
• Consulting Engineer @ Basho Technologies
• Ran operations and designed automated provisioning for hybrid
analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS
physical security platform
3. Current Process Management
• No consistent system-wide model
• Fail slowly, fail quietly
• Resource overcommitment issues
• Lack of instrumentation
33. Orchestration Coordination
1. Build a list of commands to be executed against a resource
2. Enqueue the list of commands to the resource management
layer for execution
3. A process applies the commands to the resource
4. Aggregate the results from the reply
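The four steps above can be sketched in Java as follows. This is a minimal illustration, not CloudStack code: the class name `ResourceManagementLayer`, the `enqueue` method, and the modelling of a command as a `Function<String, String>` are all assumptions made for the example. A single-threaded executor per resource stands in for the per-resource queue and process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical resource management layer: one queue and one worker per resource. */
final class ResourceManagementLayer {
    // A single-threaded executor serializes all work against one resource.
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();

    /**
     * Step 2: enqueue the command list for a resource.
     * Step 3: the resource's worker applies the commands in order.
     * Step 4: the returned future carries the per-command results back to the caller.
     */
    CompletableFuture<List<String>> enqueue(String resourceId,
                                            List<Function<String, String>> commands) {
        ExecutorService worker = workers.computeIfAbsent(
                resourceId, id -> Executors.newSingleThreadExecutor());
        return CompletableFuture.supplyAsync(() -> {
            List<String> results = new ArrayList<>();
            for (Function<String, String> command : commands) {
                results.add(command.apply(resourceId));
            }
            return results;
        }, worker);
    }

    /** Stops all workers once previously enqueued work has finished. */
    void shutdown() {
        workers.values().forEach(ExecutorService::shutdown);
    }
}
```

The orchestration layer would build the command list (step 1), call `enqueue`, and read the aggregated results from the future when the reply arrives.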
37. Unit Of Work (UoW)
• Definition: An ordered list of commands executed against one
and only one resource.
• Created in the Orchestration layer
• Executed by processes in the resource management layer
• Failure of a command halts UoW execution
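A Unit of Work as defined above might look like the following sketch. The `Command` interface and the `UnitOfWork` class are hypothetical names chosen for illustration; the deck does not specify an API. The point shown is that commands run in order against a single resource and the first failure halts execution.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical command: returns true on success, false on failure. */
interface Command {
    boolean apply(String resourceId);
}

/** An ordered list of commands bound to exactly one resource. */
final class UnitOfWork {
    private final String resourceId;
    private final List<Command> commands;

    UnitOfWork(String resourceId, List<Command> commands) {
        this.resourceId = resourceId;
        this.commands = new ArrayList<>(commands); // defensive copy keeps the order fixed
    }

    /**
     * Runs the commands in order. Returns how many succeeded before
     * a failure halted execution (equal to size() when all succeed).
     */
    int execute() {
        int completed = 0;
        for (Command command : commands) {
            if (!command.apply(resourceId)) {
                break; // failure of a command halts the UoW
            }
            completed++;
        }
        return completed;
    }

    int size() {
        return commands.size();
    }
}
```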
38. Instrumentation
• Collect and report statistics on a per-resource basis
• Inspect and remove pending UoWs for a resource
• Kill a running process
• View a history of UoWs completed by a resource
39. When Gravity Fails
• Process execution fails
• Resources become unavailable
• Slow consumers
40. Fail Fast; Fail Loudly
• If the resource can be returned to a consistent state, reply with
the process failure
• If the resource cannot be returned to a consistent state,
transition the resource to a failure state, drain the queue of
pending UoWs, and reply with the process failure for each UoW
• The orchestration layer will determine the appropriate recovery
strategy (e.g. retry request on another resource)
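The two failure paths above can be sketched as below. `FailFastResource`, its `State` enum, and the `"FAILED:<id>"` reply strings are assumptions made for the example; a real implementation would reply with richer failure objects. UoWs are represented only by their ids to keep the sketch short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical per-resource state holder illustrating fail fast, fail loudly. */
final class FailFastResource {
    enum State { AVAILABLE, FAILED }

    private State state = State.AVAILABLE;
    private final Queue<String> pending = new ArrayDeque<>(); // ids of pending UoWs

    void enqueue(String uowId) {
        pending.add(uowId);
    }

    State state() {
        return state;
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Handles a process failure for the UoW that was executing.
     * Always replies with the failure of that UoW. If the resource could not
     * be returned to a consistent state, it is also moved to FAILED and every
     * pending UoW is drained and failed.
     */
    List<String> onProcessFailure(String failedUowId, boolean consistentStateRestored) {
        List<String> replies = new ArrayList<>();
        replies.add("FAILED:" + failedUowId);
        if (!consistentStateRestored) {
            state = State.FAILED;
            String uowId;
            while ((uowId = pending.poll()) != null) {
                replies.add("FAILED:" + uowId);
            }
        }
        return replies;
    }
}
```

The replies go back to the orchestration layer, which decides on the recovery strategy.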
41. Preventing A Logjam
• Bounded Queues
• Request and Message Timeouts
• A failure to enqueue a request or a request timeout triggers the
resource’s circuit breaker
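A bounded queue with an enqueue timeout and a simple circuit breaker could be combined roughly as follows. `GuardedQueue` and its methods are hypothetical names; the breaker here is a deliberately simple open/closed flag with a manual `reset`, where a production breaker would normally also have a half-open probing state.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical bounded request queue guarded by a circuit breaker. */
final class GuardedQueue {
    private final BlockingQueue<String> queue;
    private final long offerTimeoutMillis;
    private volatile boolean circuitOpen = false;

    GuardedQueue(int capacity, long offerTimeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity); // bounded: cannot grow without limit
        this.offerTimeoutMillis = offerTimeoutMillis;
    }

    /**
     * Tries to enqueue a request. Returns false immediately while the circuit
     * is open. If the queue stays full for the whole timeout, the circuit is
     * opened so later callers fail fast instead of piling up.
     */
    boolean enqueue(String request) {
        if (circuitOpen) {
            return false;
        }
        boolean accepted;
        try {
            accepted = queue.offer(request, offerTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            accepted = false;
        }
        if (!accepted) {
            circuitOpen = true;
        }
        return accepted;
    }

    boolean isCircuitOpen() {
        return circuitOpen;
    }

    /** Closes the circuit again, e.g. after an operator confirms the resource is healthy. */
    void reset() {
        circuitOpen = false;
    }
}
```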
42. How could we implement this model?
43. Lightweight Threads
A thread that is not scheduled by the
operating system -- avoiding context
switch overhead.
44. Actor Model
• An actor represents state and behavior
• Communicate by message passing
• Each actor is allocated a lightweight thread and mailbox
• Location independent
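To make the actor idea concrete, here is a toy actor written with only the JDK. It is an assumption-laden sketch, not Akka or Quasar code: `CounterActor` is a made-up class, and an ordinary single-threaded executor stands in for the lightweight thread, with the executor's internal FIFO queue playing the role of the mailbox.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy actor: private state that is only ever touched by its own mailbox thread. */
final class CounterActor {
    // The executor's FIFO work queue acts as the mailbox.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private int count = 0; // actor state, never accessed from outside the mailbox thread

    /** Fire-and-forget message: increment the counter. */
    void tellIncrement() {
        mailbox.execute(() -> count++);
    }

    /** Request-reply message: the reply is delivered through the returned future. */
    CompletableFuture<Integer> askCount() {
        return CompletableFuture.supplyAsync(() -> count, mailbox);
    }

    /** Stops the actor after the messages already in the mailbox are processed. */
    void stop() {
        mailbox.shutdown();
    }
}
```

Because messages are processed one at a time in arrival order, the state needs no locks. Callers interact only through messages, which is what makes it possible to move an actor to another node without changing them.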
48. Java Actor Frameworks
• Akka (http://akka.io)
• Quasar (https://github.com/puniverse/quasar)
49. Summary
• Orchestration and Resource Management must be properly
divided to satisfy CAP
• To provide resource serialization guarantees, assign a queue
and a process to each resource
• Fail fast, fail loudly
• An Actor Model based on lightweight threads may provide the
scalability required to dedicate a queue and process per
resource