How to Run from a Zombie: CloudStack Distributed Process Management

1. HOW TO RUN FROM A ZOMBIE: CLOUDSTACK DISTRIBUTED PROCESS MANAGEMENT John Burwell (jburwell@apache.org | jburwell@basho.com @john_burwell) Tuesday, June 25, 13

2. I Am Not A Zombie • Apache CloudStack PMC Member • Consulting Engineer @ Basho Technologies • Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds • Led architectural design and server-side development of a SaaS physical security platform

3. Current Process Management • No consistent system-wide model • Fail slowly, fail quietly • Resource overcommitment issues • Lack of instrumentation

4. What is a cloud?


6. Hopefully not ...

10. [Diagram labels] Hosts, Virtual Routers, Virtual Machines, Primary Storage, Networks, Secondary Storage, Load Balancers; Zone, Cluster, Pod

12. Resource: a “thing” with bounded capacity. [Other diagram labels] Process, State, Partition, Orchestration

18. At its core, CloudStack ... • Integrates infrastructure components • Manages resources

20. Consistency, Availability, Partition Tolerance: PICK 2

22. CloudStack provides zones, clusters, and pods to partition resources.

23. Orchestration operations are eventually consistent

25. ... but resource operations must be consistent and serialized.

27. A system cannot be simultaneously consistent and available.

28. Orchestration Processes: AP • Resource Management Processes: CP

32. CP Resource? • Ordered/Serialized operations • Prevent overcommitment • Execution location independent • Lock free

33. Orchestration Coordination 1. Build a list of commands to be executed against a resource 2. Enqueue the list of commands to the resource management layer for execution 3. A process applies the commands to the resource 4. Aggregate the results from the reply
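The four coordination steps above can be sketched in plain Java. This is a minimal illustration, not CloudStack code: `ResourceManager`, `OrchestrationDemo`, the resource id `host-1`, and the command results are all hypothetical names, and a single-threaded `ExecutorService` per resource stands in for the resource's queue and process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical resource management layer: one queue and one worker per resource. */
final class ResourceManager {
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();

    /** Step 2: enqueue the command list on the resource's own serialized queue. */
    Future<List<String>> enqueue(String resourceId, List<Callable<String>> commands) {
        ExecutorService worker =
            workers.computeIfAbsent(resourceId, id -> Executors.newSingleThreadExecutor());
        // Step 3: the resource's single worker thread applies the commands in order.
        return worker.submit(() -> {
            List<String> replies = new ArrayList<>();
            for (Callable<String> command : commands) {
                replies.add(command.call()); // an exception here halts the remaining commands
            }
            return replies;
        });
    }

    /** Stops all worker threads so the JVM can exit. */
    void shutdown() {
        for (ExecutorService worker : workers.values()) {
            worker.shutdown();
        }
    }
}

/** Hypothetical orchestration-side caller. */
final class OrchestrationDemo {
    static String provision(ResourceManager manager) {
        // Step 1: build the ordered list of commands for one resource.
        List<Callable<String>> commands = new ArrayList<>();
        commands.add(() -> "volume-created");
        commands.add(() -> "vm-started");
        Future<List<String>> reply = manager.enqueue("host-1", commands);
        try {
            // Step 4: aggregate the results from the reply, with a bounded wait.
            return String.join(",", reply.get(5, TimeUnit.SECONDS));
        } catch (Exception e) {
            throw new IllegalStateException("unit of work failed or timed out", e);
        }
    }

    public static void main(String[] args) {
        ResourceManager manager = new ResourceManager();
        try {
            System.out.println(provision(manager));
        } finally {
            manager.shutdown();
        }
    }
}
```

Because each resource has exactly one worker thread, two command lists sent to the same resource never interleave, while lists sent to different resources run independently.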

34. [Diagram labels] Resource (Process, State) 1:1 Queue (Unit of Work) 1:1 Exclusive Consumer

37. Unit Of Work (UoW) • Definition: An ordered list of commands executed against one and only one resource. • Created in the Orchestration layer • Executed by processes in the resource management layer • Failure of a command halts UoW execution
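A unit of work as defined above might look like the following sketch. The `Command`, `UowResult`, and `UnitOfWork` types are hypothetical and exist only to show the two properties from the slide: the command list is ordered and bound to one resource, and the first failing command halts execution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical command type: returns true on success, false on failure. */
interface Command {
    boolean execute();
}

/** Outcome of running a unit of work. */
final class UowResult {
    final int completed;     // number of commands that succeeded before any halt
    final boolean succeeded; // true only if every command succeeded

    UowResult(int completed, boolean succeeded) {
        this.completed = completed;
        this.succeeded = succeeded;
    }
}

/** An ordered list of commands bound to exactly one resource. */
final class UnitOfWork {
    private final String resourceId;
    private final List<Command> commands;

    UnitOfWork(String resourceId, List<Command> commands) {
        this.resourceId = resourceId;
        this.commands = Collections.unmodifiableList(new ArrayList<>(commands));
    }

    String resourceId() {
        return resourceId;
    }

    /** Runs the commands in order; the first failure halts execution. */
    UowResult execute() {
        int completed = 0;
        for (Command command : commands) {
            boolean ok;
            try {
                ok = command.execute();
            } catch (RuntimeException e) {
                ok = false; // an exception counts as a command failure
            }
            if (!ok) {
                return new UowResult(completed, false);
            }
            completed++;
        }
        return new UowResult(completed, true);
    }
}

final class UnitOfWorkDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        UnitOfWork uow = new UnitOfWork("host-1", Arrays.asList(
            () -> log.add("allocate"), // succeeds
            () -> false,               // fails, so the next command never runs
            () -> log.add("start")));
        UowResult result = uow.execute();
        System.out.println(result.completed + " " + result.succeeded + " " + log);
    }
}
```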

38. Instrumentation • Collect and report statistics on a per resource basis • Inspect and remove pending UoWs for a resource • Kill a running process • View a history of UoWs completed by a resource

39. When Gravity Fails • Process execution fails • Resources become unavailable • Slow consumers

40. Fail Fast; Fail Loudly • If the resource can be returned to a consistent state, reply with the process failure • If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW • The orchestration layer will determine the appropriate recovery strategy (e.g. retry the request on another resource)
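The two failure rules above can be sketched as a small single-threaded state machine. Everything here is hypothetical: `FailFastResource`, its `Uow` type, the reply strings, and the `rollback` hook (which reports whether a consistent state was restored) are illustrative names. Rejecting new work once the resource has failed is an added assumption in the same fail-fast spirit, not something stated on the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.BooleanSupplier;

final class FailFastResource {
    enum State { AVAILABLE, FAILED }

    /** Hypothetical unit of work: an id plus a body that throws on failure. */
    static final class Uow {
        final String id;
        final Runnable body;

        Uow(String id, Runnable body) {
            this.id = id;
            this.body = body;
        }
    }

    private final Queue<Uow> pending = new ArrayDeque<>();
    private final List<String> replies = new ArrayList<>(); // stand-in for replies to orchestration
    private final BooleanSupplier rollback; // returns true if a consistent state was restored
    private State state = State.AVAILABLE;

    FailFastResource(BooleanSupplier rollback) {
        this.rollback = rollback;
    }

    void enqueue(Uow uow) {
        if (state == State.FAILED) {
            // Assumption: a failed resource rejects new work immediately.
            replies.add(uow.id + ":FAILED(resource unavailable)");
            return;
        }
        pending.add(uow);
    }

    /** Runs pending work in order, applying the two failure rules. */
    void processPending() {
        Uow uow;
        while ((uow = pending.poll()) != null) {
            try {
                uow.body.run();
                replies.add(uow.id + ":OK");
            } catch (RuntimeException e) {
                // Rule 1: always reply with the process failure.
                replies.add(uow.id + ":FAILED(" + e.getMessage() + ")");
                if (!rollback.getAsBoolean()) {
                    // Rule 2: no consistent state, so fail the resource and drain the queue.
                    state = State.FAILED;
                    Uow skipped;
                    while ((skipped = pending.poll()) != null) {
                        replies.add(skipped.id + ":FAILED(resource failed)");
                    }
                }
            }
        }
    }

    State state() {
        return state;
    }

    List<String> replies() {
        return replies;
    }
}
```

The orchestration layer would read the replies and choose a recovery strategy; this sketch deliberately leaves that decision out of the resource.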

41. Preventing A Logjam • Bounded Queues • Request and Message Timeouts • A failure to enqueue a request or a request timeout triggers the resource’s circuit breaker
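One way to combine a bounded queue, an enqueue timeout, and a circuit breaker is sketched below using `java.util.concurrent.ArrayBlockingQueue`. The class name `GuardedQueue` and its methods are hypothetical, and the breaker is deliberately simple (a flag with a manual reset); a production breaker would typically add half-open probing and automatic recovery.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-resource request queue guarded by a simple circuit breaker. */
final class GuardedQueue<T> {
    private final BlockingQueue<T> queue;
    private final long offerTimeoutMillis;
    private volatile boolean open = false; // an open breaker rejects all requests

    GuardedQueue(int capacity, long offerTimeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity); // bounded queue
        this.offerTimeoutMillis = offerTimeoutMillis;
    }

    /** Returns true if enqueued; a failure to enqueue trips the breaker. */
    boolean submit(T request) {
        if (open) {
            return false; // fail fast while the breaker is open
        }
        try {
            if (queue.offer(request, offerTimeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // treated as a failure to enqueue
        }
        open = true;
        return false;
    }

    /** Called by the caller when a request it enqueued timed out. */
    void recordRequestTimeout() {
        open = true;
    }

    /** Consumer side: returns the next request, or null if the queue is empty. */
    T poll() {
        return queue.poll();
    }

    boolean isOpen() {
        return open;
    }

    /** Closes the breaker again, e.g. after an operator confirms the resource is healthy. */
    void reset() {
        open = false;
    }
}
```

With a slow consumer the queue fills, the next `submit` times out, and every later request is rejected immediately instead of piling up behind the stalled resource.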

42. How could we implement this model?

43. Lightweight Threads: threads that are not scheduled by the operating system, which avoids context switch overhead.

44. Actor Model • An actor represents state and behavior • Communicate by message passing • Each actor is allocated a lightweight thread and mailbox • Location independent
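The actor properties above can be illustrated without any framework. In this sketch the hypothetical `HostActor` owns its state (free capacity), receives `Reserve` messages through a mailbox, and processes them one at a time. A regular JVM daemon thread stands in for the lightweight thread an actor framework would provide, so this shows the programming model, not the scalability.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical resource actor: private state, a mailbox, and one consumer thread. */
final class HostActor {
    /** Message: reserve capacity and report the outcome through the reply future. */
    static final class Reserve {
        final int amount;
        final CompletableFuture<Boolean> reply = new CompletableFuture<>();

        Reserve(int amount) {
            this.amount = amount;
        }
    }

    private final BlockingQueue<Reserve> mailbox = new LinkedBlockingQueue<>();
    private final Thread thread;
    private int freeCapacity; // touched only by the actor thread, so no lock is needed

    HostActor(int capacity) {
        this.freeCapacity = capacity;
        this.thread = new Thread(this::run, "host-actor");
        this.thread.setDaemon(true); // stand-in for a lightweight thread
        this.thread.start();
    }

    /** Message passing is the only way to interact with the actor. */
    CompletableFuture<Boolean> tell(Reserve message) {
        mailbox.add(message);
        return message.reply;
    }

    private void run() {
        try {
            while (true) {
                Reserve message = mailbox.take(); // one message at a time, in arrival order
                boolean granted = message.amount <= freeCapacity;
                if (granted) {
                    freeCapacity -= message.amount; // never overcommits the resource
                }
                message.reply.complete(granted);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop() was called
        }
    }

    void stop() {
        thread.interrupt();
    }
}
```

Since only the actor thread reads or writes `freeCapacity`, requests are serialized by the mailbox rather than by locks, which is how the model can prevent overcommitment while staying lock free.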

45. [Diagram labels] Orchestration, Unit of Work, Mailbox, Resource Actor, FSM

48. Java Actor Frameworks • Akka (http://akka.io) • Quasar (https://github.com/puniverse/quasar)

49. Summary • Orchestration and Resource Management must be properly divided to satisfy CAP • To provide resource serialization guarantees, assign a queue and a process to each resource • Fail fast, fail loudly • An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource

50. Thoughts? Questions?

51. Thank you! Slides available @ http://speakerdeck.com/jburwell
